alan little’s weblog

lazy blogging

15th December 2004

A story that turns out to be primarily about superlative customer support for a piece of open source software, with big thanks to Roberto de Almeida, the maintainer of pyTextile.

And what does pyTextile do? It takes me one step further in my quest to have home-grown blogging software that fits my needs & wishes like a glove. At the moment AYAWT does everything I want it to with regard to building nicely (?) formatted pages out of a pile of raw entries, publishing them to my hosting service and to a backup, and notifying technorati and various other people. The most laborious bit now – and therefore the next thing to be Simplified – is wrapping my text up in various required bits of administrative xml before I feed it into the AYAWT xml mill. It’s clearly possible to automate this so that I can just feed it plain text with links.

I had a look at various ways I might do this:

  1. Laboriously write my own. Nah. I’m not that stupid.
  2. HTMLTidy. HTMLTidy is intended for cleaning up bad html markup, such as what most Microsoft tools produce. It isn’t designed for converting flat text files to html. But in fact, I discovered that if you feed it a flat text file it does a pretty good job. It has just one fatal flaw: it doesn’t convert a blank line to a paragraph break, so everything ends up in one paragraph. I did a fair amount of option fiddling without finding a way of changing this.
  3. pyTextile. pyTextile is a python port of a well-known perl tool that is expressly designed for producing html from flat text files. It seems to do it rather well, but again when I tested it I found one fatal flaw: it converts non-ascii characters to numeric codes, ignoring the fact that my files are all nice modern unicode and not bizarre ascii relics of the last century. Furthermore, I would have to wrap this up in an xhtml namespace declaration in order to get it through the rest of my process. I am exceedingly reluctant to even dip a toe into the xml namespace swamp for something that I’m doing on my own time for [some definition of] fun.
  4. reStructuredText. ReST is another “simplified” markup language. One look at the how-to page makes it clear that this is not something I’m going to learn in half an hour on the train.
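
To make the numeric-code complaint in option 3 concrete, here is a small sketch. It uses only the Python standard library, not pyTextile's own code, and assumes a modern Python 3 interpreter; the sample string is made up. It contrasts escaping a non-ascii character as a numeric entity with simply encoding it as utf-8.

```python
# Illustration only: standard library, not pyTextile itself.
sample = "caf\u00e9"  # made-up sample text: "café", one non-ascii character

# Numeric character references: roughly what an ascii-minded converter emits.
as_entities = sample.encode("ascii", "xmlcharrefreplace").decode("ascii")

# Plain utf-8 bytes: the character survives as itself, no entity needed.
as_utf8 = sample.encode("utf-8")

print(as_entities)  # caf&#233;
print(as_utf8)      # b'caf\xc3\xa9'
```

Both forms display the same in a browser, provided the page declares its encoding; the difference is whether the source stays readable as text.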

I could mess around putting manual paragraph breaks in before or after using HTMLTidy, but that would be putting one foot on the slippery slope to writing my own. Nah. I’m not that stupid. pyTextile seemed like the least flawed option, particularly when I discovered that the perl version apparently has a way to turn the numeric encoding off.
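
For a sense of how small that first step onto the slope would be, here is a rough sketch of turning blank-line-separated text into paragraphs. It is Python 3 and standard library only. The function name `paragraphs_to_html` is invented for this illustration, and it handles none of the links or inline formatting that a real tool such as pyTextile takes care of.

```python
import html
import re


def paragraphs_to_html(text):
    """Wrap each blank-line-separated chunk of text in <p> tags."""
    # Split wherever one or more blank (or whitespace-only) lines occur.
    chunks = re.split(r"\n\s*\n", text.strip())
    # Escape &, < and > so the plain text cannot break the markup.
    return "\n\n".join(
        "<p>%s</p>" % html.escape(chunk.strip(), quote=False)
        for chunk in chunks
        if chunk.strip()
    )


print(paragraphs_to_html("first paragraph\nstill the first\n\nsecond & last"))
```

The slope starts immediately afterwards: links, emphasis, lists and quoting all want the same treatment, which is exactly the work an existing tool already does.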

A bit of searching revealed that pyTextile was originally produced by Mark Pilgrim but is now maintained by Roberto de Almeida. There doesn’t seem to be any kind of support forum or mailing list, so I posted a comment on the relevant entry in Roberto’s development blog, without much hope because the blog entry dated from July and it was now the middle of November. But Roberto replied the next day, saying that switching off the numeric encoding for unicode files was already on his to-do list and that he would let me have a pre-release version including it within a week. Which he did. I did a quick test and it didn’t seem to be working: I was still getting numeric entities. I didn’t have time to look at it in more detail for a couple of weeks due to the Onrush Of Christmas and Life With A Toddler; but when I did, and sent some sample code and files to Roberto, he replied the same day pointing out what I was doing wrong and providing sample code that does work, like this:

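import textile

# read the raw utf-8 text
input = open('input.txt').read()
# declare both input and output as utf-8 to avoid the numeric entities
html = textile.textile(input, input_encoding='utf-8', output_encoding='utf-8')
out = open('output.html', 'w')
out.write('<meta http-equiv="Content-Type" content="text/html;charset=utf-8" >\n\n')
# write the converted html after the meta tag, then close the file
out.write(html)
out.close()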

This requires a pre-release version of pyTextile, which is up on Roberto’s site. As it’s a pre-release version, don’t be surprised if there are bugs.

(This posting not actually produced using pyTextile, although others will be).


all text and images © 2003–2008