Trying to parse human-readable data

It was just good luck that the first site in the search results had such clean, easy-to-parse HTML. All there was to find was the start time, stop time, title and maybe a description, delimited with <b> and similar tags. But you are completely at the mercy of the other end - a slight format change can break your program.

A handy technique is to tokenize the HTML page according to its appearance rather than its markup. You end up with a flat list of text tokens, each carrying flags for bold, italic, maybe font size and colour and so on. This gives better results than trying to follow the HTML structure, because hardly anybody publishes well-structured HTML. Then create a lookup table mapping each font style to the semantic element it stands for.
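As a rough illustration of the idea, here is a minimal sketch in Python using the standard html.parser module. It is not the code the original grabber used. The class name AppearanceTokenizer, the helpers tokenize and label, the sample listing line, and above all the STYLE_TO_ELEMENT table (bold means title, italic means description, plain means time) are assumptions made up for this example; a real site needs its own table, and probably font size and colour flags as well.

```python
from html.parser import HTMLParser

# Hypothetical lookup table: (bold, italic) -> semantic element.
# These meanings are assumed for illustration only.
STYLE_TO_ELEMENT = {
    (True, False): "title",
    (False, True): "description",
    (False, False): "time",
}


class AppearanceTokenizer(HTMLParser):
    """Flatten HTML into (text, bold, italic) tokens, ignoring structure."""

    BOLD_TAGS = {"b", "strong"}
    ITALIC_TAGS = {"i", "em"}

    def __init__(self):
        super().__init__()
        self.bold = 0      # nesting depth of bold tags
        self.italic = 0    # nesting depth of italic tags
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BOLD_TAGS:
            self.bold += 1
        elif tag in self.ITALIC_TAGS:
            self.italic += 1

    def handle_endtag(self, tag):
        # Tolerate stray closing tags: never let a depth go negative.
        if tag in self.BOLD_TAGS:
            self.bold = max(0, self.bold - 1)
        elif tag in self.ITALIC_TAGS:
            self.italic = max(0, self.italic - 1)

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if text:
            self.tokens.append((text, self.bold > 0, self.italic > 0))


def tokenize(html):
    """Return a list of (text, bold, italic) tokens for an HTML string."""
    parser = AppearanceTokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


def label(tokens):
    """Attach a semantic label to each token via the lookup table."""
    return [
        (STYLE_TO_ELEMENT.get((bold, italic), "unknown"), text)
        for text, bold, italic in tokens
    ]


# Made-up listing line, not taken from any real site.
sample = "<p>18:00 <b>News</b> <i>Headlines and weather</i></p>"
print(label(tokenize(sample)))
```

Because the tokenizer only tracks how the text looks, it does not care whether the page wraps each programme in a table row, a paragraph or nothing at all. When the site changes its appearance you still have to update the lookup table, but the rest of the parser can stay as it is.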

Sure enough, the site did change completely and a hurried rewrite was necessary - but more on that later.

Next: the first release
Edward Avis
Last modified: Thu Mar 14 11:43:45 GMT 2002