A machine-readable format
- Want to do at least three things with TV listings: download
them, filter them and print them.
- Need a common machine-readable format.
Why XML?
- Because it's kewl
- Because I don't have to write a parser
- DTD unambiguously defines the file format
- Validation using nsgmls
- So a good base for exchanging code with others.
I looked for established formats for TV listings, but none was
appropriate. They all had too few features or too much cruft.
- Tend to be one channel only, but for the viewer channel is
almost irrelevant and impedes sorting.
- Correspond too closely to weird internal database formats
(e.g. Press Association) or to one particular output format (the
German channel SAT.1 distributes listings with the attribute 'von'
to say whether to write 'Wiederholung von' or 'Wiederholung vom').
- Inch-thick standards full of digital TV classifications and
few concrete examples.
So I decided to make my own format. A simple example is on the
front of the XMLTV web page. The aims for the format were:
- Represent listings from the viewer's perspective, not the
broadcaster's.
- Include information you'd find in a printed TV guide.
- Include information needed for particular applications, as
they appear.
- Include all information fetched from the Ananova site (no
details thrown away).
- Simple in the simple case. This means making almost
everything optional.
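As a rough illustration of "simple in the simple case", here is a minimal sketch in Python. It assumes the element and attribute names (tv, programme, title, start, channel) match the published XMLTV DTD; the channel id and the programme itself are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal listing: one programme carrying only a start time,
# a channel and a title. Names are assumed to follow the XMLTV DTD;
# the channel id and the title are invented for illustration.
MINIMAL = """\
<tv>
  <programme start="20080823190000 +0100" channel="one.example.com">
    <title lang="en">News</title>
  </programme>
</tv>
"""

def summarize(xml_text):
    """Return one (start, channel, title) tuple per programme."""
    root = ET.fromstring(xml_text)
    rows = []
    for prog in root.findall("programme"):
        # Broadcast details are attributes; the title is element content.
        rows.append((prog.get("start"), prog.get("channel"),
                     prog.findtext("title")))
    return rows

print(summarize(MINIMAL))
```

Because everything else is optional, a grabber that knows only the time, channel and title can still produce a usable file, and richer sources simply add more elements.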
The present format has many more features than the very first
version, notably the storage of channels and channel names
following RFC 2838. It is about halfway evolved to the final
version; there are still things to tidy and generalize.
DTD-writing principles:
- Data should go in elements and metadata in attributes. For XMLTV
the metadata is broadcast details: a programme is 'the same' in
some way irrespective of when it is broadcast.
- Tried not to have multiple forms of the same information, so
made ordering significant. For example, when more than one title
is given, the first title is taken to be the more important one.
Programmes are listed in the order in which they would be printed.
- All human-readable data has a language given, so mixed-language
listings are no problem.
- Don't use ID/IDREF for channels because the weird RFC 2838
names are optional.
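The principles above can be sketched from the reader's side. This is only an illustration in Python, assuming element names as in the published XMLTV DTD (channel, display-name, programme, title, and the lang attribute); the channel id and titles are invented, and the helper functions are hypothetical, not part of any XMLTV tool.

```python
import xml.etree.ElementTree as ET

# Invented sample: broadcast details (start, channel) are attributes,
# human-readable data is element content tagged with a language.
SAMPLE = """\
<tv>
  <channel id="sat1.example.de">
    <display-name lang="de">SAT.1</display-name>
  </channel>
  <programme start="20080823201500 +0200" channel="sat1.example.de">
    <title lang="de">Beispielsendung</title>
    <title lang="en">Example Show</title>
  </programme>
</tv>
"""

def primary_title(programme):
    """Ordering is significant: the first title is the main one."""
    first = programme.find("title")  # first match in document order
    return first.text, first.get("lang")

def titles_by_language(programme):
    """Every title carries its language, so mixed-language data is fine."""
    return {t.get("lang"): t.text for t in programme.findall("title")}

def channel_name(root, programme):
    """Look the channel up by plain string comparison, not ID/IDREF."""
    wanted = programme.get("channel")
    for ch in root.findall("channel"):
        if ch.get("id") == wanted:
            return ch.findtext("display-name")
    return wanted  # no channel element: fall back to the raw id

root = ET.fromstring(SAMPLE)
prog = root.find("programme")
print(primary_title(prog), titles_by_language(prog), channel_name(root, prog))
```

Since the channel attribute is an ordinary string rather than an IDREF, a file that omits the channel elements altogether is still usable, and the lookup just falls back to the id.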
Next: how to grab data from a website
Edward Avis
Last modified: Sat Aug 23 17:21:34 BST 2008