A totally unscientific analysis of words appearing in British TV programme titles

Introduction

If you use XMLTV's pick_cgi script to manage your TV viewing, over a few weeks you will build up a 'tvprefs' file listing your preferences. This file lists programme titles together with information about whether you wish to view them: the four choices are yes, no, always, and never. Because this file will probably contain an entry for every programme on television during the time you were using pick_cgi, it's possible to mine it for information about which keywords occur in the titles of programmes you are likely to watch, or not watch. It is also possible to ignore the preferences, look only at the titles, and see which words crop up most often.

Method used

To do this, I wrote analyse_tvprefs, included in the xmltv source distribution, which will grind over the preferences file and work out scores for different words. Every time a word appears in the title of a programme marked as yes, it gets a positive score of one point; every time it appears in a title marked as no, it gets a negative score of one point. The two scores are kept separate for the time being - they don't cancel out. The choices always and never give positive two points and negative two points respectively.
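The tallying step could be sketched roughly as follows. This is an illustration in Python, not the code of analyse_tvprefs itself: the function names, the (title, preference) input pairs, and the rule for splitting titles into words are all assumptions made for the example.

```python
import re
from collections import Counter

# Points per preference, as described above.
POINTS = {"yes": 1, "always": 2, "no": 1, "never": 2}
POSITIVE = {"yes", "always"}

def tokenize(title):
    """Split a title into lower-case words (an assumed tokenisation rule)."""
    return re.findall(r"[a-z0-9]+", title.lower())

def tally(prefs):
    """Return separate positive and negative point totals for each word."""
    pos, neg = Counter(), Counter()
    for title, pref in prefs:
        target = pos if pref in POSITIVE else neg
        for word in tokenize(title):
            target[word] += POINTS[pref]
    return pos, neg

# Hypothetical preferences, invented for this example.
prefs = [
    ("Star Trek", "always"),
    ("Star Search", "no"),
    ("Grand Prix Live", "never"),
]
pos, neg = tally(prefs)
print(pos["star"], neg["star"], neg["grand"])
```

Here 'star' collects two positive points and one negative point, and the two totals stay in separate counters until the normalization step.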

Typically, the negative scores will be much higher than the positive ones, unless you are a real addict and have flagged a large number of programmes as yes or always. We normalize by counting the total negative points allocated and the total positive points, and multiplying each word's positive score by a constant factor (total negative points divided by total positive points) so that the two totals end up the same. Once we've done that, the positive and negative scores will be on the same sort of scale, so we can 'add' them (positive score minus negative score) to get an overall score for each word which could be positive or negative (or maybe zero).
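As a rough sketch of that normalization, again in Python with hypothetical names and invented numbers rather than the script's real code:

```python
def combine(pos, neg):
    """Scale positive points so both totals match, then subtract negatives."""
    total_pos = sum(pos.values())
    total_neg = sum(neg.values())
    # Assumption: with no positive points at all, nothing is scaled up.
    factor = total_neg / total_pos if total_pos else 0.0
    words = set(pos) | set(neg)
    return {w: pos.get(w, 0) * factor - neg.get(w, 0) for w in words}

# Invented tallies: 4 positive points in total, 8 negative points in total.
pos = {"star": 2, "trek": 2}
neg = {"star": 1, "grand": 2, "prix": 2, "live": 2, "search": 1}
scores = combine(pos, neg)
print(scores["star"], scores["trek"], scores["grand"])
```

With these numbers the factor is 2, so 'star' ends up at 2 x 2 - 1 = 3. A side effect of making the totals equal is that the overall scores always sum to zero.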

Finally, some words are naturally more common in English (or whatever language your TV is in), and we should take account of that. Otherwise words like 'the' would get big negative scores, drowning out more interesting but less frequent words. We want to find the most commonly occurring words in programme titles compared to English as a whole. So the score for each word is divided by that word's frequency in English (from the top 3000 words in the British National Corpus).
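A minimal Python sketch of the frequency adjustment follows. The frequencies are made up for illustration and are not real British National Corpus figures. How words missing from the top-3000 list are treated is not described above, so the default_freq fallback is purely an assumption.

```python
def weight_by_rarity(scores, english_freq, default_freq):
    """Divide each word's score by its relative frequency in English."""
    return {
        word: score / english_freq.get(word, default_freq)
        for word, score in scores.items()
    }

# Made-up numbers for illustration only, not real corpus figures.
scores = {"the": -50.0, "lottery": -5.0}
english_freq = {"the": 0.06}
weighted = weight_by_rarity(scores, english_freq, default_freq=0.0001)
lowest = min(weighted, key=weighted.get)
print(lowest)
```

Although 'the' starts with a raw score ten times more negative than 'lottery', dividing by frequency pushes the rarer word to the bottom of the ranking.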

Then it's just a case of outputting each word and its score, sorted by score. As well as finding out your own preferences, you might want to see what words most commonly occur in programme titles (whether you watch the show or not). To do this, we just give every word in every title listed a score of +1 for each occurrence, then divide by the total number of points and divide by English word frequency as before.
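The preference-free count could look something like this in Python. The names, the tokenisation, the fallback frequency for unlisted words, and the 'word: score' output layout are all assumptions for the sake of the example, and the frequencies are again invented.

```python
import re
from collections import Counter

def title_word_scores(titles, english_freq, default_freq):
    """Score each word by its share of all title words, relative to English."""
    counts = Counter(
        word
        for title in titles
        for word in re.findall(r"[a-z0-9]+", title.lower())
    )
    total = sum(counts.values())
    return {
        word: (n / total) / english_freq.get(word, default_freq)
        for word, n in counts.items()
    }

# Invented titles and frequencies, for illustration only.
titles = ["The News", "The Weather", "News at Ten"]
english_freq = {"the": 0.06, "at": 0.005}
scores = title_word_scores(titles, english_freq, default_freq=0.0001)

# Sort ascending by score, so the most distinctive title words come last.
ranked = sorted(scores, key=scores.get)
for word in ranked:
    print(f"{word}: {scores[word]:.1f}")
```

In this toy data 'the' and 'news' each appear twice, but 'the' sinks to the start of the list because it is so common in ordinary English.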

Usage of analyse_tvprefs program

Assuming you have downloaded and started using XMLTV, go into the analyse_tvprefs/ subdirectory and pipe your tvprefs file into analyse_tvprefs. It will produce output sorted by ascending score - so words you dislike strongly appear at the top, those you like a lot appear at the bottom. To find the commonest words in titles, give the --noprefs option; words will be printed in ascending order of frequency-relative-to-standard-English.

Results

Here are the results from analysing my tvprefs file. This comes from, oh, nearly a year of using XMLTV and pick_cgi to decide what to watch on the five UK terrestrial channels. But I have occasionally downloaded and sifted through digital satellite and radio channels, so their programmes will make up a non-tiny proportion of the file.

Low-scoring words

First, the words I don't favour: those tending to appear in titles of shows I don't watch. This shows the 100 lowest-ranked words, worst first.

./analyse_tvprefs <../tvprefs | head -100 | perl -pe 's/:.*//' | tr "\n" ' '
u pg racing 2001 grand fm prix awards wild itv f1 bbc movie gallery football live 4music magic update murder sport weather cricket concert dance breakfast celebrity sports don rock championship prince golf x hollywood premiere hell lw eureka lottery meridian zone la die story glory jazz wildlife dream bear stories ice meets planet escape sunday match journey search deadly express killer paradise preview qualifying rescue tennis heaven secret music beach guns headlines kid mix jack treasure holiday weekend affair bond devil diary incredible legends newmarket serial animal 100 mountain bob jim touch 4play bloomberg charlie evil hearts hunt ka

Note that the appearance of a word here doesn't necessarily mean I have a real dislike for it. For example, I have nothing against U-rated or PG-rated movies. It just happens that I've entered a lot of films as never, the word 'U' or 'PG' appears in their titles, and most importantly these words are not common in everyday English. On the other hand, it is entirely accurate to say that I don't watch many sporting, wildlife, or lottery shows.

High-scoring words

./analyse_tvprefs <../tvprefs | tail -100 | tac | perl -pe 's/:.*//' | tr "\n" ' '
enfield harry s mcgowan alistair fast ll hill star futurama trek impression queen totp simpsons dwarf allo gerry brass anderson guide mad heroes lads ain popstars gentlemen creek animation unforgettable joe alright adam love williams comedy t4 xena towers spaced silvers sf seinfeld reply porridge pmqs plain manga inch impressions homer hiker grimleys frasier festive fawlty erotica clangers bagpuss marian maid homage galaxy crusade chums eye taxi shakespeare phil hitch fabulous compilation tales warrior francisco animated behaving tokyo ted merry guy 1999 television robbie spy princess clock iv san reading gun show naked jonathan presents cat badly bedtime streets

Here I've taken the bottom 100 lines and reversed the order, so that the highest-rated words appear first. It turns out that most of these words each come from the title of one particular programme - see how many you can pick out!

Words commonly occurring in titles

./analyse_tvprefs --noprefs <../tvprefs | tail -100 | tac | perl -pe 's/:.*//' | tr "\n" ' '
s 4later u pg racing 2000 itv movie grand 2001 fm prix awards wild f1 bbc shooting secret t4 gallery news football adventures classic live star update tales magic 4music murder sport weather cricket secrets concert christmas dance queen t breakfast story stars celebrity tv sports don guide rock ii prince championship golf music x love premiere lw hollywood special hell zone omnibus meridian lottery eureka bedtime 4 sex la heroes fame hill die harry wildlife presents jazz glory angel dream bear stories review ice planet meets escape sunday match journey search tennis rescue qualifying preview paradise naked kiss killer

Again, this takes the last 100 lines - the highest-scoring words - and reverses the order to put the highest first. It's often said that British TV is degenerating into an endless diet of sex, violence and celebrity makeover docusoap drivel, and these results seem to back up that point. It's uncanny how the list of words seems to read like one long programme title.

Future work

At present the only data available for analysis is the collection of titles accumulated by pick_cgi. There's no extra information about what times the programmes are on (which words are most common at what time of day?), or about which channels and which days of the week they appear on. Nor are there detailed descriptions (how frequently does 'hard-hitting' occur as part of a programme description?) or lists of people (are particular presenters associated with particular keywords?). One obvious failing is that the search for common words in programme titles takes no account of how many times a show is broadcast. If it did, I'd expect 'news' to be the highest-ranking word.

So most improvements depend on getting better preferences data. I plan to improve pick_cgi to store richer details about what you like and don't like - for example, to notice favourite actors and highlight other TV shows or films they star in. When I start accumulating a rich audit trail of all television for the past year or so (which could seriously eat disk space), it will be possible to do a more advanced analysis.


Edward Avis
Last modified: Mon Dec 2 21:03:15 GMT 2002