If you use XMLTV's pick_cgi script to manage your TV viewing, over a few weeks you will build up a 'tvprefs' file listing your preferences. This file lists programme titles together with whether you wish to view them: the four choices are yes, no, always, and never. Because the file will probably contain an entry for every programme on television during the time you were using pick_cgi, you can mine it for keywords that occur in titles of programmes you are likely to watch, or not watch. You can also simply look at the titles and see which words crop up most often.
To do this, I wrote analyse_tvprefs, included in the xmltv source distribution, which will grind over the preferences file and work out scores for different words. Every time a word appears in the title of a programme marked as yes, it gets a positive score of one point, and no gives a negative score of one point (which is kept separate for the time being - they don't cancel out). always and never give positive two points and negative two points respectively.
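The scoring rule above can be sketched in a few lines of Python. This is only an illustration, not the code from analyse_tvprefs itself: the `tally` and `title_words` names, the (title, choice) input pairs, and the way titles are split into words are all assumptions made for the example.

```python
import re
from collections import defaultdict

# (positive points, negative points) for each choice, as described in the text.
POINTS = {"yes": (1, 0), "always": (2, 0), "no": (0, 1), "never": (0, 2)}


def title_words(title):
    # Assumption: a word is a lowercased run of letters, digits or apostrophes.
    return re.findall(r"[a-z0-9']+", title.lower())


def tally(prefs):
    """Take (title, choice) pairs; return separate positive and negative totals per word."""
    pos = defaultdict(int)
    neg = defaultdict(int)
    for title, choice in prefs:
        p, n = POINTS[choice]
        for word in title_words(title):
            # The two totals are kept apart; they do not cancel out at this stage.
            pos[word] += p
            neg[word] += n
    return dict(pos), dict(neg)


# Hypothetical preferences, for illustration only.
sample = [
    ("The Simpsons", "always"),
    ("Match of the Day", "never"),
    ("The Bill", "no"),
    ("Star Trek", "yes"),
]
pos, neg = tally(sample)
print(pos["the"], neg["the"])
```

Keeping two dictionaries rather than one signed total is what lets the positive side be rescaled before the two are combined.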
Typically, the negative scores will be much higher than the positive ones, unless you are a real addict and have flagged a large number of programmes as yes or always. We normalize by counting the total negative points allocated and the total positive points, and multiplying each word's positive score by a constant factor so that the two totals end up the same. Once we've done that, the positive and negative scores will be on the same sort of scale, so we can 'add' them (positive score minus negative score) to get an overall score for each word, which could be positive or negative (or maybe zero).
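As a rough sketch of that normalization, assuming the positive and negative totals are held in two dictionaries keyed by word (the `combine` function and the sample numbers are made up for illustration):

```python
def combine(pos, neg):
    """Scale positive scores so both totals match, then subtract negative from positive."""
    total_pos = sum(pos.values())
    total_neg = sum(neg.values())
    # Constant factor that brings the positive total up (or down) to the negative total.
    # Assumption: with no positive points at all, positive scores simply stay at zero.
    factor = total_neg / total_pos if total_pos else 0.0
    words = set(pos) | set(neg)
    return {w: pos.get(w, 0) * factor - neg.get(w, 0) for w in words}


# Hypothetical totals: 6 positive points against 10 negative points.
pos = {"the": 2, "simpsons": 2, "star": 1, "trek": 1}
neg = {"the": 3, "match": 2, "of": 2, "day": 2, "bill": 1}
scores = combine(pos, neg)
print(scores["simpsons"], scores["match"])
```

One consequence of scaling this way is that the overall scores sum to zero across all words, so a word's sign tells you which side of your average taste it falls on.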
Finally, some words are naturally more common in English (or whatever language your TV is in), and we should take account of that. Otherwise words like 'the' would get big negative scores, drowning out more interesting but less frequent words. We want to find the most commonly occurring words in programme titles compared to English as a whole. So the score for each word is divided by that word's frequency in English (from the top 3000 words in the British National Corpus).
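Here is a small sketch of the frequency adjustment. The frequencies below are invented for the example and are not real British National Corpus figures. The text does not say what happens to a word that is missing from the top-3000 list, so treating it as being as rare as the rarest listed word is purely an assumption of this sketch.

```python
def adjust_for_frequency(scores, english_freq):
    """Divide each word's score by its frequency in ordinary English."""
    # Assumption: unlisted words are treated like the rarest word in the table.
    floor = min(english_freq.values())
    return {w: s / english_freq.get(w, floor) for w, s in scores.items()}


# Made-up relative frequencies, NOT real BNC data.
english_freq = {"the": 0.06, "of": 0.03, "day": 0.0006}
# Hypothetical overall scores from the previous step.
scores = {"the": -3.0, "of": -2.0, "day": -2.0, "lottery": -2.0}

adjusted = adjust_for_frequency(scores, english_freq)

# Print in ascending order, most disliked first.
for word, score in sorted(adjusted.items(), key=lambda kv: kv[1]):
    print(f"{word}: {score:.1f}")
```

Even though 'the' has the largest raw negative score here, it ends up nearest to zero once its high everyday frequency is divided out, which is the effect described above.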
Then it's just a case of outputting each word and its score, sorted by score.
As well as finding out your own preferences, you might want to see what words most commonly occur in programme titles (whether you watch the show or not). To do this, we just give every word in every title listed a score of +1 for each occurrence, then divide by the total number of points and divide by English word frequency as before.
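The title-only count can be sketched the same way. Again this is an illustration under assumptions rather than the real implementation: the function name, the word-splitting rule, the handling of words missing from the frequency table, and the sample frequencies are all invented for the example.

```python
import re
from collections import Counter


def title_word_frequencies(titles, english_freq):
    """Score each word by its share of all title words, relative to everyday English."""
    # +1 for every occurrence of every word in every title.
    counts = Counter(
        word for title in titles for word in re.findall(r"[a-z0-9']+", title.lower())
    )
    total = sum(counts.values())
    # Assumption: unlisted words are treated like the rarest word in the table.
    floor = min(english_freq.values())
    return {w: (c / total) / english_freq.get(w, floor) for w, c in counts.items()}


# Hypothetical titles and made-up frequencies (not real BNC data).
titles = ["The Bill", "The Simpsons", "Celebrity Makeover"]
english_freq = {"the": 0.06, "bill": 0.0001}

relative = title_word_frequencies(titles, english_freq)
for word, score in sorted(relative.items(), key=lambda kv: kv[1]):
    print(f"{word}: {score:.1f}")
```

A score above 1 means the word turns up in titles more often than it does in ordinary English, under these assumed frequencies.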
Assuming you have downloaded and started using XMLTV, go into
the analyse_tvprefs/ subdirectory and pipe your tvprefs file
into analyse_tvprefs. It will produce output
sorted by ascending score - so words you dislike strongly appear
at the top, those you like a lot appear at the bottom. To find
the commonest words in titles, give the --noprefs
option; words will be printed in ascending order of
frequency-relative-to-standard-English.
Here are the results from analysing my tvprefs file. This comes from, oh, nearly a year of using XMLTV and pick_cgi to decide what to watch on the five UK terrestrial channels. But I have occasionally downloaded and sifted through digital satellite and radio channels, so their programmes will make up a non-tiny proportion of the file.
First find the words I don't favour, those tending to appear in titles of shows I don't watch. Show the 100 lowest-ranked words, worst first.
% ./analyse_tvprefs <../tvprefs | head -100 | perl -pe 's/:.*//' | tr "\n" ' '
Note that the appearance of a word here doesn't necessarily mean I have a real dislike for it. For example, I have nothing against U-rated or PG-rated movies. It just happens that I've entered a lot of films as never, the word 'U' or 'PG' appears in the title, and most importantly these words are not common in everyday English. OTOH, it is entirely accurate to say that I don't watch many sporting, wildlife, or lottery shows.
Here I've taken the bottom 100 lines and reversed the order, so that the highest-rated words appear first. It turns out that most of these words correspond to one particular programme - see how many you can pick out!
Again, this takes the last 100 lines - the highest scoring words - and reverses the order to put the highest first. It's often said that British TV is degenerating into an endless diet of sex, violence and celebrity makeover docusoap drivel, and these results seem to back up that point. It's uncanny how the list of words seems to read like one long programme title.
At present the only data available for analysis is the collection of titles accumulated by pick_cgi. There's no extra information about what times the programmes are on (which words are most common at what time of day?), which channels and which days of the week. Also no detailed contents (how frequently does 'hard-hitting' occur as part of a programme description?) or list of people (are particular presenters associated with particular keywords?). One obvious failing is that the search for common words in programme titles takes no account of how many times a show is broadcast. If it did, I'd expect 'news' to be the highest-ranking word.
So most improvements depend on getting better preferences data. I plan to improve pick_cgi to store richer details about what you like and don't like - for example, to notice favourite actors and highlight other TV shows or films they star in. When I start accumulating a rich audit trail of all television for the past year or so (which could seriously eat disk space), it will be possible to do a more advanced analysis.