Extracting information from English text
- Natural language parsing is really easy
- You just use a dozen regexps and fiddle with them until it
appears to work
- Assuming you don't mind some mistakes, of course.
There are three manipulations we perform on programmes:
splitting multipart programmes into constituents with the
'clumpidx', filling in the programme category based on the
description, and parsing the description to find names of
actors.
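
The first two manipulations can be sketched roughly as below. This is an illustration in Python, not the code of tv_extractinfo_en: the keyword table, the "; " title separator and the function names are all assumptions made for the example. The 'clumpidx' values follow the "index/count" form used for programmes that share a timeslot.

```python
import re

# Hypothetical keyword table: the patterns used by tv_extractinfo_en are not
# given on this slide, so these are illustrative only.
CATEGORY_PATTERNS = [
    (re.compile(r"\b(?:sitcom|comedy)\b", re.IGNORECASE), "comedy"),
    (re.compile(r"\bdocumentary\b", re.IGNORECASE), "documentary"),
    (re.compile(r"\b(?:drama|thriller)\b", re.IGNORECASE), "drama"),
]


def guess_category(desc):
    """Return the first category whose pattern matches the description, or None."""
    for pattern, category in CATEGORY_PATTERNS:
        if pattern.search(desc):
            return category
    return None


def split_multipart(programme, sep="; "):
    """Split an 'A; B' title into parts sharing the timeslot, tagged 'index/count'.

    The separator is an assumption; real listings need more patterns than this.
    """
    titles = programme["title"].split(sep)
    count = len(titles)
    parts = []
    for index, title in enumerate(titles):
        # Copy the programme, overriding the title and adding the clumpidx.
        parts.append(dict(programme, title=title, clumpidx=f"{index}/{count}"))
    return parts


programme = {"title": "News; Weather", "start": "200203141800"}
for part in split_multipart(programme):
    print(part["title"], part["clumpidx"])
print(guess_category("Classic prison sitcom."))
```

A first-match-wins table keeps the sketch short; it also means the order of the patterns decides ties, which is one of the things you end up fiddling with.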
(At this point the speaker shall perform a haphazard and
probably broken demonstration by running tv_extractinfo_en on
some input and diffing the results.)
- Hog-butchering algorithm
- Emphasis on reducing false positives in matching
- Still many improvements to be made, but it's a tarpit
- Fun though
- Only works well with UK listings; North American listings
are too terse. Might add something for German listings if I
get time.
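
To give a flavour of what favouring few false positives means for the actor-name parsing, here is a small Python sketch. It is an assumption-laden illustration, not the tool's real rules: the 'starring' cue, the capitalised-words name pattern and the stoplist are all invented for the example.

```python
import re

# A name is two or more capitalised words. Deliberately strict: it misses
# names such as "McDonald" rather than risk matching ordinary prose.
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)+"

# Only trust names that follow an explicit cue word (hypothetical choice).
CUE = re.compile(r"\bstarring (" + NAME + r"(?:(?:, | and )" + NAME + r")*)")

# Hypothetical stoplist of capitalised phrases that are not people.
NOT_PEOPLE = {"New York", "Los Angeles", "World Cup"}


def find_actors(desc):
    """Return names listed after 'starring', or an empty list if unsure."""
    match = CUE.search(desc)
    if match is None:
        return []
    candidates = re.split(r", | and ", match.group(1))
    return [name for name in candidates if name not in NOT_PEOPLE]


print(find_actors("Sitcom starring Ronnie Barker and Richard Beckinsale."))
print(find_actors("Documentary about Big Ben and Tower Bridge."))
```

The second description contains capitalised phrases but no cue word, so nothing is returned: a missed actor is treated as cheaper than a landmark listed in the cast.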
Next: the complete set of filters
Edward Avis
Last modified: Thu Mar 14 11:45:04 GMT 2002