HPLabs Semantic Web Projects 2003

Andy Seaborne <andy.seaborne@hp.com>

Introduction: The Semantic Web

The Semantic Web is an important emerging area for the W3C.  It is an extension of the current Web in which information is given well-defined meaning, better enabling computers and people to work in cooperation.  The idea is to have data on the Web defined and linked in such a way that it can be used for more effective discovery, automation, integration, and reuse across various applications.  The Web can reach its full potential only if it becomes a place where data can be shared and processed by automated tools as well as by people.

RDF (Resource Description Framework) is a W3C standard for expressing data and metadata (see the article RDF and Metadata for more information).  The Semantic Web project in HPLabs has already released an open source toolkit for RDF, called Jena, which includes an API, a parser, a query system, several storage mechanisms and support for ontologies.  The project intends to continue development of this open source toolkit.

Placement applicants are not expected to have experience in RDF or the semantic web.

Hewlett-Packard Laboratories, Bristol

HP Labs is the industrial research laboratory for the whole of Hewlett-Packard.  It provides technological leadership to HP, inventing new technologies that change markets and create new business opportunities.  HP Labs Bristol is the largest HP research facility outside the US and is the main site in Europe.

The 2003 Student Placement Programme

The Semantic Web project student placement programme for 2003 has up to 5 places.  We are looking for students for both group projects and individual projects.

In group projects, it will be the team's responsibility to refine the problem and develop the work programme in conjunction with the semantic web research team; each group member will take specific areas of responsibility.  In individual projects, it is the individual's responsibility to define the work programme in conjunction with a member of the semantic web research team.

The list of projects illustrates the work areas; the exact selection of activities will be based on the specific skills and interests of the students accepted and will be chosen in conjunction with the successful applicants.

Students on the HPLB Semantic Web placement programme can expect to work on the next generation of WWW technologies and to gain experience of an upcoming technology area.

Process

The first step is to contact us for more information, without obligation, or to arrange to talk to us when we visit Imperial.  The timescale is:

Group Projects

Creating, managing and displaying RDF annotations and notes

When on a business trip, people take notes on meetings, conference papers, raw ideas, snippets of news and so on. These notes are captured as RDF. They are mostly semi-structured: the attendee list of a meeting is fairly structured, but the raw notes are largely free text.  Suppose all these notes are dumped into my personal information store.

Let's say I want to read the notes. I bring up my RDF Browser, which starts with some sort of query page. I give it a query, such as "find recent trip reports", and it shows me a prettily presented list. I click on the one I want and see a trip report laid out for a human reader. The presentation is excellent; it looks as if it might have been written by hand, but it is generated automatically from the RDF knowledge base. I can click on links, for example a meeting attendee, and up comes what I know about that person: other contacts I've had with him, who he works for, and so on. I can instruct or allow the browser to pull in more information from other sources as well.

Now imagine that instead of just my little personal knowledge base, there is a great big semantic web out there with lots of information. There will still be human users who want to browse that information. The RDF Browser will allow people to browse the semantic web, getting high quality presentations of that information. They will be able to select the presentation style they like. They will be able to select the depth of presentation they want.

To make this real requires tools for capturing notes in RDF, somewhere to store and manage the notes taken, and a way to display the information.  It also requires navigation and linking to other information.  A lot of information is available on the web, but people have to process it themselves even when a machine is quite capable of handling the routine work.  Making it accessible to the RDF Browser is a step towards getting machines to handle routine information tasks, like finding out where an airport is actually located (one data source) and how to get from there to a hotel (another information task).

Skills: Java, application building, HTML, and related display technologies.

An Online Technology Journal

Technology is moving fast.  The publishing process is slow - it takes 6 months or more to get an article published.  The result has been websites that serve as online journals or technical magazines and create communities of interest in specialist topics.  Some RDF technology already exists for content syndication (e.g. RSS, the Rich Site Summary; see http://syndic8.com/ for a syndication site), but this only extends to announcing material, not to tying back to the author who wants to get an article published.

We want to create a demonstration online journal that allows researchers and application developers to share their experiences by publishing articles, long and short, about their work.  The content would be accessible through browsing but also by subscribing to interest areas and being notified when a new, relevant article or news item appears.

Many of the sites doing this today are very labour intensive and fail because the effort to keep them up-to-date becomes too great.  We want to automate the online process as much as possible, using RDF to record the metadata for all content on the site and all content submitted to the site. Various applications would perform most of the mechanical processes needed to produce the web site and to send out notifications of new material.

This project will require a wide variety of skills in order to build a functioning web site that is easy to keep up-to-date with articles and news items submitted by authors.

Skills: some of Java, PHP, HTML, database applications, web site design.

Individual Projects

Prototype Rules Engine for Jena

Rules can be used to capture many of the simple information processing tasks that people perform on a routine basis.  Systems that perform these routine tasks can greatly help people handle information.  Rules are simple descriptions of tasks like "if I have a meeting at such-and-such a place, then I need to allocate time to get there" and have been shown to be good at capturing people's intuitions.

This project is to prototype a simple rules engine, based on an existing language parser, then build a simple demonstration application mapping between different information resources.  The project can then either develop the rule engine into something more sophisticated or develop the application further.
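As a rough illustration of the kind of rule quoted above, the sketch below applies one if-then rule to a small set of subject-predicate-object statements by simple forward chaining.  It is written in Python for brevity and does not use the Jena API; every name in it (the apply_rules function and the hasMeetingAt and needsTravelTimeTo properties) is a hypothetical choice made for this example, not part of the project.

```python
# Minimal forward-chaining sketch over (subject, predicate, object) triples.
# All function and property names are illustrative assumptions, not Jena APIs.


def match(pattern, triple, bindings):
    """Unify one pattern with one triple; return extended bindings or None."""
    result = dict(bindings)
    for p, t in zip(pattern, triple):
        if p.startswith("?"):            # variable term
            if result.get(p, t) != t:    # already bound to a different value
                return None
            result[p] = t
        elif p != t:                     # constant term must match exactly
            return None
    return result


def substitute(pattern, bindings):
    """Replace the variables in a pattern with their bound values."""
    return tuple(bindings.get(term, term) for term in pattern)


def apply_rules(facts, rules):
    """Apply (body, head) rules repeatedly until no new triple is derived."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            # Collect every set of bindings that satisfies all body patterns.
            candidates = [{}]
            for pattern in body:
                extended = []
                for bindings in candidates:
                    for fact in facts:
                        new = match(pattern, fact, bindings)
                        if new is not None:
                            extended.append(new)
                candidates = extended
            # Add the rule head for each satisfying set of bindings.
            for bindings in candidates:
                derived = substitute(head, bindings)
                if derived not in facts:
                    facts.add(derived)
                    changed = True
    return facts


# "If I have a meeting at a place, then I need to allocate time to get there."
rules = [
    ([("?who", "hasMeetingAt", "?place")],
     ("?who", "needsTravelTimeTo", "?place")),
]
facts = {("me", "hasMeetingAt", "Imperial")}
result = apply_rules(facts, rules)
```

Here result holds the original statement plus the derived one.  A real engine would also need a rule language and parser, which the project proposes to base on an existing one.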

Skills: Java programming, experience of Prolog.

TAP Integration

The semantic web, like the current web, is about publishing and consuming information.  One system that is focused on the issues of actually serving up RDF data is the TAP system.  We wish to integrate this into the Jena toolkit.  This project is to take the TAP system and provide both a way to work with TAP knowledge sources in the Jena RDF framework and then to build an example system.

Skills: Java programming, knowledge of HTTP an advantage.

RDF/XML Parsing

The leading RDF parser, ARP, is built on a novel co-parser architecture in which the XML parsing and the RDF parsing are in conceptually separate processes. This is realised in Java by using two threads, one for each parser. The two-threaded architecture is not fast, and it makes error handling difficult. This has presented specific problems to some of ARP's more ambitious users. The leading XML parser, Xerces, implements the industry standard SAX2 XML parsing interface and also provides its own pull parsing interface. This project uses the XML pull parser to permit an inverted implementation of the first of the two co-parsers. It should then be possible to have a single-threaded implementation.

Particular care will be needed to address the error handling requirements of the more demanding ARP users. If conducted successfully, the project would result in a new version of ARP which would continue to be the leading RDF implementation. It would probably be deployed on the W3C web site. That success depends on motivation, programming skill and attention to detail.
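The pull-parsing idea can be sketched in a few lines.  The example below is not ARP or Xerces: it uses Python's standard xml.etree.ElementTree.iterparse as a stand-in pull interface, and it handles only a tiny, assumed subset of RDF/XML (rdf:Description elements with an rdf:about attribute and literal-valued property children).  The pull_triples name and the sample document are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDF_DESCRIPTION = "{" + RDF_NS + "}Description"
RDF_ABOUT = "{" + RDF_NS + "}about"


def pull_triples(rdfxml_bytes):
    """Yield (subject, predicate, literal) triples from a tiny RDF/XML subset.

    Assumption: only rdf:Description elements with rdf:about and
    literal-valued property children are handled.
    """
    # iterparse hands back XML events one at a time as this loop asks for
    # them, so the RDF layer drives the XML layer within a single thread.
    events = ET.iterparse(io.BytesIO(rdfxml_bytes), events=("start", "end"))
    subject = None
    for event, elem in events:
        if elem.tag == RDF_DESCRIPTION:
            # Entering a description sets the subject; leaving it clears it.
            subject = elem.get(RDF_ABOUT) if event == "start" else None
        elif event == "end" and subject is not None:
            # Any other element closed inside a description is treated as a
            # property; its URI is the namespace followed by the local name.
            predicate = elem.tag.replace("{", "").replace("}", "")
            yield (subject, predicate, (elem.text or "").strip())


SAMPLE = b"""<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://example.org/report">
    <dc:title>Trip report</dc:title>
  </rdf:Description>
</rdf:RDF>"""

triples = list(pull_triples(SAMPLE))
```

Because the consumer asks for each XML event itself, a malformed document surfaces as an ordinary exception (ET.ParseError here) in the calling code, rather than having to be passed between threads.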

Intelligent RDF Cache

The semantic web will be populated with many schemas or ontologies which will overlap. To manage and exploit this situation it would be useful to be able to partially automate schema matching - either finding correspondence points between a given pair of schemas or finding all schemas in a set that have some non-trivial overlap with a target schema.

Approaches can be based on schema data alone or on instance data as well, can be based on element matching and/or structure matching, and can exploit dictionaries and networks of known ontology correspondences. See Rahm & Bernstein, 2001 for a review.

Many projects exploring different parts of this space in the context of the semantic web are possible. The proposed project aims at semi-automated support for the case where a library of existing schemas with some known correspondences exists and a user is attempting to discover overlaps between this library and a new schema under construction. The initial focus should be on term matching based on local constraints, known correspondences to near-neighbour concepts and heuristic concept matching using dictionaries like WordNet. If successful, extension to schema structure matching would be possible.

Skills: knowledge of ontology languages (preferably OIL, DAML or OWL), knowledge of symbolic processing techniques (e.g. graph matching, constraint satisfaction), Java programming.
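A very small sketch of dictionary-assisted term matching is given below.  It is only one possible starting point, not the proposed design: the synonyms table is a hand-written stand-in for a dictionary such as WordNet, and the function names and sample terms are invented for the example.

```python
import re


def normalise(term):
    """Split camelCase and punctuation, lower-case, return a tuple of words."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", term)
    return tuple(w.lower() for w in re.split(r"[^A-Za-z0-9]+", spaced) if w)


def suggest_matches(new_terms, library_terms, synonyms):
    """Return (new_term, library_term, reason) suggestions for user review."""

    def canonical(term):
        # Map each word to a representative synonym so that, for example,
        # "writer" and "author" compare as equal.
        return tuple(synonyms.get(word, word) for word in normalise(term))

    suggestions = []
    for new in new_terms:
        for old in library_terms:
            if normalise(new) == normalise(old):
                suggestions.append((new, old, "same label"))
            elif canonical(new) == canonical(old):
                suggestions.append((new, old, "dictionary synonym"))
    return suggestions


# Hypothetical data: a toy synonym table and two small term lists.
synonyms = {"writer": "author", "creator": "author"}
new_terms = ["documentWriter", "pub_date"]
library_terms = ["document-author", "PubDate", "title"]

matches = suggest_matches(new_terms, library_terms, synonyms)
```

The output is a list of suggestions rather than a final mapping, in keeping with the semi-automated aim: a user would confirm or reject each candidate correspondence.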

RDF Bibliography Manager

Researchers in scientific disciplines maintain extensive bibliographies of papers and books they have read, are critiquing, or are building upon. Typically, such databases are poorly organised, or are not shareable with other researchers. A semantic-web based bibliography manager could address both of these concerns, and enable new functionality such as recommendation and discovery services.

There are four phases to this project, which can be undertaken separately as time allows. The first will be to investigate the schemas currently in use for storing bibliography information, including those of common tools such as BibTeX, ProCite and EndNote. Discussing the uses and encodings of bibliographic information with the on-site research library staff would also be advantageous. The end result of this phase will be a unifying RDF/DAML ontology that allows for the consistent encoding of bibliographic data from multiple sources or formats. The second phase will be to build a set of filters that map between bibliography formats (e.g. from BibTeX, via the unified schema, to ProCite). A renderer for directly presenting bibliographic information in HTML would also be advantageous. The third phase will be to develop a set of web services that allow researchers to manage, update and share their personal bibliography database(s) via the web. The final phase will be to integrate into this personal tool the means to access remote bibliographic services, such as CiteSeer and Inspec/IEEE, and to share data with other researchers.
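The filter idea in the second phase might look something like the sketch below, which maps a simplified BibTeX entry into statements in a unified schema and then renders them as HTML.  The bib: property names, the urn:example: identifiers and the sample entry are all made up for illustration, and only the plain field = {value} form of BibTeX is recognised.

```python
import html
import re

# Hypothetical mapping from BibTeX field names to unified-schema properties.
BIBTEX_TO_UNIFIED = {
    "author": "bib:creator",
    "title": "bib:title",
    "year": "bib:date",
}


def bibtex_to_triples(entry):
    """Convert one simplified BibTeX entry to (subject, property, value)."""
    header = re.match(r"\s*@(\w+)\s*\{\s*([^,\s]+)\s*,", entry)
    if header is None:
        raise ValueError("not a BibTeX entry")
    entry_type, key = header.group(1).lower(), header.group(2)
    subject = "urn:example:" + key  # made-up identifier scheme
    triples = [(subject, "bib:type", entry_type)]
    # Only `field = {value}` without nested braces is recognised here.
    for field, value in re.findall(r"(\w+)\s*=\s*\{([^{}]*)\}", entry):
        prop = BIBTEX_TO_UNIFIED.get(field.lower())
        if prop is not None:
            triples.append((subject, prop, value.strip()))
    return triples


def triples_to_html(triples):
    """Render the unified triples for one entry as a single HTML list item."""
    values = {prop: html.escape(value) for _, prop, value in triples}
    return "<li>{creator}. <cite>{title}</cite>. {date}.</li>".format(
        creator=values.get("bib:creator", "Unknown author"),
        title=values.get("bib:title", "Untitled"),
        date=values.get("bib:date", "n.d."),
    )


# A fictional entry used only to exercise the two filters.
ENTRY = """@article{example2003,
  author = {A. Researcher},
  title = {Notes on RDF & Metadata},
  year = {2003}
}"""

triples = bibtex_to_triples(ENTRY)
citation = triples_to_html(triples)
```

Routing every format through the unified statements means that each new format needs only one reader and one writer, rather than a separate converter for every pair of formats.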

Skills: Java programming, application design (including understanding user needs).
Desirable: familiarity with bibliographic data formats or citation tools; ontology design; knowledge of DAML+OIL or OWL.