Part 1
Current work


1  motivations

The current website code is actually version 3. The first version was a combination of scripts written in different languages. The second was written in Perl. It used one back-end library, but each page was its own separate script containing many lines of boilerplate code. This system became unmanageable: the API to the back end changed regularly, necessitating updates to each script, and very few reasoning capabilities were provided. The design of the current code was intended to alleviate these problems by providing a stable API for each page to use and keeping boilerplate code in each individual page to a minimum.

1.1  separating reasoning/querying

It was immediately apparent that performing reasoning steps with each query would make loading web pages prohibitively slow. Performing reasoning steps when the data is loaded and storing the results allows the system to answer queries quickly.

1.2  keeping a stable query api

1.3  querying based on what types of things you want to output

The querier API was designed to provide support for the sorts of queries needed to construct a web page.

1.4  independent from what's outputted

The querier module was designed never to output HTML or any other specific format itself. It leaves that to scripts or other supporting libraries.

2  code overview

2.1  concept of page packs

The code that makes up the semantic web sites our lab runs is split up into page packs. Each page pack is a set of configurable web page generating scripts that can be included in different web sites. Each web site has an installation file that describes which page packs make up that site as well as other site-specific information, such as the location of the database the site should use.

2.2  how pages are run

2.2.1  content negotiation

It has become common for sites offering RDF files for download to use content negotiation to provide an HTML summary of the file to web browsers while serving the actual RDF to scripts. To support this, each page file contains a list of the content types it can produce, the function in the file that produces each one, and a weighting for each. The web server module uses these and the "Accept:" string from the client to decide which function to call.
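The selection step described above can be sketched as follows. This is a minimal illustration and not the site's actual web server module: the names parse_accept, specificity, client_quality, choose_producer and PRODUCERS are hypothetical, the Accept parsing is simplified, and scoring each offer as the client's q-value multiplied by the page's weighting is an assumption.

```python
def parse_accept(header):
    """Parse an Accept header into (media_type, q) pairs (simplified)."""
    prefs = []
    for part in header.split(","):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0
        for param in fields[1:]:
            if param.startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0
        prefs.append((media_type, q))
    return prefs


def specificity(pattern, content_type):
    """Rank how precisely an Accept pattern matches; -1 means no match."""
    if pattern == content_type:
        return 2
    if pattern == "*/*":
        return 0
    if pattern.endswith("/*") and content_type.startswith(pattern[:-1]):
        return 1
    return -1


def client_quality(prefs, content_type):
    """Return the q-value of the most specific matching pattern, else 0."""
    best = (-1, 0.0)
    for pattern, q in prefs:
        rank = specificity(pattern, content_type)
        if rank > best[0]:
            best = (rank, q)
    return best[1]


def choose_producer(producers, accept_header):
    """Pick the function name with the highest q * weight score.

    producers: list of (content_type, function_name, weight) tuples, the
    assumed shape of the list each page file declares.
    Returns None when nothing the page offers is acceptable.
    """
    prefs = parse_accept(accept_header) or [("*/*", 1.0)]
    best_name, best_score = None, 0.0
    for content_type, function_name, weight in producers:
        score = client_quality(prefs, content_type) * weight
        if score > best_score:
            best_name, best_score = function_name, score
    return best_name


# Hypothetical declaration for one page.
PRODUCERS = [
    ("text/html", "show_html", 1.0),
    ("application/rdf+xml", "show_rdf", 0.9),
]
```

With this declaration a browser sending "text/html" would be routed to show_html, while a script sending "application/rdf+xml" would be routed to show_rdf.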

2.2.2  caching

Even with inferencing completed when data is modified rather than when pages are displayed, the site is prone to being slow. To alleviate this, a caching system is used. A cached page persists until the database changes, at which point the cache is invalidated. Each page on the site can request to be cached (caching is not helpful for some pages) and can control how arguments to the page affect the validity of the cache. The cache is stored in a MySQL database.
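A minimal sketch of this caching behaviour, under stated assumptions: an in-memory dictionary stands in for the MySQL table, and the class name PageCache, its methods, and the idea of a page listing its "relevant" arguments are hypothetical illustrations of how arguments might affect cache validity.

```python
class PageCache:
    """Cache rendered pages until the database changes."""

    def __init__(self):
        # Maps (page name, relevant argument values) to rendered output.
        self._entries = {}

    def database_changed(self):
        """Invalidate everything; called whenever the database is modified."""
        self._entries.clear()

    def _key(self, page, args, relevant_args):
        # Only the arguments the page declares relevant take part in the key,
        # so irrelevant arguments do not cause needless cache misses.
        kept = tuple(sorted((k, v) for k, v in args.items() if k in relevant_args))
        return (page, kept)

    def fetch(self, page, args, relevant_args, render):
        """Return the cached output, calling render(args) only on a miss."""
        key = self._key(page, args, relevant_args)
        if key not in self._entries:
            self._entries[key] = render(args)
        return self._entries[key]
```

A page that should not be cached would simply call its render function directly instead of going through fetch.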

2.2.3  menus

The menus are generated automatically for each page.

2.2.4  titles

Pages can set their own title, or use the default title provided by the page handler. The default title is found by taking the page's URI and finding any Dublin Core titles related to it in the database.

2.3  descriptions of page packs

2.3.1  downloads

index  
demos  

2.3.2  extract

index  

2.3.3  funding

index  

2.3.4  images

codepict  
The codepiction page finds paths of codepictions between people. This is similar to the FOAF codepiction project, but it can handle the more complex image markup data that PhotoStuff produces. Additionally, it can find links between any two spatial things. To narrow the results, it first asks the user for a class to consider from the class hierarchy under SpatialThing. It then provides two lists of instances of that class. The user selects two instances and the code looks for a path of codepictions limited to instances of that class. For example, if the user selects foaf:Person as the class, it will only show paths connecting the two selected people in which every step is a pair of people pictured together. A path that showed that both people had been photographed at the Lincoln Memorial would not be shown.
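The class-restricted path search can be sketched as a breadth-first search. This is only an illustration under assumptions: the source does not say which search strategy the page uses, and the function name codepiction_path and the data shapes (a dictionary from image name to the set of things depicted, and a set of instances of the chosen class) are hypothetical.

```python
from collections import deque


def codepiction_path(images, members, start, goal):
    """Find a shortest chain start, image, thing, image, ..., goal.

    images:  dict mapping an image name to the set of things it depicts.
    members: instances of the class the user selected; only these may
             appear as intermediate steps on the path.
    Returns None when no such path exists.
    """
    if start not in members or goal not in members:
        return None
    previous = {start: None}  # thing -> (earlier thing, connecting image)
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == goal:
            break
        for image, depicted in sorted(images.items()):
            if current not in depicted:
                continue
            # Things outside the chosen class are never followed.
            for other in sorted(depicted & members):
                if other not in previous:
                    previous[other] = (current, image)
                    queue.append(other)
    if goal not in previous:
        return None
    path = [goal]
    node = goal
    while previous[node] is not None:
        earlier, image = previous[node]
        path[:0] = [earlier, image]
        node = earlier
    return path
```

Because the Lincoln Memorial is not a member of foaf:Person, two people who share only a photograph location would yield no path when that class is selected.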
find  
The image search page provides the ability to search for images that contain two specific individuals, any instance of one class and another specific individual, or instances of two classes. The first search is fairly straightforward, but the second two have special cases. If the user requests all images that contain Lassie and an instance of a dog they probably want images containing Lassie and another dog, not images containing only Lassie (who happens to be a dog). Similarly, if the user requests all images that contain a pet and a dog they probably want images containing a dog and a ferret or two dogs but not images containing one dog. The image search page accommodates this.
If the user searches for images that contain two specific individuals then the program finds all images that contain the first individual and then checks for which of those contain the second individual.
If the user searches for images that contain any instance of one class and another specific individual then the program first creates a list of all instances of the class. Then it creates a list of all images containing the requested individual. It then winnows the list of images by only keeping those that contain an instance of the class. Finally, if the requested individual was a member of the class it removes images that contain only the individual and not any other member of the class.
If the user searches for images that contain instances of two classes, the program first creates lists of all the individuals in each class. It creates a list of all images containing instances of the first class and a list of all images containing instances of the second class. It then creates a list of images that are in both lists. Then the program checks whether one class is a subclass of (or the same as) the other. If so, it checks whether each image contains two distinct instances of the superclass. If not, that image is discarded.
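The three searches described above can be sketched with set operations. The function names and data shapes are hypothetical (a dictionary from image name to the set of individuals depicted, and sets of class members), and the subclass test is passed in as a flag rather than looked up, so this is an illustration of the filtering logic only.

```python
def images_with_two_individuals(images, first, second):
    """Images depicting both named individuals."""
    return {name for name, shown in images.items()
            if first in shown and second in shown}


def images_with_class_and_individual(images, class_members, individual):
    """Images depicting the individual plus some other member of the class."""
    result = set()
    for name, shown in images.items():
        if individual not in shown:
            continue
        # Removing the individual handles the Lassie case: if the individual
        # belongs to the class, it may not count as its own companion.
        if (shown & class_members) - {individual}:
            result.add(name)
    return result


def images_with_two_classes(images, first_members, second_members, nested):
    """Images depicting an instance of each class.

    nested: True when one class is a subclass of (or the same as) the other.
    In that case the union of the member sets equals the superclass members,
    and the image must show two distinct instances of the superclass.
    """
    result = set()
    for name, shown in images.items():
        if not (shown & first_members and shown & second_members):
            continue
        if nested and len(shown & (first_members | second_members)) < 2:
            continue
        result.add(name)
    return result
```

For example, a search for Lassie and a dog would skip an image showing only Lassie but keep an image showing Lassie together with another dog.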
index  
viewpart  

2.3.5  instance

index  
The instance viewer is one of the most complicated parts of the website. Other pages can use encodeResource to link to a view of any RDF instance in the database on this page. The instance viewer has two modes: a generic view for any RDF, and a type specific view that handles instances of known classes more intelligently.
The generic mode
The instance viewer has a set of modules for handling different known classes. Each handler states which classes it knows how to display. The instance viewer compares the class of the instance it is supposed to display to the list of classes that handlers know how to display, and chooses the most specific handler. For example, if the instance viewer is asked to display an instance of an undergraduate, it will choose the handler that says it can display people over the handler that says it can display owl:Thing. There is in fact a handler that displays owl:Thing to act as a fall-back.
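One way to pick the most specific handler is to walk up the class hierarchy from the instance's class and take the first class that some handler claims. The sketch below assumes that "most specific" can be approximated by "nearest ancestor"; the function names, the handler names, and the class names other than foaf:Person and owl:Thing are hypothetical.

```python
from collections import deque


def ancestors_by_distance(cls, superclasses):
    """Yield cls, then its ancestors, nearest first (breadth-first)."""
    seen = {cls}
    queue = deque([cls])
    while queue:
        current = queue.popleft()
        yield current
        for parent in sorted(superclasses.get(current, ())):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)


def choose_handler(cls, handlers, superclasses):
    """Return the handler registered for the nearest class at or above cls.

    handlers:     dict mapping a class name to a handler name.
    superclasses: dict mapping a class name to its direct superclasses.
    """
    for candidate in ancestors_by_distance(cls, superclasses):
        if candidate in handlers:
            return handlers[candidate]
    return None


# Hypothetical hierarchy and handler registrations.
SUPERCLASSES = {
    "Undergraduate": {"Student"},
    "Student": {"foaf:Person"},
    "foaf:Person": {"owl:Thing"},
    "Wiki": {"owl:Thing"},
}
HANDLERS = {"foaf:Person": "person_handler", "owl:Thing": "generic_handler"}
```

Here an Undergraduate instance reaches foaf:Person before owl:Thing, so the person handler wins, while a Wiki instance falls back to the owl:Thing handler.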
The instance viewer provides provenance information. Each RDF triple in the database has a context, and most contexts have one or more source documents associated with them. (Olco attempts to provide reasonable sources for triples it creates by tracking which other triples were used to infer each one.) Any time a triple is accessed, its context is added to a list of used contexts. There are two ways to access this data. One method provides all contexts used since the database was opened. The other provides all contexts used since the last time that method was called. This makes it possible to find out which contexts were involved in producing a single displayed fact in the instance viewer: call the method, perform the queries needed to produce the display, then call the method again and use the list of contexts returned the second time.
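The two access methods can be sketched with a list and a marker position. The class and method names below are hypothetical; the real interface is not shown in this document.

```python
class ContextTracker:
    """Record the context of every triple that is accessed."""

    def __init__(self):
        self._used = []   # every context used since the database was opened
        self._mark = 0    # position reached by the previous "since last" call

    def record(self, context):
        """Called each time a triple with this context is accessed."""
        self._used.append(context)

    def all_contexts(self):
        """All contexts used since the database was opened."""
        return set(self._used)

    def contexts_since_last_call(self):
        """Contexts used since this method was last called."""
        recent = set(self._used[self._mark:])
        self._mark = len(self._used)
        return recent


tracker = ContextTracker()
tracker.record("ctx-people")
tracker.contexts_since_last_call()            # discard everything so far
tracker.record("ctx-papers")                  # queries for one displayed fact
fact_contexts = tracker.contexts_since_last_call()
```

After this sequence fact_contexts holds only the contexts touched while producing the single fact, while all_contexts() still reports both.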

2.3.6  links

index  
The links page provides a related links or bookmarks page for the site. The script descends through the class hierarchy rooted at a bookmark class. It displays the class hierarchy and all instance data where the instances are bookmarks. This is similar to the rdf map.

2.3.7  manage

The management pages allow authorized users to create, edit, and delete files containing RDF content. Additionally they allow users to edit the list of remote RDF files that are read into the database.
The management page contains metadata that is edited and deleted through a script rather than directly as RDF. To do this, the page uses a second RDF database just for metadata. This metadata includes each file's authors, modification dates, and type.
accept-rdf  
The accept rdf page is responsible for creating new RDF files and making the updates to those files. New RDF files are numbered sequentially. If a user submits a new RDF file it is assigned a number and saved. The metadata the user submitted is added to the metadata database. If a user asks to edit existing data the existing file is backed up and the new data is copied into place. The metadata the user provides is merged with the existing metadata for that file in the metadata database.
dbmanage  
The dbmanage script has several features. It can display statistics about the database, re-inference on the database (remove all inferred triples and re-infer), reload the database (completely empty the database and reload each file and then infer), and invalidate the page cache.
forms  
The forms help users conveniently submit data. Adding some kinds of instance data can be complicated. Users forget the necessary properties or misspell them. The forms allow users to simply fill in blanks to add new instances.

   equate  
The equate script allows users to equate multiple instances conveniently.

   funding  

   image  

   names  
The names script allows users to relabel instances. The querier module provides a labels function that returns a list of labels for each instance. If an RDF label property exists, that is used. Otherwise the function searches for appropriate sub-properties of RDF label. Unfortunately, that means that an instance may have many labels and it's unclear which one is best to use. The names script allows a user to add an overriding label.
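The lookup order can be sketched as below. This is an assumption-heavy illustration: the document does not show the signature of querier's labels function or say how the overriding label is stored, so the function signature, the property name ex:overrideLabel, and the tuple representation of triples are all hypothetical.

```python
def labels(instance, triples, label_subproperties,
           override_property="ex:overrideLabel"):
    """Return the labels for an instance, preferring an overriding label.

    triples:             iterable of (subject, property, value) tuples.
    label_subproperties: properties assumed to be sub-properties of the
                         RDF label property.
    """
    def values(prop):
        return sorted(o for s, p, o in triples if s == instance and p == prop)

    override = values(override_property)
    if override:
        return override              # label added through the names script
    direct = values("rdfs:label")
    if direct:
        return direct                # an explicit label property exists
    found = []
    for prop in label_subproperties:
        found.extend(values(prop))   # may produce several candidate labels
    return found
```

Without an override, an instance carrying both a name and a nickname through sub-properties gets two labels back, which is the ambiguity the names script is meant to resolve.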

   news  

   paper  
The papers script allows users to submit bibliography information. Because the BibTeX format is difficult to understand, this script is particularly important. Unfortunately, the BibTeX ontology the site uses is an old DAML ontology and is very poorly written. Because of this, the script mostly uses hard-coded facts about BibTeX.

   projects  
index  
The main manage script displays a table of all the RDF files stored locally and links to edit or delete them. It also displays a table of the outside RDF files that are included in the database.
The script iterates over each of the stored RDF files on disk and then uses the RDF database to retrieve information about their title, author, type, and modification date.
The script can also sort by _________ and filter by ______________.
metamanage  
outside-uri  
The outside-uri script handles adding and deleting URIs from a file containing the list of URIs read into the database. It also modifies the metadata database in the same way accept-rdf does.
remove  
submit  
  
undelete  

2.3.8  news

index  
The index script provides an archive of news items if it is called with no arguments. If it is passed a news item it provides a full display of that news item. The RSS feed points at these as full descriptions of the news items listed in it. To generate the news archive the script uses querier to retrieve a listing of all the instances of "news item" and sorts these by date. Then it displays a brief summary for each item. To display a news item the script simply retrieves the values of several properties of the item and prints them out.
rss  
The rss script produces an RSS feed for the site. It retrieves a list of all news items sorted by date and removes all items more than a week old unless that would leave fewer than five items. For each remaining item it retrieves the information relevant to an RSS feed and uses a Python RSS module to add that item to an RSS object, which is then output.
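The item selection rule can be sketched as follows. The function name and the (date, title) tuples are hypothetical, and the behaviour when fewer than five recent items exist is an assumption: the sketch falls back to the five newest items, which is one reading of the rule above.

```python
from datetime import date, timedelta


def select_feed_items(items, today, max_age_days=7, minimum=5):
    """Choose the news items to include in the feed, newest first.

    items: list of (date, title) tuples.
    Items older than max_age_days are dropped, unless that would leave
    fewer than `minimum` items; then the newest `minimum` items are kept.
    """
    newest_first = sorted(items, key=lambda item: item[0], reverse=True)
    cutoff = today - timedelta(days=max_age_days)
    recent = [item for item in newest_first if item[0] >= cutoff]
    if len(recent) >= minimum:
        return recent
    return newest_first[:minimum]
```

The selected items would then be handed to whatever RSS library builds the feed; that step is omitted here because the module used is not named in this document.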

2.3.9  papers

index  
The papers page is unique because HTML is not produced directly from the RDF database. Instead, the script produces a bibtex bibliography database and uses a bibtex2HTML script to display that database. This avoids the challenge of properly formatting bibliographies.

2.3.10  people

index  
The people page is very similar to the rdf map, but it shows less data.
pages  

2.3.11  projects

index  

2.3.12  rdf

dump  
The dump script simply serializes the main database and outputs the serialization.
images  
index  
The RDF map provides an interface for browsing the class tree and instance data. To produce the main RDF map page we begin at owl:Thing and follow the statements about most specific subclass down through the directed acyclic graph (DAG) they represent. Unfortunately it is difficult to represent a DAG on a web page. Therefore we produce a tree by displaying each class only once. If we encounter a class a second time in the hierarchy, we provide a link to the original listing.
If you click on the title of a class in the RDF map you are taken to a subpage that displays the class hierarchy rooted at that class. Most of the code to do this is shared with the main RDF map page, but it starts with only one class instead of all top-level classes. Additionally, these pages display instance data.
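The way the class DAG is flattened into a tree can be sketched as a depth-first walk that expands each class only the first time it is met. The function name outline and the class names in the example are hypothetical, and plain indented text lines stand in for the HTML the real page would produce.

```python
def outline(root, subclasses, depth=0, seen=None):
    """Return indented outline lines for the hierarchy rooted at root.

    subclasses: dict mapping a class to its most specific subclasses.
    A class reached a second time is not expanded again; it is listed with
    a "(see above)" note where the page would link to the first listing.
    """
    if seen is None:
        seen = set()
    indent = "  " * depth
    if root in seen:
        return [f"{indent}{root} (see above)"]
    seen.add(root)
    lines = [f"{indent}{root}"]
    for child in sorted(subclasses.get(root, ())):
        lines.extend(outline(child, subclasses, depth + 1, seen))
    return lines


# Hypothetical hierarchy in which Student has two superclasses.
SUBCLASSES = {
    "owl:Thing": ["Person", "Employee"],
    "Person": ["Student"],
    "Employee": ["Student"],
}
```

Calling outline("Person", SUBCLASSES) instead of starting at owl:Thing gives the subpage view rooted at a single class, using the same code.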
search  

2.3.13  wikis

index  
The wikis page provides a listing of available wikis. It iterates through the list of instances of wiki and for each displays the uri, the label, and a list of all classes that wiki is an instance of. This provides information about whether it is public or private, active or inactive, and so forth.

2.4  description of olco/querier

2.4.1  How olco/querier fit together

So that querier can respond to queries quickly, we preprocess the data with olco. Olco has two major tasks: rule-based reasoning and canonicalization. Olco also attempts to track provenance information during these steps.

2.4.2  Olco

reasoning/new assertions  

   symmetric properties  
When olco sees a property declared symmetric it declares the property to be its own inverse.

   inverse properties  
When olco sees that a property foo has an inverse bar it searches for all triples "a foo b" and asserts "b bar a" and searches for all triples "c bar d" and asserts "d foo c."
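The symmetric and inverse property rules can be sketched over a set of triples. This is an illustration, not olco's code: the function name is hypothetical, triples are plain tuples, and the property names in the usage example are made up. The standard OWL terms owl:SymmetricProperty and owl:inverseOf are written as prefixed strings.

```python
def apply_symmetric_and_inverse(triples):
    """Return the new triples implied by symmetric and inverse declarations."""
    inferred = set(triples)

    # A symmetric property is declared to be its own inverse.
    for s, p, o in triples:
        if p == "rdf:type" and o == "owl:SymmetricProperty":
            inferred.add((s, "owl:inverseOf", s))

    # Collect inverse pairs in both directions: foo -> bar and bar -> foo.
    inverses = set()
    for s, p, o in inferred:
        if p == "owl:inverseOf":
            inverses.add((s, o))
            inverses.add((o, s))

    # For every "a foo b" assert "b bar a"; repeat until nothing new appears.
    changed = True
    while changed:
        changed = False
        for s, p, o in list(inferred):
            for foo, bar in inverses:
                if p == foo and (o, bar, s) not in inferred:
                    inferred.add((o, bar, s))
                    changed = True
    return inferred - set(triples)
```

Because a symmetric property becomes its own inverse, the same loop that handles inverse properties also produces the mirrored triples for symmetric ones.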

   functional properties/inverse functional properties  
If a functional property has an inverse olco declares the inverse to be an inverse functional property. If an inverse functional property has an inverse olco declares the inverse to be a functional property.

   inferring two things are the same with inverse functional properties  
Olco uses inverse functional properties to infer that two instances in the database are actually the same instance. If two instances have the same value for the same inverse functional property then olco asserts they are sameAs each other.
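A minimal sketch of this rule, assuming triples are plain tuples and the set of inverse functional properties is already known. The function name is hypothetical, foaf:mbox is used in the example only as a familiar inverse functional property, and linking every instance in a group to the first one in sorted order is a simplification.

```python
from collections import defaultdict


def infer_same_as(triples, inverse_functional):
    """Assert sameAs between instances sharing an inverse functional value."""
    holders = defaultdict(set)  # (property, value) -> subjects with that value
    for s, p, o in triples:
        if p in inverse_functional:
            holders[(p, o)].add(s)

    same = set()
    for subjects in holders.values():
        ordered = sorted(subjects)
        # Linking each subject to the first is enough, since sameAs is
        # transitive and symmetric.
        for other in ordered[1:]:
            same.add((ordered[0], "owl:sameAs", other))
    return same
```

The resulting sameAs statements are the input to the canonicalization step, which groups them into equivalence classes.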
canonicalization  

   sameAs  
Because sameAs is a transitive property, querier would otherwise have to follow branching chains of sameAs statements every time it returned information about an individual. Olco therefore creates equivalence classes of all individuals in the data store. It then picks one individual in each equivalence class and treats it as the canonical version. Each individual in that class is then declared mindswap:sameAs the canonical version. This way querier merely needs to do two checks: one for the canonical version, and the other for all individuals associated with that canonical version.
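Building the equivalence classes can be sketched with a union-find structure. This is an assumption about technique, not a description of olco's code: the function name is hypothetical, and choosing the alphabetically smallest name as the canonical version is arbitrary, since the document does not say how the canonical individual is picked.

```python
def canonicalize(same_as_pairs):
    """Map every individual to a canonical member of its equivalence class.

    same_as_pairs: iterable of (a, b) pairs asserted to be sameAs.
    Returns (individual, "mindswap:sameAs", canonical) triples, including
    one for the canonical individual itself.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # shorten the chain as we go
            x = parent[x]
        return x

    for a, b in same_as_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller name as the root so the canonical choice
            # does not depend on the order of the input pairs.
            low, high = sorted((root_a, root_b))
            parent[high] = low

    return {(x, "mindswap:sameAs", find(x)) for x in list(parent)}
```

With these triples in place, a lookup needs one step to find the canonical version and one more to list everything pointing at it, instead of a walk over sameAs chains.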

   subclass  
A second example is subclass relationships. Olco takes all the statements about subclass relationships and forms a directed acyclic graph with one root node which is owl:Thing. Any class which is not asserted to be a subclass of anything else is declared to be a subclass of owl:Thing.
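The rooting step can be sketched as follows, assuming subclass statements are (child, parent) pairs; the function name and the class names in the example are hypothetical.

```python
def root_at_thing(subclass_pairs, classes):
    """Attach every parentless class to owl:Thing.

    subclass_pairs: set of (child, parent) subclass statements.
    classes:        all known class names.
    Returns the original statements plus the added owl:Thing links.
    """
    have_parent = {child for child, _ in subclass_pairs}
    added = {
        (cls, "owl:Thing")
        for cls in classes
        if cls not in have_parent and cls != "owl:Thing"
    }
    return set(subclass_pairs) | added


PAIRS = {("Student", "Person"), ("Undergraduate", "Student")}
CLASSES = {"Person", "Student", "Undergraduate", "Wiki", "owl:Thing"}
rooted = root_at_thing(PAIRS, CLASSES)
```

In this example Person and Wiki have no asserted superclass, so they are the two classes that become direct subclasses of owl:Thing.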

   subproperty  
The subproperty hierarchy is treated the same way as the subclass hierarchy.
provenance  
As olco performs reasoning and canonicalization steps it attempts to preserve provenance information. When olco asserts a new triple it attempts to create a context that includes as sources all the source documents that were used in creating that triple.

   what we do correctly  
For example, when olco asserts that a property is its own inverse it uses as a source for the context the document that stated the property is symmetric.
When olco sees a property foo with an inverse bar and a triple "a foo b" and asserts "b bar a", it uses as sources for the context the documents that assert that the property foo has an inverse bar and that "a foo b."

   what we do half correctly  

   what we don't do  

2.4.3  Querier

3  problems encountered/things learned

3.1  dag isn't conveniently representable on a web page

One challenge we encountered is that the owl class hierarchy is a directed acyclic graph. It is natural to represent a tree on a web page as a nested list or outline. It is hard to represent a DAG. On pages where this became a problem we generally chose to fully display each class only once, and provide a note to "see above" each subsequent time we encountered the class.

3.2  anonymous nodes are a problem

Instances represented by anonymous nodes create problems with linking. Often a page provides a link to another page that will display specific information about an instance. If that instance is anonymous there is no perfect information to include in the link to locate it. Our solution is to use the first option on this list that is available:
Because this is a very common property we encapsulated this functionality in two functions in querier: one to create a string suitable to use in a link to represent a particular instance and one to parse those strings and return instances. Clearly, in some cases the parsing function fails, either because the instance no longer exists or, if the last option was used, because some of the property-value pairs have changed.

3.3  repairing broken source data (incorrect owl, etc.)

The web sites use data drawn from different sites and accumulated over time. Some of it is OWL, some is actually DAML, and some is not valid DAML or OWL. Rather than attempt to modify all this data we chose to correct for it as much as possible when olco runs.

3.4  efficient edits

When any data is added, deleted, or edited, olco deletes all inferences and recomputes them. It would be more efficient to update only the inferences affected by the change, but this would be much more difficult.

Part 2
Future work





File translated from TeX by TTH, version 3.60, on 27 Apr 2005, 15:47.