Part 1
Current work
1 motivations
The current website code is actually version 3. The first version
was a combination of scripts written in different languages. The second
was in Perl. It used one back end library, but each page was its
own separate script containing many lines of boilerplate code.
This system became unmanageable. The API to the back end changed regularly,
necessitating updates to each script. Very few reasoning capabilities
were provided. The design of the current code was intended to alleviate
these problems by providing a stable API for each page
to use and keeping boilerplate code in each individual page to a minimum.
1.1 separating reasoning/querying
It was immediately apparent that performing reasoning steps with each
query would make loading web pages prohibitively slow. Performing
reasoning steps when the data is loaded and storing the results
allows the system to answer queries quickly.
1.2 keeping a stable query api
1.3 querying based on what types of things you want to output
The querier API was designed to provide support for the sorts of queries
needed to construct a web page.
1.4 independent from what's outputted
The querier module was designed to never output HTML or any other
specific format itself. It leaves that for scripts or other supporting
libraries.
2 code overview
2.1 concept of page packs
The code that makes up the semantic web sites our lab runs is split
up into page packs. Each page pack is a set of configurable web page
generating scripts that can be included in different web sites. Each
web site has an installation file that describes what page packs make
up that site as well as other site specific information such as the
location of the database the site should use.
2.2 how pages are run
2.2.1 content negotiation
It has become common for sites offering RDF files for download to
use content negotiation to provide an HTML summary of the file to web
browsers while serving the actual RDF to scripts. To support this,
each page file contains a list of the content types it can produce, the
function in the file that produces each, and a weighting for each. The web
server module uses these and the "Accept:" string from the client
to decide which function to call.
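The selection step described above can be sketched as follows. This is only an illustration, not the site's actual module: the names parse_accept, client_quality, choose_producer, show_html, show_rdf, and PRODUCERS are hypothetical, and multiplying the page's weighting by the client's q value is an assumed scoring rule.

```python
def parse_accept(header):
    """Parse an Accept header into a list of (media_type, q) pairs."""
    prefs = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0
        for field in fields[1:]:
            name, _, value = field.partition("=")
            if name.strip() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        prefs.append((media_type, q))
    return prefs


def client_quality(content_type, prefs):
    """Return the q value of the most specific Accept entry that matches."""
    major = content_type.split("/")[0]
    best_q, best_rank = 0.0, 0
    for media_type, q in prefs:
        if media_type == content_type:
            rank = 3
        elif media_type == major + "/*":
            rank = 2
        elif media_type == "*/*":
            rank = 1
        else:
            continue
        if rank > best_rank:
            best_q, best_rank = q, rank
    return best_q


def choose_producer(producers, accept_header):
    """producers: list of (content_type, function, weight) for one page."""
    prefs = parse_accept(accept_header or "*/*")
    best, best_score = None, 0.0
    for content_type, func, weight in producers:
        # Assumed scoring rule: page weighting times client preference.
        score = weight * client_quality(content_type, prefs)
        if score > best_score:
            best, best_score = (content_type, func), score
    return best


def show_html():   # hypothetical page function
    return "<html>summary</html>"


def show_rdf():    # hypothetical page function
    return "<rdf:RDF/>"


PRODUCERS = [("text/html", show_html, 1.0),
             ("application/rdf+xml", show_rdf, 0.9)]
```

A browser that sends "text/html" ahead of a low-priority "*/*" is given the HTML function, while a script that sends only "application/rdf+xml" is given the RDF function.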
Even with inferencing being completed when data is modified instead
of when pages are displayed, the site is prone to being slow. To alleviate
this, a system of caching is used. A cached page persists until the
database changes, at which point the cache is invalidated. Each page on
the site can request to be cached (caching is not helpful for some
pages) and can control how arguments to the page affect the validity
of the cache. The cache is stored in a MySQL database.
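A minimal sketch of this caching behaviour, under stated assumptions: it uses an in-memory dictionary instead of the MySQL table, and the class name PageCache, its methods, and the key_args parameter are hypothetical. A page opts in to caching by naming the arguments that affect its output.

```python
class PageCache:
    """In-memory stand-in for the page cache (the real one is in MySQL)."""

    def __init__(self):
        self._entries = {}

    def database_changed(self):
        # Any change to the database invalidates every cached page.
        self._entries.clear()

    def get_page(self, page_name, args, render, key_args=None):
        """Return the page body, rendering it only when necessary.

        key_args is None  -> the page did not request caching.
        key_args is a list -> only those arguments affect cache validity.
        """
        if key_args is None:
            return render(args)
        key = (page_name,
               tuple(sorted((name, args.get(name)) for name in key_args)))
        if key not in self._entries:
            self._entries[key] = render(args)
        return self._entries[key]
```

With this shape, two requests that differ only in an argument the page has not listed share one cache entry, and the next request after database_changed() is rendered again.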
The menus are generated automatically for each page.
Pages can set their own title, or use the default title provided by
the page handler. The default title is found by taking the page's
URI and finding any Dublin Core titles related to it in the database.
2.3 descriptions of page packs
index
demos
index
index
codepict
The codepiction page finds paths of codepictions between people. This
is similar to the FOAF codepiction project, but it can handle the more
complex image markup data that PhotoStuff produces. Additionally,
it can find links between any two spatial things. To narrow the results
it first asks the user for a class to consider from the class hierarchy
under SpatialThing. It then provides two lists of instances of that
class. The user selects two instances and the code looks for a path
of codepictions limited to instances of that class. For example, if
the user selects foaf:Person as the class, it will only show paths
connecting the two selected people that go through pairs of people. A path that
showed that both people had been photographed at the Lincoln Memorial
would not be shown.
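One way to picture the path search is a breadth-first search in which two instances are neighbours when they appear in the same image, and only instances of the chosen class may be visited. This is a sketch under assumptions: the function name codepiction_path and the plain dictionary of depictions are hypothetical, and the source does not say which search strategy the page actually uses.

```python
from collections import deque


def codepiction_path(depictions, allowed, start, goal):
    """Find a shortest chain of shared images between start and goal.

    depictions: dict mapping image -> set of things depicted in it
    allowed:    set of instances of the class the user selected
    Returns a list of (thing, image, next_thing) steps, or None.
    """
    if start not in allowed or goal not in allowed:
        return None
    # Index the images in which each allowed instance appears.
    images_of = {}
    for image, things in depictions.items():
        for thing in things:
            if thing in allowed:
                images_of.setdefault(thing, []).append(image)
    previous = {start: None}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == goal:
            break
        for image in images_of.get(current, []):
            for other in depictions[image]:
                # Instances outside the chosen class are never traversed.
                if other in allowed and other not in previous:
                    previous[other] = (current, image)
                    queue.append(other)
    if goal not in previous:
        return None
    path = []
    node = goal
    while previous[node] is not None:
        before, image = previous[node]
        path.append((before, image, node))
        node = before
    path.reverse()
    return path
```

Because a landmark such as the Lincoln Memorial is not in the allowed set when the class is foaf:Person, a connection that exists only through it is not reported.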
find
The image search page provides the ability to search for images that
contain two specific individuals, any instance of one class and another
specific individual, or instances of two classes. The first search
is fairly straightforward, but the other two have special cases.
If the user requests all images that contain Lassie and an instance
of a dog they probably want images containing Lassie and another dog,
not images containing only Lassie (who happens to be a dog). Similarly,
if the user requests all images that contain a pet and a dog they
probably want images containing a dog and a ferret or two dogs but
not images containing one dog. The image search page accommodates
this.
If the user searches for images that contain two specific individuals,
the program finds all images that contain the first individual
and then checks which of those contain the second individual.
If the user searches for images that contain any instance of one class
and another specific individual then the program first creates a list
of all instances of the class. Then it creates a list of all images
containing the requested individual. It then winnows the list of images
by only keeping those that contain an instance of the class. Finally,
if the requested individual was a member of the class it removes images
that contain only the individual and not any other member of the class.
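The class-plus-individual search just described can be sketched as below. The function name and the dictionary of depictions are hypothetical stand-ins for the querier calls the page really makes.

```python
def images_with_individual_and_class(depictions, individual, class_members):
    """depictions: dict image -> set of things depicted in that image.
    class_members: set of all instances of the requested class."""
    # All images containing the requested individual.
    candidates = [img for img, things in depictions.items()
                  if individual in things]
    # Keep only those that contain some instance of the class.
    kept = [img for img in candidates if depictions[img] & class_members]
    # If the individual is itself in the class, require another member.
    if individual in class_members:
        kept = [img for img in kept
                if (depictions[img] & class_members) - {individual}]
    return sorted(kept)
```

With Lassie as the individual and dogs as the class, an image of Lassie alone is dropped, while an image of Lassie with a second dog is kept.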
If the user searches for images that contain instances of two classes
the program first creates lists of all the individuals in each class.
It creates a list of all images containing instances of the first
class and a list of all images containing instances of the second
class. It then creates a list of images that are in both lists. Then
the program checks whether one class is a subclass of (or the same
as) the other. If so, it checks whether each image contains two distinct
instances of the superclass. If not, that image is discarded.
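A sketch of the two-class search follows. The function name is hypothetical, the caller is assumed to already know whether one class is a subclass of (or the same as) the other, and the members of the subclass are assumed to also be listed as members of the superclass.

```python
def images_with_two_classes(depictions, members_a, members_b, related):
    """depictions: dict image -> set of things depicted in that image.
    members_a, members_b: sets of all instances of each class.
    related: True when one class is a subclass of, or equal to, the other."""
    with_a = {img for img, things in depictions.items() if things & members_a}
    with_b = {img for img, things in depictions.items() if things & members_b}
    both = with_a & with_b
    if related:
        # Assumption: the superclass's members are the union of both sets.
        superclass_members = members_a | members_b
        both = {img for img in both
                if len(depictions[img] & superclass_members) >= 2}
    return sorted(both)
```

For a search on pets and dogs, an image of a dog and a ferret or of two dogs survives the final check, while an image of a single dog does not.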
index
viewpart
index
The instance viewer is one of the most complicated parts of the website.
Other pages can use encodeResource to link to a view of any RDF
instance in the database on this page. The instance viewer has two
modes: a generic view for any RDF, and a type specific view that handles
instances of known classes more intelligently.
The generic mode
The instance viewer has a set of modules for handling different known
classes. Each handler states what classes it knows how to display.
The instance viewer compares the class of the instance it is supposed
to display to the list of classes that handlers know how to display.
The instance viewer chooses the most specific handler. For example,
if the instance viewer is asked to display an instance of an undergraduate,
it will choose the handler that says it can display people over the
handler that says it can display owl:Thing. There is in fact a handler
that displays owl:Thing to act as a fall-back.
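Handler selection might look like the sketch below, which treats "most specific" as "fewest subclass steps above the instance's class". That interpretation, the function name choose_handler, and the dictionary shapes are assumptions made for illustration.

```python
from collections import deque


def choose_handler(instance_class, superclasses, handlers):
    """Walk upward from the instance's class and return the first handler.

    superclasses: dict class -> list of direct superclasses
    handlers:     dict class -> handler that can display that class
    """
    seen = {instance_class}
    queue = deque([instance_class])
    while queue:
        cls = queue.popleft()
        if cls in handlers:
            # Breadth-first order means this is the nearest handled class.
            return handlers[cls]
        for parent in superclasses.get(cls, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return None
```

If an undergraduate class sits below foaf:Person, the person handler is reached before the owl:Thing fall-back, which is used only when nothing closer declares support.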
The instance viewer provides provenance information. Each RDF triple
in the database has a context, and most contexts have one or more source
documents associated with them. (Olco attempts to provide reasonable
sources for triples it creates by tracking which other triples were
used to infer them.) Any time a triple is accessed its context
is added to a list of used contexts. There are two ways to access
this data. One method provides all contexts used since the database
was opened. The other provides all contexts used since the last time
that method was called. This makes it possible to find out which contexts
were involved in producing a single displayed fact in the instance
viewer: call the second method, run the queries needed to produce
the display, then call the method again and use the list
of contexts it returns this time.
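The bookkeeping behind the two access methods can be sketched as follows. The class and method names are hypothetical; the real database records contexts as a side effect of triple access rather than through an explicit record call.

```python
class ContextTracker:
    """Remembers which contexts have been touched by queries."""

    def __init__(self):
        self._all = []      # every context used since the database opened
        self._recent = []   # contexts used since the last "recent" call

    def record(self, context):
        """Called whenever a triple with this context is accessed."""
        if context not in self._all:
            self._all.append(context)
        if context not in self._recent:
            self._recent.append(context)

    def all_contexts(self):
        return list(self._all)

    def contexts_since_last_call(self):
        recent, self._recent = self._recent, []
        return recent
```

To attribute one displayed fact, the viewer would call contexts_since_last_call() to reset the list, run the queries for that fact, and call it again to collect only the contexts those queries touched.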
index
The links page provides a related links or bookmarks page for the
site. The script descends through the class hierarchy rooted at a
bookmark class. It displays the class hierarchy and all instance data
where the instances are bookmarks. This is similar to the rdf map.
The management pages allow authorized users to create, edit, and delete
files containing RDF content. Additionally they allow users to edit
the list of remote RDF files that are read into the database.
The management page contains metadata that is edited and deleted through
a script rather than directly as RDF. To do this the page uses a
second RDF database just for metadata. This metadata includes each file's
authors, modification date, and type.
accept-rdf
The accept rdf page is responsible for creating new RDF files and
updating existing ones. New RDF files are numbered sequentially.
If a user submits a new RDF file it is assigned a number and saved.
The metadata the user submitted is added to the metadata database.
If a user asks to edit existing data the existing file is backed up
and the new data is copied into place. The metadata the user provides
is merged with the existing metadata for that file in the metadata
database.
dbmanage
The dbmanage script has several features. It can display statistics
about the database, redo inference on the database (remove all inferred
triples and re-infer), reload the database (completely empty the database,
reload each file, and then infer), and invalidate the page cache.
forms
The forms help users conveniently submit data. Adding some kinds of
instance data can be complicated. Users forget the necessary properties
or misspell them. The forms allow users to simply fill in blanks to
add new instances.
equate
The equate script allows users to equate multiple instances conveniently.
funding
image
names
The names script allows users to relabel instances. Querier provides
a labels function that returns a list of labels for each instance.
If an RDF label property exists, that is used. Otherwise the function
searches for appropriate sub-properties of RDF label. Unfortunately,
that means that an instance may have many labels and it's unclear
which one is best to use. The names script allows a user to add an
overriding label.
news
paper
The papers script allows users to submit bibliography information.
Because the BibTeX format is difficult to understand, this script
is particularly important. Unfortunately, the BibTeX ontology the
site uses is an old DAML ontology and is very poorly written. Because
of this, this script mostly uses hard-coded facts about BibTeX.
projects
index
The main manage script displays a table of all the RDF files stored
locally and links to edit or delete them. It also displays a table
of the outside RDF files that are included in the database.
The script iterates over each of the stored RDF files on disk and
then uses the RDF database to retrieve information about their title,
author, type, and modification date.
The script can also sort by _________ and filter by ______________.
metamanage
outside-uri
The outside-uri script handles adding and deleting URIs from a file
containing the list of URIs read into the database. It also modifies
the metadata database similarly to accept-rdf.
remove
submit
undelete
index
The index script provides an archive of news items if it is called
with no arguments. If it is passed a news item it provides a full
display of that news item. The RSS feed points at these as full descriptions
of the news items listed in it. To generate the news archive the script
uses querier to retrieve a listing of all the instances of "news
item" and sorts these by date. Then it displays a brief summary
for each item. To display a news item the script simply retrieves
the values of several properties of the item and prints them out.
rss
The rss script produces an RSS feed for the site. It retrieves a list
of all news items sorted by date and removes all items more than a
week old unless that would leave fewer than five items. For each remaining
item it retrieves the information about it that is relevant to an
RSS feed and uses a Python RSS module to add that item to an RSS object,
which is then output.
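The item selection rule can be sketched as below. The source leaves open what happens when fewer than five items are recent; this sketch assumes the five newest items are kept in that case. The function name and the (date, title) tuples are hypothetical.

```python
from datetime import date, timedelta


def select_feed_items(items, today, max_age_days=7, minimum=5):
    """items: list of (date, title) tuples. Returns the newest first."""
    newest_first = sorted(items, key=lambda item: item[0], reverse=True)
    cutoff = today - timedelta(days=max_age_days)
    recent = [item for item in newest_first if item[0] >= cutoff]
    if len(recent) >= minimum:
        return recent
    # Assumed fallback: too few recent items, so keep the newest `minimum`.
    return newest_first[:minimum]
```

On a quiet site this keeps the feed from shrinking to one or two entries, while a busy week is shown in full.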
index
The papers page is unique because HTML is not produced directly from
the RDF database. Instead, the script produces a BibTeX bibliography
database and uses a bibtex2HTML script to display that database. This
avoids the challenge of properly formatting bibliographies.
index
The people page is very similar to the RDF map, but it shows less
data.
pages
index
dump
The dump script simply serializes the main database and outputs the
serialization.
images
index
The RDF map provides an interface for browsing the class tree and
instance data. To produce the main RDF map page we begin at owl:Thing
and follow the statements about most specific subclass down through
the directed acyclic graph they represent. Unfortunately, it is difficult
to represent a DAG on a web page. Therefore we produce a tree by only
displaying each class one time. If we encounter a class a second time
in the hierarchy, we provide a link to the original listing.
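The flattening of the class DAG into a tree can be sketched as a depth-first walk that expands each class only the first time it is met. The function name is hypothetical, and the "(see above)" text stands in for the link back to the original listing.

```python
def render_class_tree(root, subclasses):
    """subclasses: dict class -> list of direct (most specific) subclasses.
    Returns indented text lines, one per listing of a class."""
    lines = []
    shown = set()

    def visit(cls, depth):
        indent = "  " * depth
        if cls in shown:
            # Second encounter: point back instead of expanding again.
            lines.append("%s%s (see above)" % (indent, cls))
            return
        shown.add(cls)
        lines.append(indent + cls)
        for child in sorted(subclasses.get(cls, [])):
            visit(child, depth + 1)

    visit(root, 0)
    return lines
```

A class with two parents is expanded under the first parent visited and appears as a one-line back reference under the second, so its own subclasses are listed only once.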
If you click on the title of a class in the RDF map you are taken
to a subpage that displays the class hierarchy rooted at that class.
Most of the code to do this is shared with the main RDF map page,
but it starts with only one class instead of all top level classes.
Additionally, these pages display instance data.
search
index
The wikis page provides a listing of available wikis. It iterates
through the list of instances of wiki and for each displays the URI,
the label, and a list of all classes that wiki is an instance of.
This provides information about whether it is public or private, active
or inactive, and so forth.
2.4 description of olco/querier
2.4.1 How olco/querier fit together
So that querier can respond to queries quickly we preprocess the data
with olco. Olco has two major tasks: rule based reasoning and canonicalization.
Olco also attempts to track provenance information during these steps.
reasoning/new assertions
symmetric properties
When olco sees a property declared symmetric it declares the property
to be its own inverse.
inverse properties
When olco sees that a property foo has an inverse bar it searches
for all triples "a foo b" and asserts "b bar a" and searches
for all triples "c bar d" and asserts "d foo c."
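The inverse rule can be sketched over a plain set of triples as below. The function name and data shapes are hypothetical; a symmetric property is handled by listing it as its own inverse, as described for symmetric properties.

```python
def apply_inverses(triples, inverses):
    """triples:  set of (subject, property, object) tuples
    inverses: list of (foo, bar) pairs, meaning foo has inverse bar.
    Returns the set of new triples the rule would assert."""
    inverse_of = {}
    for foo, bar in inverses:
        inverse_of.setdefault(foo, set()).add(bar)
        inverse_of.setdefault(bar, set()).add(foo)
    inferred = set()
    for subject, prop, obj in triples:
        for inverse in inverse_of.get(prop, ()):
            new_triple = (obj, inverse, subject)
            if new_triple not in triples:
                inferred.add(new_triple)
    return inferred
```

Because the rule always produces a triple whose own inverse is already present, a single pass over the asserted triples is enough for this rule on its own.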
functional properties/inverse functional properties
If a functional property has an inverse olco declares the inverse
to be an inverse functional property. If an inverse functional property
has an inverse olco declares the inverse to be a functional property.
inferring two things are the same with inverse functional properties
Olco uses inverse functional properties to infer that two instances
in the database are actually the same instance. If two instances have
the same value for the same inverse functional property then olco
asserts they are sameAs each other.
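A sketch of this inference, assuming triples are held as a set of tuples; the function name infer_same_as is hypothetical.

```python
def infer_same_as(triples, inverse_functional):
    """triples: set of (subject, property, object) tuples
    inverse_functional: set of inverse functional property names.
    Returns a set of (instance, instance) pairs to assert as sameAs."""
    first_seen = {}   # (property, value) -> first subject found with it
    same_as = set()
    for subject, prop, value in sorted(triples):
        if prop not in inverse_functional:
            continue
        key = (prop, value)
        if key not in first_seen:
            first_seen[key] = subject
        elif first_seen[key] != subject:
            same_as.add((first_seen[key], subject))
    return same_as
```

Two descriptions that share a foaf:mbox value are paired, while sharing a value for an ordinary property such as a name has no effect.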
canonicalization
sameAs
Because sameAs is a transitive property, querier would otherwise have
to follow the branching sameAs properties every time it returned
information about an individual. To avoid this, olco creates equivalence classes of all
individuals in the data store. It then picks one individual in each
equivalence class and treats it as the canonical version. Each individual
in that class is then declared mindswap:sameAs the canonical version.
This way querier merely needs to do two checks: one for the canonical
version, and the other for all individuals associated with that canonical
version.
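Building the equivalence classes can be sketched with a union-find structure. This is one standard technique, not necessarily the one olco uses; the function names are hypothetical, and choosing the lexicographically smallest name as the canonical version is an arbitrary rule adopted here to make the result deterministic.

```python
def canonicalize(same_as_pairs):
    """same_as_pairs: iterable of (a, b) pairs asserted or inferred sameAs.
    Returns a dict mapping every individual to its canonical version."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in same_as_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller name as the root (assumed tie-break rule).
            if root_b < root_a:
                root_a, root_b = root_b, root_a
            parent[root_b] = root_a
    return {x: find(x) for x in list(parent)}


def aliases(individual, canonical):
    """The two checks: find the canonical version, then its members."""
    representative = canonical.get(individual, individual)
    members = [x for x, rep in canonical.items() if rep == representative]
    return sorted(members) or [individual]
```

Each entry of the returned mapping corresponds to one mindswap:sameAs statement pointing at the canonical version, so a lookup never has to chase a chain of sameAs links.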
subclass
A second example is subclass relationships. Olco takes all the statements
about subclass relationships and forms a directed acyclic graph with
one root node which is owl:Thing. Any class which is not asserted
to be a subclass of anything else is declared to be a subclass of
owl:Thing.
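The rooting step can be sketched as follows, with hypothetical names and subclass statements held as (subclass, superclass) pairs.

```python
def root_at_thing(classes, subclass_of, root="owl:Thing"):
    """classes:     set of all known class names
    subclass_of: set of (subclass, superclass) statements
    Returns the statements plus a link to the root for parentless classes."""
    rooted = set(subclass_of)
    has_parent = {child for child, _parent in subclass_of}
    for cls in classes:
        if cls != root and cls not in has_parent:
            rooted.add((cls, root))
    return rooted
```

After this step every class other than owl:Thing has at least one superclass, so a walk that starts at owl:Thing reaches the whole hierarchy.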
subproperty
The subproperty hierarchy is treated the same way as the subclass
hierarchy.
provenance
As olco performs reasoning and canonicalization steps it attempts
to preserve provenance information. When olco asserts a new triple
it attempts to create a context that includes as sources all the source
documents that were used in creating that triple.
what we do correctly
For example, when olco asserts that a property is its own inverse
it uses as a source for the context the document that stated the property
is symmetric.
When olco sees a property foo with an inverse bar and a triple "a
foo b" and asserts "b bar a", it uses as sources for the context
the documents that assert that the property foo has an inverse bar
and that "a foo b."
what we do half correctly
what we don't do
3 problems encountered/things learned
3.1 dag isn't conveniently representable on a web page
One challenge we encountered is that the owl class hierarchy is a
directed acyclic graph. It is natural to represent a tree on a web
page as a nested list or outline. It is hard to represent a DAG. On
pages where this became a problem we generally chose to fully display
each class only once, and provide a note to "see above" each
subsequent time we encountered the class.
3.2 anonymous nodes are a problem
Instances represented by anonymous nodes create problems with linking.
Often a page provides a link to another page that will display specific
information about an instance. If that instance is anonymous there
is no perfect information to include in the link to locate it. Our
solution is to use the first option on this list that is available:
- A URI that represents the instance
- An inverse functional property and value
- A list of all properties and values of the instance
Because this is a very common need, we encapsulated this functionality
in two functions in querier: one to create a string suitable for use
in a link to represent a particular instance, and one to parse those
strings and return instances. In some cases the parsing function
fails, either because the instance no longer exists or, if the last option
was used, because some of the property value pairs have changed.
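The three-option fallback can be sketched with the standard urllib.parse helpers. The function names encode_instance and decode_instance, the query field names, and the descriptor tuples are all hypothetical; the real querier functions return instances from the database, whereas this sketch stops at a descriptor that a lookup would then use.

```python
from urllib.parse import urlencode, parse_qsl


def encode_instance(uri, properties, inverse_functional):
    """uri: the instance's URI, or None for an anonymous node
    properties: list of (property, value) string pairs for the instance
    inverse_functional: set of inverse functional property names."""
    if uri is not None:
        return urlencode([("uri", uri)])
    for prop, value in properties:
        if prop in inverse_functional:
            return urlencode([("ifp", prop), ("value", value)])
    # Last resort: every property and value of the instance.
    pairs = []
    for prop, value in sorted(properties):
        pairs.append(("p", prop))
        pairs.append(("v", value))
    return urlencode(pairs)


def decode_instance(query):
    """Turn a link string back into a descriptor to look up."""
    fields = parse_qsl(query, keep_blank_values=True)
    if fields and fields[0][0] == "uri":
        return ("uri", fields[0][1])
    if len(fields) >= 2 and fields[0][0] == "ifp":
        return ("ifp", fields[0][1], fields[1][1])
    values = [value for _name, value in fields]
    return ("properties", list(zip(values[0::2], values[1::2])))
```

A lookup based on the "properties" descriptor is the fragile case: if any listed value has changed since the link was generated, no instance will match.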
3.3 repairing broken source data (incorrect owl, etc.)
The web sites use data drawn from different sites and accumulated
over time. Some of it is OWL, some is actually DAML, and some is not
valid DAML or OWL. Rather than attempt to modify all this data we
chose to correct for it as much as possible when olco runs.
3.4 efficient edits
When any data is added, deleted, or edited, olco deletes all inferences.
It would be more efficient to modify the inferences based on what
has changed, but this would be much more difficult.
Part 2
Future work
File translated from TeX by TTH, version 3.60, on 27 Apr 2005, 15:47.