Putting the ICDL on the Semantic Web by Katy Newton
Purpose & Scope
The International Children's Digital Library (ICDL) is an ongoing research project whose aim is to provide access to digitized books and information to children aged 3-13. The project combines technologies for archiving, searching for, and viewing books with established principles behind kid-oriented interface design and user-driven indexing. I plan to analyze the current implementation of the ICDL and make recommendations for transforming it into a Semantic Web-compliant resource. The primary deliverable will be an ontology that represents the classification schema used by the ICDL's catalogers. All aspects of the analysis and ontology development will keep in mind the access needs and capabilities of the users, as well as the storage and representation capabilities of the library's maintainers.
Since the birth of libraries, people have had the opportunity to see and use documents that have in common their membership in a certain collection. Traditionally, operators of libraries have control over which documents become members of the collection and how those documents can be located and accessed. Access is driven by matching user-described needs (that is, the user describes some of the qualities a document must have to be of interest to him) with a classification schema. A classification schema is a set of metadata elements that describe a document in terms that are likely to be of interest to searchers. For example, a typical academic library's classification schema describes books and journals by their author/creator, title, publisher, publication date(s), standard number (ISBN or ISSN), language, and subject headings. Each of these elements, in turn, is described according to rules and practices, such as name authorities for author headings and subject classifications (typically the Library of Congress Subject Headings) for subject headings. The search interface, be it a card catalog or a fancy Web-based interface, is designed to help users describe their information need in the terms most likely to match those used in the classification schema. If a library user wants to find red books with tiny print, she will probably be out of luck because print size and cover color are not elements of the classification schema. Much of the art and science of librarianship lies in understanding the needs of the users of a collection, designing a classification schema that describes documents in such a way as to meet those needs, and finally, in presenting the schema to the user via an interface that the user can understand and navigate.
Most of the navigation of bibliographic collections is done using language. We use language to express semantics. Our meaning is in our head long before we say it or write it, but language allows us to -- as best we can -- communicate that meaning to a person or system. Our mixed blessing (good for poetry, bad for searching) is the ambiguity of human language. If I want to find documents about cars, I might be out of luck unless somebody tells me that the proper term to search for is automobiles. The practice of establishing controlled vocabularies is the librarian's attempt to remedy this problem. Controlled vocabularies, usually presented via thesauri, list the terms that might be used to describe a document, along with terms that might be thought to be used to describe a document. In the case of the latter, a "see" reference points to the term one should use when searching a classified collection.
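The "see" reference described above can be pictured as a simple lookup. The following is only a minimal sketch in Python: the terms, the names PREFERRED_TERMS and SEE_REFERENCES, and the function resolve_term are all made up for illustration and do not come from any real thesaurus.

```python
# Hypothetical controlled vocabulary: the terms catalogers actually use.
PREFERRED_TERMS = {"automobiles", "bicycles"}

# Hypothetical "see" references: terms a searcher might try, each
# pointing to the preferred term that should be used instead.
SEE_REFERENCES = {
    "cars": "automobiles",
    "autos": "automobiles",
    "bikes": "bicycles",
}


def resolve_term(term):
    """Return the preferred term for a search term, or None if it is unknown."""
    normalized = term.strip().lower()
    if normalized in PREFERRED_TERMS:
        return normalized
    # Follow the "see" reference, if one exists.
    return SEE_REFERENCES.get(normalized)


print(resolve_term("Cars"))  # automobiles
```

A searcher who types "Cars" is silently redirected to "automobiles", while a term that appears nowhere in the thesaurus returns None and would have to be handled by the search interface.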
By mapping terms in Web documents to the ontologies that describe those terms by virtue of their properties, super-classes, sub-classes, etc., the Semantic Web introduces to Web searching the concept of controlled vocabularies. The ontologies that Semantic Web programmers make available on the Web allow future authors and catalogers to point to a definition of automobile and say, "That's what I mean when I say this site is about cars!"
We have defined libraries above as collections of documents that are selected and represented by librarians, then used by people. One of the downsides of "old-fashioned" libraries is that they sit in one place. If somebody wants to use a library, he has to find a way to travel to it within the constraints that in-person operations present (like operating hours, building accessibility, distance). Plus, unless someone is generous enough to let you read over his shoulder, the materials in a brick-and-mortar library may only be used by one person at a time. Many of these constraints disappear when we begin to speak about digital libraries. Like their traditional counterparts, digital libraries consist of documents that are selected, organized, and represented by people (not always librarians, though). The difference lies in the means of access: digital library materials are not kept on shelves but on servers, where they can be searched and viewed at any time (unless the server goes down!) and from any place that can connect to the digital library server. Another powerful feature of digital libraries is that when a resource appropriately belongs to more than one category, it can be filed in multiple places at no serious cost in time, energy, or space.
Semantic Webbing the ICDL
The International Children's Digital Library is a joint project of the University of Maryland and the Internet Archive. A preliminary version of the ICDL was launched on 20 November 2002, and contains approximately 200 books. The goal of the project is to provide access to 100 books from each of 100 cultures. Books are donated to the library by various organizations from around the world, and the books are scanned and rendered in the ICDL software either through a Web browser or through Adobe Book Reader. Between the scanning and presentation of each book is the cataloging process. Books are cataloged by ICDL staff, many of whom are University of Maryland undergraduate students.
The catalog is housed in a Microsoft Access relational database, which is queried when an end user submits a search or calls up a book. The structure of this database translates well to a series of RDF ontologies, with the added benefit that metadata represented in ontologies can be shared. For instance, "people" act as both authors and illustrators. Therefore, a single ontology serves as a template for information about people, and a person's role is determined by whether that person is the object of a "writtenBy" or an "illustratedBy" triple. The ontologies are structured as follows:
- Book - the main ontology that describes books according to their metadata.
- Age - describes the three available age levels that can be assigned to a book.
- Award - describes an award by virtue of its name, the year awarded, and an id number.
- Genre - describes a genre; its instances are the genres that may be assigned to a book.
- Person - describes people by virtue of name, bio, and id number. Person is the Range for both the writtenBy and illustratedBy properties in the Book ontology.
- Publisher - describes the publisher of a book by virtue of its name, id number, and location. A location is referenced by its id number, which resolves to a geopolitical entity in the Location ontology.
- Location - describes a location by virtue of city, state, and country. A unique identifier allows common locations, such as New York, NY, USA, to be called by several different Publisher instances, instead of being repeated every time.
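To make the sharing of Person and Location instances concrete, here is a rough sketch in Python that models a few triples as plain tuples. Only the property names writtenBy and illustratedBy come from the ontologies described above; the "icdl:" prefix, the other property names, and every identifier and value are assumptions invented for this illustration, not actual ICDL data.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# All identifiers and values below are made up for illustration.
triples = [
    # One Person instance is the object of both role properties.
    Triple("book:1", "icdl:writtenBy", "person:7"),
    Triple("book:1", "icdl:illustratedBy", "person:7"),
    Triple("person:7", "icdl:name", "Example Author"),
    # Two Publisher instances point at the same Location instance.
    Triple("book:1", "icdl:publishedBy", "publisher:3"),
    Triple("book:2", "icdl:publishedBy", "publisher:4"),
    Triple("publisher:3", "icdl:location", "location:1"),
    Triple("publisher:4", "icdl:location", "location:1"),
    Triple("location:1", "icdl:city", "New York"),
    Triple("location:1", "icdl:state", "NY"),
    Triple("location:1", "icdl:country", "USA"),
]


def objects(subject, predicate):
    """Return every object of the given subject and predicate."""
    return [t.obj for t in triples
            if t.subject == subject and t.predicate == predicate]


def subjects(predicate, obj):
    """Return every subject that has the given predicate and object."""
    return [t.subject for t in triples
            if t.predicate == predicate and t.obj == obj]


# The same person fills both roles for book:1.
print(objects("book:1", "icdl:writtenBy"))
print(objects("book:1", "icdl:illustratedBy"))
# The location is stored once and shared by both publishers.
print(subjects("icdl:location", "location:1"))
```

Because the person and the location each exist only once, a correction to either one (a name spelling, a country name) would need to be made in a single place.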
In defining the elements for these ontologies, I only rarely used existing classes and properties. The processes of the ICDL staff call for the use of metadata elements that are designed to meet their needs. Some of the restrictions on classes and properties in existing ontologies might cause problems for the ICDL now or later. If the ontologies are custom-designed, their elements will be exactly what the staff need, and they can be modified whenever necessary.
See instances of ICDL books that have been marked up according to this schema:
The ontologies represented here only begin to describe all of the elements used by ICDL catalogers. A thorough fleshing out of these ontologies, including a drill down to all of the levels of the subject ontology (of which genre and age, discussed above, are among 13 main facets) would require more time and coding, but not much extra complexity. It would also be useful to use the currently cataloged books to develop an ontology of cultures, with instances from which catalogers can choose. This would prevent apples-and-oranges-type classifications, such as Japan for one book's culture and Japanese for another.
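The effect of giving catalogers a fixed set of culture instances can be sketched as follows. This is an assumption-laden illustration in Python: the Culture enumeration, its two members, and the function assign_culture are hypothetical and stand in for whatever instances a real culture ontology would contain.

```python
from enum import Enum


class Culture(Enum):
    """Hypothetical culture instances that catalogers may choose from."""
    JAPAN = "Japan"
    MEXICO = "Mexico"


def assign_culture(label):
    """Return the Culture instance for a label, rejecting anything else."""
    try:
        return Culture(label)
    except ValueError:
        allowed = ", ".join(c.value for c in Culture)
        raise ValueError(
            f"{label!r} is not a known culture; choose one of: {allowed}"
        ) from None


print(assign_culture("Japan"))  # Culture.JAPAN
```

With this in place, a cataloger who enters "Japanese" gets an error listing the allowed choices, so two books from the same culture cannot end up with different labels.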
The real utility of this process would lie in using tools with GUIs to streamline it. If a tool such as RIC could be tweaked to meet the needs of ICDL catalogers, with instances of subject, location, and publisher ontology elements available for insertion into their cataloging fields, there would be a little less room for cataloger error. Next, the instances created using book.daml could be stored in a database of RDF triples, such as Sesame, which would manage storage and retrieval of ICDL books. More senior ICDL staff could use SMORE to modify the ICDL ontologies as cataloging processes change or become more precise.