The goal of the project is to explore tools that may be helpful for obtaining structure from documents. This task may include parsing the document as the initial step and representing the structures discovered in it; procedures for manipulating these structures are also of interest.
Through this project, we have explored available packages and tools for accessing structure in documents. The documents we are interested in are XML and RDF documents, i.e. structured or relatively structured documents, rather than unstructured documents or plain HTML documents. The latter require inferring the implicit structure of the document, which is more of a wrapper's task. In our case, we target the existing (explicit) structure in the document.
Trying to access or to retrieve the structure in the document naturally brings up the issue of parsing the document. Hence tools for parsing XML and RDF documents are of interest. Related to these are issues of detecting malformed documents and validation. A number of parsers have been evaluated for this purpose.
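As an illustration of detecting malformed documents, the following is a minimal sketch that uses the standard JAXP DOM parser bundled with Java. The class name `WellFormedCheck` and the method `isWellFormed` are hypothetical names chosen for this example, not part of any evaluated tool. It only checks well-formedness; validation against a DTD would be a separate step (for instance by calling `setValidating(true)` on the factory and supplying a DTD).

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class for this sketch.
class WellFormedCheck {

    /** Returns true if the XML text parses without a fatal error. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // DefaultHandler ignores warnings and recoverable errors but
            // rethrows fatal errors; it also keeps the parser from printing
            // diagnostics to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // The parser reports malformed input as a SAXException.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<a><b/></a>"));  // properly nested
        System.out.println(isWellFormed("<a><b></a>"));   // <b> is never closed
    }
}
```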
One certainly does not access a document just for the sake of accessing. The action of accessing a document intrinsically has a purpose. If it is a human user, the purpose may simply be to browse the documents or perform whatever task the user is interested in. It may also be a program that is accessing the document. This may, for instance, be a software agent performing a programmed task. We have explored several of the software options that may be of use for each of these cases.
For a human user, the tasks may be to browse existing documents or to create new documents. Hence we have looked at visualization tools that facilitate the task of reading XML or RDF documents, which are often far from human readable. We also examined a number of document editors for document creation or modification and tried to find a convenient choice for this task.
In contrast to a human user, a machine accessing the documents has no interest in perception of the documents. The actions of a machine program originate from the instructions coded by the programmer. Therefore, in this case, the key issue is to facilitate the task of programming. Hence, we have looked at a number of packages and APIs that free the programmer from the burden of tedious text processing and that naturally encapsulate the semantics of the document framework.
Below are descriptions of the programs and packages that we have found to be the most convenient in their categories.
Since XML code does not exhibit a graph structure as RDF does, there is no real visualization tool available for XML. At best, some XML editors can be mentioned in this context. The hierarchical tree-style view that they provide is the closest thing to visualization, yet it is merely a formatted display of the underlying code. For such programs, see the pointers in the section on XML editors.
We have found IsaViz to be a convenient tool for this purpose. The functionality provided is quite versatile, and the interface is very user-friendly. The only drawback that we have experienced is that the program runs slowly on our system. A reason for this may be that our system runs Java version 1.3.x while the recommended version is 1.4.x.
IsaViz is a visual environment for browsing and authoring RDF models, represented as directed graphs. Resources and literals are the nodes of the graph (ellipses and rectangles respectively), with properties represented as the edges linking these nodes.
IsaViz requires Java and Graphviz to be installed. IsaViz has the following features:
For more information or download instructions, see the following link:
http://www.w3.org/2001/11/IsaViz/
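The graph view described above (resources as ellipses, literals as rectangles, properties as labeled edges) can be approximated in a few lines of plain Java that emit Graphviz DOT text. This is only a sketch of the idea, not how IsaViz itself is implemented. The class `RdfDotSketch`, the `Triple` type, the `R:`/`L:` node-identifier prefixes and the example URIs are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: turns RDF statements into Graphviz DOT text.
class RdfDotSketch {

    /** One RDF statement; the object is either a resource URI or a literal. */
    static final class Triple {
        final String subject;
        final String predicate;
        final String object;
        final boolean objectIsLiteral;

        Triple(String subject, String predicate, String object, boolean objectIsLiteral) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
            this.objectIsLiteral = objectIsLiteral;
        }
    }

    /** Wraps a string in double quotes, escaping backslashes and quotes for DOT. */
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    /** Prefixes keep a literal and a resource with the same text apart. */
    static String nodeId(String value, boolean literal) {
        return quote((literal ? "L:" : "R:") + value);
    }

    /** Emits a node declaration the first time a node is seen. */
    static void declareNode(StringBuilder sb, Set<String> declared, String value, boolean literal) {
        String id = nodeId(value, literal);
        if (declared.add(id)) {
            sb.append("  ").append(id)
              .append(" [shape=").append(literal ? "box" : "ellipse")
              .append(", label=").append(quote(value)).append("];\n");
        }
    }

    /** Simplification: identical literals are merged into a single node. */
    static String toDot(List<Triple> triples) {
        StringBuilder sb = new StringBuilder("digraph rdf {\n");
        Set<String> declared = new LinkedHashSet<>();
        for (Triple t : triples) {
            declareNode(sb, declared, t.subject, false);
            declareNode(sb, declared, t.object, t.objectIsLiteral);
            sb.append("  ").append(nodeId(t.subject, false))
              .append(" -> ").append(nodeId(t.object, t.objectIsLiteral))
              .append(" [label=").append(quote(t.predicate)).append("];\n");
        }
        return sb.append("}\n").toString();
    }

    public static void main(String[] args) {
        List<Triple> triples = new ArrayList<>();
        triples.add(new Triple("http://example.org/doc1", "dc:title", "A Title", true));
        triples.add(new Triple("http://example.org/doc1", "dc:creator",
                "http://example.org/alice", false));
        System.out.print(toDot(triples));
    }
}
```

The resulting text can be passed to the Graphviz `dot` program to render the graph.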
Of the several XML editors that we have examined, we have found Peter's XML Editor and XMLWriter to be the most convenient tools. Both editors work on Windows platforms. They are easy to set up and have intuitive user interfaces. Both provide nearly the same functionality.
Peter's XML Editor allows the user to view the directory contents and also incorporates a search function similar to that of Windows. It is possible to view the XML as plain text, as a hierarchical tree, and in the tree-structured text format of Internet Explorer (see the following snapshot).
For more information or download instructions, see the following link:
http://www.iol.ie/~pxe/index.html
XMLWriter provides functionality similar to that of Peter's XML Editor, but differs in the following respects. XMLWriter does not have a real tree-like display of the XML code; it offers only the Internet Explorer style text format (see the following snapshot). However, it provides validation and checks for malformed code. Another convenient feature of XMLWriter is that it provides slots into which one can plug other programs. This is desirable since one may want to incorporate other XML-related programs into XMLWriter. Slots for up to ten programs are available.
For more information or download instructions, see the following link:
http://XMLWriter.net
RDFedt, created by Jan Winkler, is the tool evaluated for this purpose. RDFedt is a simple tool for creating RDF and RSS files. It supports RDF Model and Syntax, RDF Schema, the Dublin Core Element Set, RSS 1.0 and some of the RSS Modules. The goal of the program is to help RDF designers quickly and easily create RDF (or RSS) files that are valid XML, are structured, include correct namespaces, and so on. RDFedt is designed to be an RDF/RSS tool and not a general-purpose XML tool; it is not meant to handle all XML formats (for example WML or XHTML).
Altogether it is a simple tool and not a high-end program. RDFedt is not a text-editing program or a word processor, so for plain text editing you have to use other programs (such as WordPad). What it does is display input RDF files in tree format, or create new RDF documents from scratch, again in tree format. Making modifications through the GUI is straightforward and much easier to follow than text editing (see the following snapshots).
An important point is that it is not a Java program, so it is not platform independent. RDFedt works only on Windows platforms.
For more information and download instructions, see the following link:
http://www.jan-winkler.de/dev/e_rdfe.htm
We have installed and evaluated the Java XML Pack (Summer 02 bundle) from Sun. This is a versatile bundle of four packages, each geared toward a particular set of XML-related functions: JAXM 1.1_01 (Java API for XML Messaging), JAXP 1.2_01 (Java API for XML Processing), JAXR 1.0_02 (Java API for XML Registries), and JAX-RPC 1.0_01 (Java API for XML-based RPC).
The second of these, JAXP, is of interest to us. It enables applications to parse and transform XML documents using an API that is independent of a particular XML processor implementation. The version that we have installed uses Xerces 2 as its default XML parser and Xalan as its default XSLT engine. However, the pluggable architecture of JAXP allows any conformant implementation to be used. Using this software, application and tool developers can build fully functional XML-enabled Java applications.
JAXP consists of four packages, each of which includes the Java classes for certain tasks:
For more information and download instructions, see the following link:
http://java.sun.com/xml
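The parse-and-transform workflow that JAXP supports can be sketched as follows. The code obtains the parser and the XSLT engine only through the JAXP factory classes, so it does not name Xerces, Xalan or any other implementation; whichever conformant implementation is configured on the system is used. The class `JaxpSketch`, the sample document and the sample stylesheet are hypothetical and chosen only for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical sketch of parsing and transforming with JAXP.
class JaxpSketch {

    static final String SAMPLE_XML = "<doc><item>a</item><item>b</item></doc>";

    // Stylesheet that writes each item's text followed by a semicolon.
    static final String SAMPLE_XSLT =
          "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"text\"/>"
        + "<xsl:template match=\"/\">"
        + "<xsl:for-each select=\"doc/item\">"
        + "<xsl:value-of select=\".\"/><xsl:text>;</xsl:text>"
        + "</xsl:for-each>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Parses XML text into a DOM tree using the configured JAXP parser. */
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for correct XSLT processing
            DocumentBuilder builder = factory.newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    /** Applies an XSLT stylesheet to a DOM tree and returns the result as text. */
    static String transform(Document doc, String xslt) {
        try {
            TransformerFactory factory = TransformerFactory.newInstance();
            Transformer transformer =
                factory.newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("transform failed", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE_XML);
        int items = doc.getDocumentElement().getElementsByTagName("item").getLength();
        System.out.println("items: " + items);
        System.out.println(transform(doc, SAMPLE_XSLT));
    }
}
```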
The main concept of RDF is the model, which contains information about a set of resources representing objects on the web or in the real world. RDF models are parsed and serialized to enable easy storage and transportation as a file or stream. Utilities for parsing and serializing RDF have been developed. However, no standard Application Programming Interface (API) for RDF is available, so applications that design or manipulate RDF models have to rely on self-made solutions. Although it is possible to manipulate RDF models in XML format by using a standard XML API, this approach is not favorable, as XML is just one way to represent RDF.
This lack of a standard API for RDF has led to isolated solutions, which are often designed to fit a specific purpose rather than to cover all important aspects of RDF modeling. Jena, developed by the HPL Semantic Web group, addresses this problem. Jena is a Java API for manipulating RDF models. It provides a versatile toolbox that is more than sufficient for most RDF programmers.
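To illustrate why working on RDF through a generic XML API is fragile, the deliberately naive sketch below reads RDF/XML with the standard DOM parser and extracts statements only from property child elements. RDF/XML also allows the same statement to be written as a property attribute, and the naive reader misses that form entirely. The class `NaiveRdfXml`, its method and the example URIs are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical, intentionally incomplete RDF/XML reader built on DOM.
class NaiveRdfXml {

    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    static final String HEADER =
          "<rdf:RDF xmlns:rdf=\"" + RDF_NS + "\""
        + " xmlns:dc=\"http://purl.org/dc/elements/1.1/\">";

    // The same statement, written once as a property element ...
    static final String AS_ELEMENT = HEADER
        + "<rdf:Description rdf:about=\"http://example.org/doc1\">"
        + "<dc:title>A Title</dc:title>"
        + "</rdf:Description></rdf:RDF>";

    // ... and once as a property attribute.
    static final String AS_ATTRIBUTE = HEADER
        + "<rdf:Description rdf:about=\"http://example.org/doc1\""
        + " dc:title=\"A Title\"/></rdf:RDF>";

    /** Collects "subject predicate "literal"" strings from property elements only. */
    static List<String> propertyElementTriples(String rdfXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(rdfXml)));
            List<String> triples = new ArrayList<>();
            NodeList descriptions = doc.getElementsByTagNameNS(RDF_NS, "Description");
            for (int i = 0; i < descriptions.getLength(); i++) {
                Element description = (Element) descriptions.item(i);
                String subject = description.getAttributeNS(RDF_NS, "about");
                for (Node child = description.getFirstChild(); child != null;
                        child = child.getNextSibling()) {
                    if (child.getNodeType() == Node.ELEMENT_NODE) {
                        triples.add(subject + " "
                                + child.getNamespaceURI() + child.getLocalName()
                                + " \"" + child.getTextContent() + "\"");
                    }
                }
            }
            return triples;
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(propertyElementTriples(AS_ELEMENT));
        System.out.println(propertyElementTriples(AS_ATTRIBUTE));
    }
}
```

Both inputs describe the same RDF model, yet the reader finds one statement in the first and none in the second. Code written against the XML syntax has to handle every such variation itself, which an API operating at the level of the RDF model avoids.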
Jena offers the following functionality:
The latest release of the Jena toolkit integrates a number of new components, some of which are also available separately:
For more information and download instructions, see the following link:
http://www.hpl.hp.com/semweb/index.html
Experiments have been performed with the tested tools on the Nature data set. Since the tasks to be tested are relatively straightforward, the experiments have proceeded smoothly (as long as the programs tested are stable versions). The snapshots given above are taken from experiments with part of the Nature data set. The examples provided by the Wrappers group were also used.
Starting Points for Misc Tools:
XML Parsing:
RDF Parsing:
RDF Visualization:
XML Editors:
RDF Editors:
XML Packages / API:
RDF Packages / API:
We welcome any comments or suggestions.