Introduction
Until the semantic web is widely adopted, a tremendous amount of information remains available on the web only as ordinary pages: telephone directories, product categories, stock quotes, weather forecasts, etc. Most of these pages are structured (generated from a common template or layout) or semi-structured (generated from a template with variations, e.g. missing attributes, attributes with multiple values, exceptions and typos). For example, Amazon lays out the author, title, comments, etc. in the same way on all its book pages. The values used to generate the pages (e.g., the author, title, ...) typically come from a database. Making full use of this existing data is an important problem in the semantic web field.
A wrapper is a procedure for extracting the content of a particular resource. In the web environment, wrappers are essential for letting machines extract data from semi-structured web pages. The most primitive approach is to write a wrapper by hand in a programming language such as Java. Unfortunately, hand-coding wrappers is tedious and error-prone. To solve this problem, many approaches to rapid wrapper induction have been proposed. With the help of these approaches, it becomes possible to use the valuable information on the web.
The rest of the presentation is arranged as follows. First, we give an introduction to the approaches to wrapper induction. Then we present some semantic web applications related to wrapper induction. Finally, we show a concrete demo on Nature documents to demonstrate one possible application.
Research on Wrapper Induction
Because wrapper induction is essential to many applications, such as content management, transformation, and search, it has received a great deal of research attention.
Kushmerick et al. [1] defined a family of wrapper classes, which essentially consist of linear finite-state transducers (FSTs) that extract data by recognizing the delimiters between attributes. They also present a combination of empirical and analytical techniques to evaluate the computational tradeoffs among the different classes. The criteria include the expressiveness and the efficiency of the wrappers. The wrappers are induced from labeled training tuples. The results indicated that most of their wrapper classes can handle 70% of surveyed sites, yet can be learned rapidly from a handful of examples. Hsu et al. [2] improved the FST approach by allowing multiple outgoing edges. Another improvement is that they use separators instead of delimiters. Separators are described by their contexts, which allows a wrapper to distinguish different attribute transitions.
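To make the delimiter idea concrete, here is a minimal Python sketch of a left-right style wrapper: one (left, right) delimiter pair per attribute, applied repeatedly across the page. The induction step is a deliberate simplification and not the learner from [1]: it assumes the left delimiter is the longest common suffix of the text before each labeled value and the right delimiter is the longest common prefix of the text after it. The function names, the sample page, and the labels are all hypothetical.

```python
import os


def common_suffix(strings):
    """Return the longest string that every element of `strings` ends with."""
    return os.path.commonprefix([s[::-1] for s in strings])[::-1]


def locate(page, rows):
    """Map labeled rows of strings to (start, end) offsets, scanning left to right."""
    spans, pos = [], 0
    for row in rows:
        row_spans = []
        for value in row:
            start = page.index(value, pos)
            pos = start + len(value)
            row_spans.append((start, pos))
        spans.append(row_spans)
    return spans


def induce_lr(pages, labeled_rows):
    """Guess one (left, right) delimiter pair per attribute from labeled pages."""
    n_attrs = len(labeled_rows[0][0])
    before = [[] for _ in range(n_attrs)]
    after = [[] for _ in range(n_attrs)]
    for page, rows in zip(pages, labeled_rows):
        for row_spans in locate(page, rows):
            for k, (start, end) in enumerate(row_spans):
                before[k].append(page[:start])   # text preceding the value
                after[k].append(page[end:])      # text following the value
    return [(common_suffix(before[k]), os.path.commonprefix(after[k]))
            for k in range(n_attrs)]


def extract_lr(page, delimiters):
    """Apply the delimiter pairs repeatedly, returning one tuple per row found."""
    if any(not left or not right for left, right in delimiters):
        raise ValueError("every delimiter must be non-empty")
    rows, pos = [], 0
    while True:
        row = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return rows
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return rows
            row.append(page[start:end])
            pos = end
        rows.append(tuple(row))


# Hypothetical training page with three labeled (country, code) tuples.
TRAINING_PAGE = ("<B>Congo</B> <I>242</I><BR>"
                 "<B>Egypt</B> <I>20</I><BR>"
                 "<B>Belize</B> <I>501</I><BR>")
LABELS = [("Congo", "242"), ("Egypt", "20"), ("Belize", "501")]

wrapper = induce_lr([TRAINING_PAGE], [LABELS])
new_rows = extract_lr("<B>Spain</B> <I>34</I><BR><B>Peru</B> <I>51</I><BR>", wrapper)
```

With too few examples this heuristic can over-fit: if every labeled code happened to start with the same digit, that digit would end up inside the induced delimiter. This is one reason a real learner validates candidate delimiters against the examples instead of taking the longest common string.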
Knoblock et al. [3] developed a hierarchical wrapper induction algorithm that learns extraction rules from examples labeled by the user. The key idea is to exploit the hierarchical structure of the source to constrain the learning problem. They also provide verification and re-induction algorithms that automatically adapt to changes in the sites from which the data is extracted. Muslea et al. [4] describe an algorithm for learning a wrapper language that, like that of Hsu et al., allows disjunction and reordered or missing attributes. The main contribution of Muslea et al. is that their language permits an arbitrary sequence of “landmarks” (e.g., “extract at the first ‘<B>’ following the next ‘<HR>’”).
Arasu et al. [5] studied the problem of automatically extracting database values from web pages without any learning examples or other human input. First, they extract equivalence classes from the input pages according to the frequencies of tokens; then an analysis module tries to generate the template and the values from the equivalence classes. They showed experimentally that the extracted values make semantic sense in most cases.
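The landmark example can be read as a small procedure: skip to each landmark in turn, then start extracting just past the last one. The Python sketch below illustrates only how such a rule is applied, not how it is learned. The function names, the end landmark, and the sample page are hypothetical.

```python
def skip_to(text, landmarks, pos=0):
    """Consume each landmark in order; return the offset just past the last one.

    Returns None if some landmark cannot be found.
    """
    for landmark in landmarks:
        found = text.find(landmark, pos)
        if found == -1:
            return None
        pos = found + len(landmark)
    return pos


def extract_after_landmarks(text, start_landmarks, end_landmark):
    """Extract the text between the end of the start rule and the end landmark."""
    start = skip_to(text, start_landmarks)
    if start is None:
        return None
    end = text.find(end_landmark, start)
    return None if end == -1 else text[start:end]


# Hypothetical page: the value of interest is the first <B> after the next <HR>.
PAGE = "<B>Menu</B><HR><I>Thai</I> <B>Pad Thai House</B><HR>"
name = extract_after_landmarks(PAGE, ["<HR>", "<B>"], "</B>")
```

Because the rule names landmarks rather than fixed positions, the bold heading before the `<HR>` is skipped even though it uses the same `<B>` tag as the target value.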
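The first step can be illustrated by grouping tokens according to how often they occur in each input page. The Python sketch below is a simplified reading of that idea, not the algorithm from [5]: it assumes a crude tokenizer (tags and whitespace-separated words) and treats tokens with identical per-page counts as one equivalence class. All names and the sample pages are hypothetical.

```python
import re
from collections import Counter, defaultdict

# Assumed tokenizer: an HTML tag, or a run of characters without '<' or whitespace.
TOKEN = re.compile(r"<[^>]+>|[^<\s]+")


def equivalence_classes(pages):
    """Group tokens by their occurrence vector (count in page 1, page 2, ...)."""
    counts = [Counter(TOKEN.findall(page)) for page in pages]
    vocabulary = set().union(*counts)
    classes = defaultdict(set)
    for token in vocabulary:
        vector = tuple(page_counts[token] for page_counts in counts)
        classes[vector].add(token)
    return dict(classes)


def likely_template_tokens(classes):
    """Tokens that occur in every page are candidates for template text."""
    tokens = set()
    for vector, members in classes.items():
        if all(count > 0 for count in vector):
            tokens |= members
    return tokens


# Two hypothetical pages generated from the same template.
PAGES = ["<h1>Book</h1><b>Title:</b> Dune",
         "<h1>Book</h1><b>Title:</b> Emma"]

classes = equivalence_classes(PAGES)
template = likely_template_tokens(classes)
```

In this toy input the tags and the words "Book" and "Title:" share the vector (1, 1), while "Dune" and "Emma" each occur in only one page, which hints that they are values rather than template text.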
Applications
Wrapper induction can be used as a lowest-level tool in semantic web applications to process large numbers of heterogeneous, distributed, and semi-structured documents. For example, in the On-to-Knowledge project [6], which focuses on ontology-based knowledge management, Fensel uses OntoWrapper / OntoExtract to extract information from both semi-structured and structured documents found in large company intranets and on the Internet.
Wrappers can also be used for ontology conversion, semantic search, cross-site content management, etc.
Demo on Nature Documents
Original Document
Wrapper
Nature Ontology
References
[1] Nicholas Kushmerick, Daniel S. Weld, Robert Doorenbos, Wrapper Induction for Information Extraction, Intl. Joint Conference on Artificial Intelligence, 1997, pp. 729-737
[2] Chun-Nan Hsu, Ming-Tzung Dung, Generating Finite-State Transducers For Semi-Structured Data Extraction From The Web, Information Systems, 1998(8), pp. 521-538
[3] Craig A. Knoblock, Kristina Lerman, Steven Minton, Ion Muslea, Accurately and Reliably Extracting Data from the Web: A Machine Learning Approach, IEEE Data Engineering Bulletin, 2000(4), pp. 33-41
[4] Ion Muslea, Steve Minton, Craig Knoblock, A Hierarchical Approach to Wrapper Induction, Proceedings of the Third International Conference on Autonomous Agents, 1999, pp. 190-197
[5] Arvind Arasu, Hector Garcia-Molina, Extracting Structured Data from Web Pages, Technical Report
[6] Dieter Fensel, Ontology-Based Knowledge Management, IEEE Computer, 2002(11), pp. 56-59
Lin:
David: A cwm N3 script for scraping documents into the ontology, so far:
@prefix na: <http://www.wam.umd.edu/~krakatoa/cs828y/second/nsua.daml#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix log: <http://www.w3.org/2000/10/swap/log#> .
@prefix str: <http://www.w3.org/2000/10/swap/string#> .
@prefix : <?#>.
# Universally quantified variables used in the rule below.
this log:forAll :nsuXmlDoc , :nsuUri , :nsuContent , :nsuArtType , :articleidlist, :body , :i ,
    :fm , :nsLead , :nsTitle, :nsCreator .
{
    # Resolve the :href literal to a resource and read its content.
    :nsuXmlDoc :href :i .
    :nsuUri log:uri :i .
    :nsuUri log:content :nsuContent .
    # Each str:scrape yields the text captured by the group in its pattern.
    (:nsuContent "<nsuarticle type=\"(.+?)\">") str:scrape :nsuArtType .
    (:nsuContent "<articleidlist>(.*)</articleidlist>") str:scrape :articleidlist .
    (:nsuContent "<fm>(.*?)</fm>") str:scrape :fm .
    # Title, author group and lead are scraped from the front matter only.
    (:fm "<title>(.*?)</title>") str:scrape :nsTitle.
    (:fm "<aug>(.*?)</aug>") str:scrape :nsCreator.
    (:fm "<standfirst>(.*?)</standfirst>") str:scrape :nsLead.
} log:implies {
    # Describe the scraped document as an na:NSUArticle.
    [ a na:NSUArticle ;
      na:articleType :nsuArtType ;
      dc:title :nsTitle ;
      na:lead :nsLead
    ] .
} .
# The document to scrape.
[ :href "http://www.mindswap.org/2002/nature/000127-6.xml" ] .
To execute this, call "python cwm natureDocuments.n3 --filter=simpleScraper.n3 --rdf > sample.xrdf". I also wrote a slightly smarter scraper, but it generates a bunch of temporary properties that I haven't figured out how to remove yet.
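The regular expressions in the script above can also be tried outside cwm. The Python sketch below reuses the same patterns, on the assumption that str:scrape behaves like a regular-expression search that returns the first captured group. The sample document is invented for illustration and is not the content of the real Nature file; the function names and dictionary keys are likewise hypothetical.

```python
import re

# Hypothetical stand-in for the fetched document content.
SAMPLE = ('<nsuarticle type="news">'
          '<articleidlist><id>000127-6</id></articleidlist>'
          '<fm><title>A hypothetical title</title><aug>A. Author</aug>'
          '<standfirst>A one-line lead.</standfirst></fm>'
          '</nsuarticle>')


def scrape(text, pattern):
    """Return the first captured group of `pattern` in `text`, or None."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


def scrape_article(content):
    """Apply the same patterns as the N3 rule; None if there is no <fm> block."""
    fm = scrape(content, r"<fm>(.*?)</fm>")
    if fm is None:
        return None
    return {
        "articleType": scrape(content, r'<nsuarticle type="(.+?)">'),
        "title": scrape(fm, r"<title>(.*?)</title>"),
        "creator": scrape(fm, r"<aug>(.*?)</aug>"),
        "lead": scrape(fm, r"<standfirst>(.*?)</standfirst>"),
    }


article = scrape_article(SAMPLE)
```

Note that in Python `.` does not match a newline unless `re.DOTALL` is passed, so with these patterns an element that spans several lines would not match.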