Demo 2 of RDF Web Scraper Version 1.1

This example demonstrates how the web scraper can be used to extract RDF from the following Yahoo Yellow Pages listing of computer graphics businesses near College Park, MD:

 http://yp.yahoo.com/py/ypResults.py?&&city=College+Park&state=MD&zip=20742&country=us&slt=38.990200&sln=-76.942700&cs=5&stp=y&stx=11393295

Step 1: Dataset Classification

We start by deciding what information we want to scrape from the page, expressed in terms of records and fields. In this case, each record holds the information about a single business, and the corresponding fields are the business name, phone number, address (street, city, state), and distance from the base location (College Park, MD, in this example). We then need to parse the web page to obtain this information in tabular form (records by fields). To do so, we look at the source HTML to see which tags can be used to parse the data.
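As a rough illustration of this classification, the sketch below models one record as a small Python data class. The class name and field names are chosen here for illustration and are not part of the tool; the values are those of the first listing on the page.

```python
from dataclasses import dataclass, fields


@dataclass
class BusinessRecord:
    """One record: the information about a single business (hypothetical layout)."""
    name: str
    phone: str
    street: str
    city_state: str        # city and state still combined in one string
    distance_miles: float  # distance from the base location


# Values taken from the first listing on the example page.
example = BusinessRecord(
    name="Manifest 3d",
    phone="(301) 270-4606",
    street="1005 Sligo Creek Pkwy",
    city_state="Takoma Park, MD",
    distance_miles=2.7,
)

# The field names act as the column headers of the target table.
column_names = [f.name for f in fields(BusinessRecord)]
```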

Step 2: Creating a Wrapper to Parse the Data

Here is a snippet of the HTML code from the above web page:

<tr bgcolor="#FFFFFF">
<td>&nbsp;</td>
<td><small><b><A HREF="/py/ypMap.py?Pyt=Typ&tuid=2272932&ck=3262457826&tab=B2C&ycat=11393295&city=College+Park&state=MD&zip=20742&country=us&slt=38.990200&sln=-76.942700&cs=5&stat=:pos:0:regular:regT:20:fbT:0"><font face="Arial" size="-1">Manifest 3d</FONT></A><font size="-2"><br>
</font></b>
<font size="-1" face="Arial"><B>(301) 270-4606</b></font>
<font face="Arial" size="-1">&nbsp;</font></small>
</td>
<td bgcolor="#FFFFFF"><small><font face="Arial" size="-1">1005 Sligo Creek Pkwy<br>
<B>Takoma Park, MD</B>
&nbsp;<A HREF="/py/ypMap.py?Pyt=Typ&tuid=2272932&ck=3262457826&tab=B2C&ycat=11393295&city=College+Park&state=MD&zip=20742&country=us&slt=38.990200&sln=-76.942700&cs=5&stat=:pos:0:regular:regT:20:fbT:0">Map</a><b> <br>
</b></font></small></td>
<td align="center" bgcolor="#FFFFFF"> 
<p><font face="Arial" size="-1">2.7</font></p>
</td>
</tr>

As the HTML above shows, the first field we require is the business name (Manifest 3d), which appears after the tag <td>. (The tag does not have to be the one immediately before the text; any earlier tag that precedes the text can be used.) The next field is the phone number ((301) 270-4606), which appears after the tag <B; the next is the street address (1005 Sligo Creek Pkwy), which appears after the tag <td; the next is the city/state (Takoma Park, MD), which appears after the tag <B; and the last is the distance from the base location in miles (2.7), which appears after the tag <p.

Notice that when the original web page arranges its data in a tabular structure, determining the parsing tags becomes very easy, since they are usually a combination of the <TR, <TD, <b and <p tags.
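The tag-based parsing described above can be sketched in Python roughly as follows. This is not the tool's own code: the class name TagWrapper and its behaviour (wait for the next listed tag, then take the first non-empty text after it) are assumptions made here to illustrate the idea, and the sample row is the snippet above with its long link URLs shortened.

```python
from html.parser import HTMLParser


class TagWrapper(HTMLParser):
    """Collect, for each listed tag in order, the first non-empty text after it."""

    def __init__(self, field_tags):
        super().__init__()  # character references such as &nbsp; are decoded
        self.field_tags = [t.lower() for t in field_tags]
        self.records = []      # completed records, one list of fields each
        self._current = []     # fields gathered for the record in progress
        self._armed = False    # True once the tag for the next field was seen

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case.
        if not self._armed and tag == self.field_tags[len(self._current)]:
            self._armed = True

    def handle_data(self, data):
        text = data.strip()  # also drops the non-breaking space from &nbsp;
        if self._armed and text:
            self._current.append(text)
            self._armed = False
            if len(self._current) == len(self.field_tags):
                self.records.append(self._current)
                self._current = []


# Abridged copy of the table row shown above (link URLs shortened).
SAMPLE_ROW = """
<tr bgcolor="#FFFFFF">
<td>&nbsp;</td>
<td><small><b><A HREF="/py/ypMap.py"><font face="Arial" size="-1">Manifest 3d</FONT></A><font size="-2"><br>
</font></b>
<font size="-1" face="Arial"><B>(301) 270-4606</b></font>
<font face="Arial" size="-1">&nbsp;</font></small>
</td>
<td bgcolor="#FFFFFF"><small><font face="Arial" size="-1">1005 Sligo Creek Pkwy<br>
<B>Takoma Park, MD</B>
&nbsp;<A HREF="/py/ypMap.py">Map</a><b> <br>
</b></font></small></td>
<td align="center" bgcolor="#FFFFFF">
<p><font face="Arial" size="-1">2.7</font></p>
</td>
</tr>
"""

# Tags for: name, phone, street, city/state, distance.
wrapper = TagWrapper(["td", "b", "td", "b", "p"])
wrapper.feed(SAMPLE_ROW)
wrapper.close()
records = wrapper.records
```

Each entry of records is one business, with its fields in the order of the tag list. The "Map" link text is skipped because no field is waiting for text at that point.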

Parsing the web page thus gives us the following information:

 

Specifying Additional Text Delimiters:

Notice that in column 5 of the Record Table, the city and state appear as a single string with the ',' character as the delimiter. To parse this information further, we can specify the delimiter in the corresponding column of the Fields Table, as shown below, to obtain the city and state in two separate fields:
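In code terms, the effect of such a delimiter is roughly the following. The helper split_column is a hypothetical name used only for this sketch, and the index is zero-based within the Python list, so it does not correspond to the tool's own column numbering.

```python
def split_column(record, index, delimiter):
    """Return a copy of record with the field at index split on delimiter."""
    parts = [part.strip() for part in record[index].split(delimiter)]
    return record[:index] + parts + record[index + 1:]


record = [
    "Manifest 3d",
    "(301) 270-4606",
    "1005 Sligo Creek Pkwy",
    "Takoma Park, MD",
    "2.7",
]

# Split the combined city/state field (position 3 in this list) on ','.
expanded = split_column(record, 3, ",")
```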

 

Step 3: Specifying RDF Templates

Specifying the Ontology List and RDF templates is similar to the process explained in Demo 1. The two notable differences in this case are:

1. The use of nested properties (see screenshot above), such as "oc:Business,dc:name" in column @2 of the Template Table. The program generates the appropriate nested RDF (linked below).

2. The use of a second template that describes a Resource Collection (Bag), in this case a bag of business phone numbers. The Bag ID ("Graphics Business Phone Numbers") is entered in the corresponding column (Name/ID) of the Template Table, and the property "oc:PhoneNumber" is entered for the corresponding field (@3).
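To give a feel for what these two templates amount to, here is a minimal Python sketch that builds nested RDF/XML and a Bag with the standard library. It is only one plausible reading and not the program's actual output: the "oc" namespace URI is a placeholder, the helper names are invented for this sketch, and the spaces in the Bag ID are replaced with underscores because an rdf:ID value cannot contain spaces.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"
OC = "http://example.org/oc#"  # placeholder: the real "oc" namespace is not given
NS = {"rdf": RDF, "dc": DC, "oc": OC}
for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)


def qname(prefixed):
    """Turn 'dc:name' into ElementTree's '{namespace}name' form."""
    prefix, local = prefixed.split(":")
    return "{%s}%s" % (NS[prefix], local)


def add_nested(parent, path, value):
    """Add a template entry like 'oc:Business,dc:name' as nested properties."""
    names = path.split(",")
    node = parent
    for position, name in enumerate(names):
        node = ET.SubElement(node, qname(name))
        if position < len(names) - 1:
            # An intermediate property wraps an anonymous resource.
            node.set(qname("rdf:parseType"), "Resource")
    node.text = value
    return node


def add_bag(parent, bag_id, prop, values):
    """Add an rdf:Bag whose members each carry one prop value."""
    bag = ET.SubElement(parent, qname("rdf:Bag"))
    bag.set(qname("rdf:ID"), bag_id.replace(" ", "_"))  # assumption: sanitised ID
    for value in values:
        item = ET.SubElement(bag, qname("rdf:li"))
        item.set(qname("rdf:parseType"), "Resource")
        ET.SubElement(item, qname(prop)).text = value
    return bag


rdf_root = ET.Element(qname("rdf:RDF"))
description = ET.SubElement(rdf_root, qname("rdf:Description"))
add_nested(description, "oc:Business,dc:name", "Manifest 3d")
add_bag(rdf_root, "Graphics Business Phone Numbers", "oc:PhoneNumber",
        ["(301) 270-4606"])

rdf_xml = ET.tostring(rdf_root, encoding="unicode")
```

Printing rdf_xml shows the nested oc:Business / dc:name element and the Bag of phone numbers in a single rdf:RDF document.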

The entire RDF generated for the example above is available here.

 

I hope these two demos provide a clear picture of how this tool can be used to extract substantial RDF markup from regularly structured web pages. For further questions, suggestions, or comments, please send me an e-mail here. Thanks.

 

Aditya Kalyanpur

MIND SWAP Research Group,

University of Maryland
