Demo of RDF Web Scraper Version 1.1

Here is an example of how the web scraper can be used to extract RDF from the following web page: http://www.w3.org/People/all?pictures=yes

Step 1: Dataset Classification

We start by deciding what information to scrape from the page, expressed as records and fields. In this case, each record holds information about a single person, and the corresponding fields could be the person's name, title, image URL, home page and e-mail address. We then need to parse the web page to obtain this information in tabular form (one row per record, one column per field). To do this, we look at the HTML source to see which tags can be used to delimit the data.

Step 2: Creating a Wrapper to Parse the Data

The next few paragraphs explain how the tool's parser works and how to specify tags so that the records and fields are scraped correctly. The parser is somewhat complex and may take some trial and error to understand fully, but it works in most cases.

Here is a snippet of the HTML source of the above web page:

<H3><a name="timbl%40w3.org"></a><a name="timbl">Tim Berners-Lee</A>, Director</H3>
<img border="0" src="/People/tbl" align="left" hspace="7" vspace="5" alt="">
<a href="Berners-Lee/">Tim</a>
invented the <a href="../">World Wide Web</a> in late 1990 while working at
CERN, the European Particle Physics Laboratory in Geneva, Switzerland. He
wrote the first WWW client (a browser-editor running under NeXTStep) and
the first WWW server along with most of the communications software, defining
URLs, HTTP and HTML. Prior to his work at CERN, Tim was a founding director of
Image Computer Systems, a consultant in hardware and software system design,
real-time communications graphics and text processing, and a principal
engineer with Plessey Telecommunications in Poole, England. He is a graduate
of Oxford University. <a href="Berners-Lee/">More...</a>
<p>
Tim is now the overall Director of the W3C. He is a Principal Research
Scientist at the MIT Laboratory for Computer Science.
<BR clear="all">
<ADDRESS><A href="Berners-Lee"><IMG src="/Icons/house.gif" alt="homepage" border="0" hspace="10"></A><a href="mailto:timbl@w3.org"><IMG src="/Icons/envelope.gif" alt="email" border="0"> timbl@w3.org</a></address>

As the HTML above shows, the first field we require is the person's name (Tim Berners-Lee), which appears after the tag <a name="timbl">. The next field is the title (Director), which appears after the tag </A>. The image URL (/People/tbl) appears inside the tag <img border="0" src="/People/tbl" align="left" hspace="7" vspace="5" alt="">, and the home page (Berners-Lee) appears inside the tag <A href="Berners-Lee">. The last field is the e-mail address (timbl@w3.org), which appears after the tag <IMG src="/Icons/house.gif" alt="homepage" border="0" hspace="10">.

Thus, to scrape plain text from the web page we look at the HTML tag(s) that precede it in the source, and to scrape image or hyperlink URLs we look inside the tag. Once these tags are determined, any string within the tag can be specified as the delimiter in the "Previous Tag" column of the "Fields" table, as shown in the screenshot below. These strings must be chosen carefully to prevent incorrect parsing: look at the source of the other records in the web page, note the similarities in the tags and their sequence, and choose strings that are unique within the tags.

Also note the importance of the List start/end and Record start/end tags. The List tags specify the region of the web page in which parsing occurs and can be used to exclude extraneous HTML. The Record tags specify the delimiters for a single record and, as can be seen below, multiple sets of record tags can be specified. In this case, the first three fields (name, title and image URL) lie within the first set of record tags (<H3 and <BR), so their "Associated Record #" in the Fields table is 1, while the last two fields (home page and e-mail) lie within the second set of record tags (<ADDRESS and </address), so their "Associated Record #" is 2. Finally, note that tag matching is case-sensitive.
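The delimiter-based parsing described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the tool's actual parser: the function names, the "after"/"inside" modes and the delimiter strings in FIELDS are assumptions chosen for this example (the real values are set in the tool's Fields table), and the sample HTML is an abridged copy of the snippet above.

```python
# Abridged copy of the snippet shown earlier (one record only).
SAMPLE = '''<H3><a name="timbl%40w3.org"></a><a name="timbl">Tim Berners-Lee</A>, Director</H3>
<img border="0" src="/People/tbl" align="left" hspace="7" vspace="5" alt="">
<a href="Berners-Lee/">Tim</a> invented the World Wide Web ...
<BR clear="all">
<ADDRESS><A href="Berners-Lee"><IMG src="/Icons/house.gif" alt="homepage" border="0" hspace="10"></A><a href="mailto:timbl@w3.org"><IMG src="/Icons/envelope.gif" alt="email" border="0"> timbl@w3.org</a></address>
'''

# Record start/end tag sets, as named in the text.
RECORD_TAGS = [("<H3", "<BR"), ("<ADDRESS", "</address")]

# (delimiter, mode, associated record #). Delimiters are hypothetical choices.
FIELDS = [
    ('</a><a name=', "after", 1),            # name
    ('</A>', "after", 1),                    # title
    ('<img border="0" src="', "inside", 1),  # image URL
    ('<A href="', "inside", 2),              # home page
    ('alt="email"', "after", 2),             # e-mail
]


def text_after(chunk, delim):
    """Text that follows the tag containing `delim`, up to the next tag."""
    i = chunk.find(delim)  # str.find is case-sensitive, like the tool
    if i == -1:
        return ""
    close = chunk.find(">", i + len(delim) - 1)
    if close == -1:
        return ""
    end = chunk.find("<", close + 1)
    if end == -1:
        end = len(chunk)
    return chunk[close + 1:end].strip(" ,\n\t")


def value_inside(chunk, delim):
    """Attribute value that starts right after `delim` and ends at the next quote."""
    i = chunk.find(delim)
    if i == -1:
        return ""
    start = i + len(delim)
    end = chunk.find('"', start)
    return chunk[start:end] if end != -1 else ""


def scrape(html, record_tags, fields, list_tags=None):
    """Return one row (list of field values) per record found in `html`."""
    if list_tags:  # optionally restrict parsing to the List start/end region
        s = html.find(list_tags[0])
        e = html.find(list_tags[1], s + len(list_tags[0])) if s != -1 else -1
        if s != -1 and e != -1:
            html = html[s:e]
    chunk_sets = []
    for start, end in record_tags:  # cut out the chunks for each record-tag set
        chunks, pos = [], 0
        while True:
            s = html.find(start, pos)
            e = html.find(end, s + len(start)) if s != -1 else -1
            if s == -1 or e == -1:
                break
            chunks.append(html[s:e])
            pos = e + len(end)
        chunk_sets.append(chunks)
    rows = []
    for parts in zip(*chunk_sets):  # the n-th chunk of every set forms record n
        rows.append([
            text_after(parts[rec - 1], delim) if mode == "after"
            else value_inside(parts[rec - 1], delim)
            for delim, mode, rec in fields
        ])
    return rows


print(scrape(SAMPLE, RECORD_TAGS, FIELDS))
```

Because the matching is plain substring search, a delimiter such as </A> does not match </a>, which is why the choice of delimiter strings and their case matters.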

 

Step 3: Specifying the RDF Templates and Generating the RDF

After the data has been parsed and the necessary information obtained in tabular form (see screenshot above), we enter the list of ontologies we wish to use when generating the RDF. For each ontology, we specify its prefix and corresponding URL in the boxes provided (as shown above) and click the "Add Ontology to List" button. In this example, we use the http://xmlns.com/foaf/0.1/ ontology with foaf as the prefix.

After we've specified the ontology list, we create a series of templates that are used to generate RDF from the Record Table. The template table works as follows: each row in the table corresponds to a single RDF template that is applied to every record in the Record Table. In the example above, our template has the following fields:

Type: Resource (other options being Bag/Seq/Alt, described later) - This implies that the template describes a single RDF Resource.

Name/ID: @4 - This implies that the Resource Description element (i.e. the value of rdf:Description About or rdf:Description ID) is in Column 4 of the Record Table. In this case, we've selected the home page since it is unique for each person. Alternatively, one could choose the e-mail ID as the main description element, in which case the value of this field would be @5. One could also specify a string literal as the Name/ID instead of a column of the Record Table.

@1: name - This implies that for the current resource described by this RDF template, the 'name' property has the value present in Column 1 of the Record Table. In this case, since we are using only a single namespace, the ontology prefix before the property 'name' is not specified. To be more explicit, the value of this field should be foaf:name, where foaf is the prefix of an ontology present in the List. Alternatively, one could specify a series of prefixed properties such as dc:person,foaf:name in this field and the program will automatically generate the nested RDF.

@2: title - Similar to the previous case, this implies that for the current resource described by this RDF template, the 'title' property has the value present in Column 2 of the Record Table.

@3: image - Similarly, this implies that for the current resource described by this RDF template, the 'image' property has the value present in Column 3 of the Record Table.

@4: This value must be left blank since the column is being used as the main description element (by specifying @4 in the Name/ID field).

@5: mbox - Similar to the other columns, this implies that for the current resource described by this RDF template, the 'mbox' property has the value present in Column 5 of the Record Table.
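The way such a template maps columns of the Record Table to RDF can be sketched as follows. This is a minimal illustration under stated assumptions, not the tool's own generator: the function names, the explicit foaf: prefixes and the RESOURCE_PROPS set (which decides which properties are written as rdf:resource rather than as literals) are all hypothetical, and the property names are simply those used in the template above.

```python
from xml.sax.saxutils import escape, quoteattr

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Values taken from the example in the text; the structure is an assumption.
ONTOLOGIES = {"foaf": "http://xmlns.com/foaf/0.1/"}
TEMPLATE = {
    "@1": "foaf:name",
    "@2": "foaf:title",
    "@3": "foaf:image",
    "@4": "",            # left blank: column 4 is used as the Name/ID
    "@5": "foaf:mbox",
}
RESOURCE_PROPS = {"foaf:image", "foaf:mbox"}  # hypothetical: emitted as rdf:resource
ROW = ["Tim Berners-Lee", "Director", "/People/tbl", "Berners-Lee", "timbl@w3.org"]


def cell(row, ref):
    """Resolve '@n' to column n of the row (1-based); anything else is a literal."""
    if ref.startswith("@") and ref[1:].isdigit():
        return row[int(ref[1:]) - 1]
    return ref


def apply_template(row, name_id, properties, resource_props=()):
    """Render one record as an rdf:Description element."""
    lines = [f"<rdf:Description rdf:about={quoteattr(cell(row, name_id))}>"]
    for ref, prop in properties.items():
        if not prop:  # blank entry: this column is not emitted as a property
            continue
        value = cell(row, ref)
        if prop in resource_props:
            lines.append(f"  <{prop} rdf:resource={quoteattr(value)}/>")
        else:
            lines.append(f"  <{prop}>{escape(value)}</{prop}>")
    lines.append("</rdf:Description>")
    return "\n".join(lines)


def to_rdf(rows, ontologies, name_id, properties, resource_props=()):
    """Apply the same template to every row and wrap the result in rdf:RDF."""
    ns = "".join(f'\n    xmlns:{p}="{u}"' for p, u in ontologies.items())
    body = "\n".join(
        apply_template(r, name_id, properties, resource_props) for r in rows
    )
    return f'<?xml version="1.0"?>\n<rdf:RDF xmlns:rdf="{RDF_NS}"{ns}>\n{body}\n</rdf:RDF>'


print(to_rdf([ROW], ONTOLOGIES, "@4", TEMPLATE, RESOURCE_PROPS))
```

Passing "@5" instead of "@4" as the Name/ID (and blanking the @5 entry instead of @4) would describe each person by e-mail ID, as mentioned above.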

Once we've specified the RDF templates, we click on the "View RDF" button to generate the RDF as shown in the screenshot below:

Note that the home page and image URLs have been expanded in the generated RDF (as shown above) by specifying the "Base URI" in the textbox and ticking the "Expand Relative URI" check-box. Also note that, since the parsing might not be entirely accurate due to discrepancies in web page formats and HTML code, certain cells in the Record Table might contain invalid or garbage values. These can be cleared by selecting the appropriate "Cell Selection Mode" value and clicking the invalid cells. More information on the "Cell Selection Mode" is given in the Additional Comments segment.

The entire RDF generated for the example above is available here.
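The effect of expanding relative URIs against a base URI can be reproduced with Python's standard urllib.parse.urljoin. This is a sketch of the idea only: the base URI below is assumed to be the address of the scraped page (the value actually typed into the "Base URI" textbox is not stated here), and the expand helper and its uri_columns parameter are hypothetical.

```python
from urllib.parse import urljoin

# Assumption: the base URI is the address of the page being scraped.
BASE_URI = "http://www.w3.org/People/all?pictures=yes"


def expand(row, uri_columns, base=BASE_URI):
    """Resolve the given 1-based columns of a row against the base URI."""
    return [
        urljoin(base, value) if col in uri_columns and value else value
        for col, value in enumerate(row, start=1)
    ]


row = ["Tim Berners-Lee", "Director", "/People/tbl", "Berners-Lee", "timbl@w3.org"]
print(expand(row, uri_columns={3, 4}))
```

Only the columns listed in uri_columns are touched, so plain-text fields such as the name and title are left as they are.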

 

Additional Comments:

 

 

 

 

The resultant output RDF will look something like this:

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:foaf="http://xmlns.com/foaf/0.1/">

<rdf:Description rdf:about="Berners-Lee">
<name> Tim Berners-Lee </name>
<title> Director </title>
<image rdf:resource="/People/tbl"/>
<mbox rdf:resource="timbl@w3.org"/>
<address> Cambridge MA </address>
<!-- Note: address information is only present in this record -->
</rdf:Description>

<rdf:Description rdf:about="Brewer">
<name> Judy Brewer </name>
<title> Domain Leader </title>
<image rdf:resource="/People/jb.jpg"/>
<mbox rdf:resource="jbrewer@w3.org"/>
</rdf:Description>

<rdf:Description rdf:about="Connolly">
<name> Dan Connolly </name>
<title> Technical Staff </title>
<image rdf:resource="/People/dc"/>
<mbox rdf:resource="connolly@w3.org"/>
</rdf:Description>

</rdf:RDF>
 

I hope most of the functionality of the RDF Web Scraper is clear by now. If not, another markup example (Demo 2) is available here. It covers features that aren’t covered in this section (see Additional Comments above).
