Wrapper Induction

Notes

Questions/Suggestions Area

Chat

Laks :

Lin:

David : A cwm N3 script for scraping into the ontology, so far:

@prefix na: <http://www.wam.umd.edu/~krakatoa/cs828y/second/nsua.daml#> . 
@prefix dc: <http://purl.org/dc/elements/1.1/> . 
 
@prefix log: <http://www.w3.org/2000/10/swap/log#> . 
@prefix str: <http://www.w3.org/2000/10/swap/string#> . 
 
@prefix : <#> . 
 
 
this log:forAll :nsuXmlDoc , :nsuUri , :nsuContent , :nsuArtType , :articleidlist, :body , :i , 
        :fm , :nsLead , :nsTitle, :nsCreator . 
{ 
        :nsuXmlDoc :href :i . 
        :nsuUri log:uri :i . 
        :nsuUri log:content :nsuContent . 
        (:nsuContent "<nsuarticle type=\"(.+?)\">") str:scrape :nsuArtType . 
        (:nsuContent "<articleidlist>(.*?)</articleidlist>") str:scrape :articleidlist . 
        (:nsuContent "<fm>(.*?)</fm>") str:scrape :fm . 
        (:fm "<title>(.*?)</title>") str:scrape :nsTitle. 
        (:fm "<aug>(.*?)</aug>") str:scrape :nsCreator. 
         
        (:fm "<standfirst>(.*?)</standfirst>") str:scrape :nsLead. 
} log:implies { 
        [       a na:NSUArticle ; 
                na:articleType :nsuArtType ; 
                dc:title :nsTitle ; 
                na:lead :nsLead  
                ] . 
} . 
 
[ :href "http://www.mindswap.org/2002/nature/000127-6.xml" ] . 

To execute this, call "python cwm natureDocuments.n3 --filter=simpleScraper.n3 --rdf > sample.xrdf". I also wrote a slightly smarter scraper, but it generates a bunch of temporary properties that I haven't figured out how to remove yet.
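
For anyone who wants to check the regular expressions outside cwm, here is a rough Python sketch of the same extraction steps. It is only an approximation of what the rule does, not how cwm implements str:scrape. The sample XML string, the helper names (scrape, scrape_article) and the dictionary keys are made up for illustration; the real Nature file is not reproduced here.

```python
import re

# Hypothetical sample input, shaped to fit the patterns in the N3 rule.
# The real document at mindswap.org is not reproduced here.
SAMPLE = (
    '<nsuarticle type="news">'
    '<articleidlist><articleid>000127-6</articleid></articleidlist>'
    '<fm><title>Example title</title><aug>A. Author</aug>'
    '<standfirst>Example lead.</standfirst></fm>'
    '</nsuarticle>'
)


def scrape(text, pattern):
    """Return the first capture group of the first match, or None."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


def scrape_article(content):
    """Mimic the rule body: every pattern must match, or nothing is produced."""
    # The front matter is scraped first; title, creator and lead come from it.
    fm = scrape(content, r"<fm>(.*?)</fm>")
    if fm is None:
        return None
    fields = {
        "articleType": scrape(content, r'<nsuarticle type="(.+?)">'),
        "articleidlist": scrape(content, r"<articleidlist>(.*?)</articleidlist>"),
        "title": scrape(fm, r"<title>(.*?)</title>"),
        "creator": scrape(fm, r"<aug>(.*?)</aug>"),
        "lead": scrape(fm, r"<standfirst>(.*?)</standfirst>"),
    }
    # An N3 rule only fires when all antecedent statements hold, so a single
    # failed pattern means no article description at all.
    if any(value is None for value in fields.values()):
        return None
    return fields


if __name__ == "__main__":
    print(scrape_article(SAMPLE))
```

Note that the sketch returns nothing when any one pattern fails, which mirrors the rule: a document without, say, a standfirst element would yield no na:NSUArticle at all.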

