Sources for RDF data. |
by Ronald P. Reck - 2002 |
| Your mileage may vary. The following notes are more about
preparing each data set for loading into Parka than about the data sources per se.
For now, I have had to filter the generated predicates of RDF containers
out of the data, because Parka cannot yet work with more than 300 predicates.
Without that filtering, the number of unique predicates would be much larger.
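The container filtering mentioned above could look roughly like the sketch below. It assumes the data is already in N-Triples form with one statement per line, and that the "generated predicates" are the RDF container membership properties (rdf:_1, rdf:_2, and so on). The function name and the file names in the usage comment are hypothetical.

```shell
#!/bin/bash
# Drop every N-Triples statement whose predicate is an RDF container
# membership property (rdf:_1, rdf:_2, ...).
# ASSUMPTION: one statement per line, with the predicate as the second
# whitespace-separated field.
filter_container_predicates() {
    awk '$2 !~ /^<http:\/\/www\.w3\.org\/1999\/02\/22-rdf-syntax-ns#_[0-9]+>$/'
}

# Usage (hypothetical file names):
#   filter_container_predicates < input.nt > filtered.nt
```

Only the predicate field is tested, so statements that merely mention such a URI as their object are kept.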
|
musicbrainz.org -
ftp://ftp.musicbrainz.org/pub
| UNIQUE ELEMENTS - 4/09/02 |
| Predicates | 1903 |
| Subjects | 519105 |
| Objects | 964357 |
| Assertions | 2468058 |
|
| The MusicBrainz data uses the full range of characters that are
essential to Parka assertions, which requires a global search and replace:
"&" instead of "+", removal of the parentheses "(" and ")", and substitution
of "+" for "#".
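A minimal sketch of these replacements with sed follows. It rests on one reading of the description, which is an assumption: "+" becomes "&", the parentheses are dropped, and "#" becomes "+". The function name and the file names in the usage comment are hypothetical.

```shell
#!/bin/bash
# Rewrite the characters that clash with Parka assertion syntax.
# ASSUMPTION: "+" becomes "&", "(" and ")" are removed, "#" becomes "+".
# The "+" rule runs before the "#" rule; in the other order, the "+"
# characters written by the "#" rule would be turned into "&" as well.
sanitize_for_parka() {
    sed -e 's/+/\&/g' -e 's/[()]//g' -e 's/#/+/g'
}

# Usage (hypothetical file names):
#   sanitize_for_parka < musicbrainz.nt > musicbrainz.clean.nt
```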
|
|
Project Gutenberg
| Unique elements in pg.parka |
| Subjects | 5204 |
| Objects | 70857 |
| Assertions | 144660 |
| Predicates | 31 |
- creator
- date
- description
- format
- language
- pg^andCount
- pg^andFreq
- pg^asFreq
- pg^author
- pg^byCount
- pg^byFreq
- pg^characterCount
- pg^compiler
- pg^edition
- pg^etextNumber
- pg^forCount
- pg^forFreq
- pg^fromCount
- pg^fromFreq
- pg^lineCount
- pg^metaVersion
- pg^myCount
- pg^myFreq
- pg^numberSeries
- pg^releaseMonth
- pg^releaseYear
- pg^theCount
- pg^theFreq
- pg^title
- pg^wordCount
- subject
|
| DATA SOURCE | NTriples |
|
|
I am interested in getting computers to analyze
large bodies of text such as Project Gutenberg. I have
developed a method for producing string frequency
reports using only open source software.
The reports are here.
The RDF data for the PG texts comes from
a parser I wrote called
makemetafile.pl
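To illustrate the kind of figure behind predicates such as pg^theCount and pg^theFreq, here is a rough sketch built from standard Unix tools. It is not makemetafile.pl. It assumes that words are runs of letters, that matching ignores case, and that "frequency" means the count divided by the total number of words; the function name and the file name in the usage comment are hypothetical.

```shell
#!/bin/bash
# Print "<count> <frequency>" for one word found on standard input.
# ASSUMPTION: words are runs of letters, matching is case-insensitive,
# and frequency is the count divided by the total number of words.
word_stats() {
    tr -cs '[:alpha:]' '\n' | tr '[:upper:]' '[:lower:]' |
        awk -v w="$1" 'NF { total++; if ($0 == w) count++ }
            END { if (total) printf "%d %.4f\n", count, count / total }'
}

# Usage (hypothetical file name):
#   word_stats the < etext.txt
```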
|
|
RDFIG IRC logs
|
This data comes as one log file per day, so
I used the following process to fetch the entire collection:
- View the page indicated by the URL above, and save the page
locally as rdfig.html.
- To get a list of only the RDF files, type:
grep rdf rdfig.html | cut -f2 -d">" | cut -f2 -d"=" | tr -d '"' > rdfigfiles.txt
- Next, to get all the files locally, run:
#!/bin/bash
# Fetch every URL listed in rdfigfiles.txt, one per line.
while read -r file
do
wget "$file"
done < rdfigfiles.txt
|
dmoz.org
|
| Unique elements in dmoz11.parka |
| Subjects | 3384124 |
| Objects | 6637856 |
| Assertions | 19097541 |
| Predicates | 5 |
- id
- rdfExternalPage
- rdfTopic
- rdfabout
- resource
|
|
| DATA SOURCE | NTriples |
|
|
Raptor had problems parsing this data,
so I ended up removing the lines that Raptor
didn't like. For that purpose, I created
nukeline.
The limitations
of nukeline are that it removes only a single line, and you
need to know WHICH line. It took approximately 11 iterations
of running Raptor, then nukeline, then Raptor again: lather, rinse, repeat (about 3 hours on a 1.5
GHz machine with 512 MB of RAM).
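The idea of removing a single line by its number can be sketched as below. This is a stand-in for illustration, not the nukeline tool itself; the function name and the file name in the usage comment are hypothetical.

```shell
#!/bin/bash
# Remove one line, identified by its line number, from a file.
# A stand-in for illustration only, not the author's nukeline.
# Usage: delete_line FILE LINE_NUMBER
delete_line() {
    local file=$1 line=$2 tmp
    tmp=$(mktemp) || return 1
    # awk prints every line except the one whose number matches.
    awk -v n="$line" 'NR != n' "$file" > "$tmp" && mv "$tmp" "$file"
}

# Usage (hypothetical file name), after the parser reports a bad line 1234:
#   delete_line content.rdf 1234
```

Each pass rewrites the whole file, which is slow on a file this large and is one reason the loop takes hours.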
|
|
Oncogene2.daml -
http://www.cs.umd.edu/~hendler/2002/Oncogene2.daml
| Unique elements in Oncogene.parka |
| Subjects | 206 |
| Objects | 348 |
| Assertions | 1140 |
| Predicates | 12 |
- 22-rdf-syntax-ns^object
- 22-rdf-syntax-ns^predicate
- 22-rdf-syntax-ns^subject
- 22-rdf-syntax-ns^type
- Oncogene.daml^Found_In_Organism
- Oncogene.daml^Gene_Associated_With_Disease
- Oncogene.daml^Gene_Has_Function
- Oncogene.daml^In_Chromosomal_Location
- Oncogene.daml^code
- Oncogene.daml^id
- daml+oil^comment
- daml+oil^versionInfo
|
|
DATA SOURCE | |
|
|
WordNet -
http://www.semanticweb.org/library/
| Unique elements in wordnet.parka |
| Subjects | 99655 |
| Objects | 251015 |
| Assertions | 473631 |
| Predicates | 4 |
- glossaryEntry
- hyponymOf
- similarTo
- wordForm
|
|
DATA SOURCE | NTriples |
|
|
|
SEC Edgar
|
|
Lawyer Run Screen Scrape
| Unique elements in lawyerRun.parka |
| Subjects | 1812 |
| Objects | 6927 |
| Assertions | |
| Predicates | 8 |
- 22-rdf-syntax-ns^type
- running.rdf^age
- running.rdf^chipTime
- running.rdf^divisionPlace
- running.rdf^gunTime
- running.rdf^hometown
- running.rdf^pace
- running.rdf^runnerName
|
| DATA SOURCE | NTriples |
|
|
|