Sources for RDF data.

by Ronald P. Reck - 2002

Your mileage may vary. The following description is more about preparing this data for loading into Parka than about the data sources per se. In fact, for now I have had to filter out generated predicates from RDF containers until Parka can work with more than 300 predicates. Otherwise, the number of unique predicates would be much larger.
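
As a rough illustration of that filtering step, the sketch below drops triples whose predicate is an RDF container membership property (rdf:_1, rdf:_2, and so on). It is not the filter actually used for the data on this page. The function name filter_container_predicates is made up for the example, and it assumes N-Triples input with one triple per line and the predicate written as a full URI.

```shell
# Sketch only: remove N-Triples lines whose predicate is a container
# membership property such as rdf:_1 or rdf:_27.
# Assumption: one triple per line, predicate is the second field.
filter_container_predicates() {
    grep -Ev '^[^[:space:]]+[[:space:]]+<http://www\.w3\.org/1999/02/22-rdf-syntax-ns#_[0-9]+>[[:space:]]'
}
```

It reads standard input and writes standard output, so it could be used as "filter_container_predicates < input.nt > trimmed.nt", where both file names are placeholders.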

musicbrainz.org

- ftp://ftp.musicbrainz.org/pub

UNIQUE ELEMENTS - 4/09/02
Predicates 1903
Subjects 519105
Objects 964357
Assertions 2468058
The MusicBrainz data uses the full range of characters that are significant in Parka assertions, requiring a global search and replace of "&" instead of "+", removal of the parentheses "(" and ")", and substitution of "+" for "#".
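
A minimal sketch of two of those clean-up steps, removing parentheses and putting "+" where "#" was, could look like the following. The function name sanitize_chars is hypothetical, and this is not the procedure actually run on the data. The "&" and "+" replacement is deliberately left out, because the direction of that swap is not spelled out clearly enough here to encode safely.

```shell
# Sketch only: strip "(" and ")" and substitute "+" for "#".
# The "&" / "+" replacement is intentionally not handled here.
sanitize_chars() {
    tr -d '()' | tr '#' '+'
}
```

Like any filter, it reads standard input and writes standard output.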

Project Gutenberg

Unique elements in pg.parka
Subjects 5204
Objects 70857
Assertions 144660
Predicates 31
  • creator
  • date
  • description
  • format
  • language
  • pg^andCount
  • pg^andFreq
  • pg^asFreq
  • pg^author
  • pg^byCount
  • pg^byFreq
  • pg^characterCount
  • pg^compiler
  • pg^edition
  • pg^etextNumber
  • pg^forCount
  • pg^forFreq
  • pg^fromCount
  • pg^fromFreq
  • pg^lineCount
  • pg^metaVersion
  • pg^myCount
  • pg^myFreq
  • pg^numberSeries
  • pg^releaseMonth
  • pg^releaseYear
  • pg^theCount
  • pg^theFreq
  • pg^title
  • pg^wordCount
  • subject
DATA SOURCE NTriples
I have been interested in encouraging computers to analyze large amounts of text, such as Project Gutenberg. I have created a methodology for creating string frequency reports using entirely open source software. The reports are here. RDF data for the PG texts comes from a parser I created called makemetafile.pl.
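
For readers who want a feel for what a string frequency report involves, here is a small sketch built from standard Unix tools. It is not the methodology or the makemetafile.pl parser mentioned above. The function name word_frequencies is invented for the example, and it assumes that "strings" means case-insensitive runs of letters.

```shell
# Sketch only: print "count word" pairs, most frequent first.
# Assumption: a word is a run of letters, compared case-insensitively.
word_frequencies() {
    tr -cs '[:alpha:]' '\n' |          # one word per line
        tr '[:upper:]' '[:lower:]' |   # fold case
        grep -v '^$' |                 # drop any empty line
        sort | uniq -c |               # count identical words
        sort -k1,1nr -k2,2 |           # highest count first, then by word
        awk '{ print $1, $2 }'         # normalize spacing
}
```

It reads text on standard input, so a single etext could be piped through it to list its most common words.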

RDFIG IRC logs

Unique elements in rdfig.parka.trim
Subjects 462866
Objects 631525
Assertions 1542668
Predicates 118
  • 22-rdf-syntax-ns^_9
  • 22-rdf-syntax-ns^_90
  • 22-rdf-syntax-ns^_900
  • 22-rdf-syntax-ns^_901
  • 22-rdf-syntax-ns^_902
  • 22-rdf-syntax-ns^_903
  • 22-rdf-syntax-ns^_904
  • 22-rdf-syntax-ns^_905
  • 22-rdf-syntax-ns^_906
  • 22-rdf-syntax-ns^_907
  • 22-rdf-syntax-ns^_908
  • 22-rdf-syntax-ns^_909
  • 22-rdf-syntax-ns^_91
  • 22-rdf-syntax-ns^_910
  • 22-rdf-syntax-ns^_911
  • 22-rdf-syntax-ns^_912
  • 22-rdf-syntax-ns^_913
  • 22-rdf-syntax-ns^_914
  • 22-rdf-syntax-ns^_915
  • 22-rdf-syntax-ns^_916
  • 22-rdf-syntax-ns^_917
  • 22-rdf-syntax-ns^_918
  • 22-rdf-syntax-ns^_919
  • 22-rdf-syntax-ns^_92
  • 22-rdf-syntax-ns^_920
  • 22-rdf-syntax-ns^_921
  • 22-rdf-syntax-ns^_922
  • 22-rdf-syntax-ns^_923
  • 22-rdf-syntax-ns^_924
  • 22-rdf-syntax-ns^_925
  • 22-rdf-syntax-ns^_926
  • 22-rdf-syntax-ns^_927
  • 22-rdf-syntax-ns^_928
  • 22-rdf-syntax-ns^_929
  • 22-rdf-syntax-ns^_93
  • 22-rdf-syntax-ns^_930
  • 22-rdf-syntax-ns^_931
  • 22-rdf-syntax-ns^_932
  • 22-rdf-syntax-ns^_933
  • 22-rdf-syntax-ns^_934
  • 22-rdf-syntax-ns^_935
  • 22-rdf-syntax-ns^_936
  • 22-rdf-syntax-ns^_937
  • 22-rdf-syntax-ns^_938
  • 22-rdf-syntax-ns^_939
  • 22-rdf-syntax-ns^_94
  • 22-rdf-syntax-ns^_940
  • 22-rdf-syntax-ns^_941
  • 22-rdf-syntax-ns^_942
  • 22-rdf-syntax-ns^_943
  • 22-rdf-syntax-ns^_944
  • 22-rdf-syntax-ns^_945
  • 22-rdf-syntax-ns^_946
  • 22-rdf-syntax-ns^_947
  • 22-rdf-syntax-ns^_948
  • 22-rdf-syntax-ns^_949
  • 22-rdf-syntax-ns^_95
  • 22-rdf-syntax-ns^_950
  • 22-rdf-syntax-ns^_951
  • 22-rdf-syntax-ns^_952
  • 22-rdf-syntax-ns^_953
  • 22-rdf-syntax-ns^_954
  • 22-rdf-syntax-ns^_955
  • 22-rdf-syntax-ns^_956
  • 22-rdf-syntax-ns^_957
  • 22-rdf-syntax-ns^_958
  • 22-rdf-syntax-ns^_959
  • 22-rdf-syntax-ns^_96
  • 22-rdf-syntax-ns^_960
  • 22-rdf-syntax-ns^_961
  • 22-rdf-syntax-ns^_962
  • 22-rdf-syntax-ns^_963
  • 22-rdf-syntax-ns^_964
  • 22-rdf-syntax-ns^_965
  • 22-rdf-syntax-ns^_966
  • 22-rdf-syntax-ns^_967
  • 22-rdf-syntax-ns^_968
  • 22-rdf-syntax-ns^_969
  • 22-rdf-syntax-ns^_97
  • 22-rdf-syntax-ns^_970
  • 22-rdf-syntax-ns^_971
  • 22-rdf-syntax-ns^_972
  • 22-rdf-syntax-ns^_973
  • 22-rdf-syntax-ns^_974
  • 22-rdf-syntax-ns^_975
  • 22-rdf-syntax-ns^_976
  • 22-rdf-syntax-ns^_977
  • 22-rdf-syntax-ns^_978
  • 22-rdf-syntax-ns^_979
  • 22-rdf-syntax-ns^_98
  • 22-rdf-syntax-ns^_980
  • 22-rdf-syntax-ns^_981
  • 22-rdf-syntax-ns^_982
  • 22-rdf-syntax-ns^_983
  • 22-rdf-syntax-ns^_984
  • 22-rdf-syntax-ns^_985
  • 22-rdf-syntax-ns^_986
  • 22-rdf-syntax-ns^_987
  • 22-rdf-syntax-ns^_988
  • 22-rdf-syntax-ns^_989
  • 22-rdf-syntax-ns^_99
  • 22-rdf-syntax-ns^_990
  • 22-rdf-syntax-ns^_991
  • 22-rdf-syntax-ns^_992
  • 22-rdf-syntax-ns^_993
  • 22-rdf-syntax-ns^_994
  • 22-rdf-syntax-ns^_995
  • 22-rdf-syntax-ns^_996
  • 22-rdf-syntax-ns^_997
  • 22-rdf-syntax-ns^_998
  • 22-rdf-syntax-ns^_999
  • 22-rdf-syntax-ns^type
  • chatEventList
  • creator
  • date
  • description
  • nick
  • relation
DATA SOURCE NTriples
This data comes as one log file per day, so the following process worked for fetching the entire collection:
  1. View the page indicated by the URL above, and save the page locally.
  2. To get a list of only the rdf files, type:
    grep rdf rdfig.html |cut -f2 -d">" |cut -f2 -d"=" |tr -d '"' >rdfigfiles.txt
  3. Next, to get all the files locally, run this script:
    #!/bin/bash
    # Fetch every entry listed in rdfigfiles.txt, one per line.
    while read -r file
    do
        wget "$file"
    done < rdfigfiles.txt

dmoz.org

Unique elements in dmoz11.parka
Subjects 3384124
Objects 6637856
Assertions 19097541
Predicates 5
  • id
  • rdfExternalPage
  • rdfTopic
  • rdfabout
  • resource
DATA SOURCE NTriples
There were problems parsing this with Raptor, so I ended up removing the lines containing data that Raptor didn't like. For that purpose, I created nukeline. The limitations of nukeline are that it only removes a single line, and you need to know WHICH line. It took approximately 11 iterations of running Raptor, then nukeline, then Raptor, lather, rinse, repeat (about 3 hours on a 1.5 GHz machine with 512 MB of RAM).
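
The idea behind nukeline is simple enough to sketch in a few lines of shell. The function below is a hypothetical stand-in, not the nukeline tool itself: the name delete_line and its argument order are assumptions made for the example. It removes one numbered line from a file in place.

```shell
# Hypothetical stand-in for nukeline (not the original tool):
# delete line number $2 from the file named by $1, in place.
delete_line() {
    dl_file=$1
    dl_lineno=$2
    # Refuse anything that is not a positive whole number.
    case $dl_lineno in
        ''|*[!0-9]*|0) echo "usage: delete_line FILE LINENO" >&2; return 2 ;;
    esac
    dl_tmp=$(mktemp) || return 1
    # Write every line except the unwanted one to a temp file, then copy
    # it back so the original file keeps its permissions.
    sed "${dl_lineno}d" "$dl_file" > "$dl_tmp" && cat "$dl_tmp" > "$dl_file"
    dl_status=$?
    rm -f "$dl_tmp"
    return $dl_status
}
```

The cycle described above then becomes: run the parser, note the line number it complains about, call delete_line with that number, and run the parser again.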

Oncogene2.daml

- http://www.cs.umd.edu/~hendler/2002/Oncogene2.daml

Unique elements in Oncogene.parka
Subjects 206
Objects 348
Assertions 1140
Predicates 12
  • 22-rdf-syntax-ns^object
  • 22-rdf-syntax-ns^predicate
  • 22-rdf-syntax-ns^subject
  • 22-rdf-syntax-ns^type
  • Oncogene.daml^Found_In_Organism
  • Oncogene.daml^Gene_Associated_With_Disease
  • Oncogene.daml^Gene_Has_Function
  • Oncogene.daml^In_Chromosomal_Location
  • Oncogene.daml^code
  • Oncogene.daml^id
  • daml+oil^comment
  • daml+oil^versionInfo
DATA SOURCE

WordNet

- http://www.semanticweb.org/library/

Unique elements in wordnet.parka
Subjects 99655
Objects 251015
Assertions 473631
Predicates 4
  • glossaryEntry
  • hyponymOf
  • similarTo
  • wordForm
DATA SOURCE NTriples


SEC Edgar


Unique elements in sec.parka
Subjects 8486
Objects 85631
Assertions 434983
Predicates 118
  • description
  • sec^0
  • sec^3.50
  • sec^3Months3-12Months1-5YearsOver5YearsTotal
  • sec^ACCESSION-NUMBER
  • sec^ACT
  • sec^ALLOWANCES
  • sec^ARTICLE
  • sec^ASSIGNED-SIC
  • sec^BARCHARTHERE
  • sec^BONDS
  • sec^BUSINESS-ADDRESS
  • sec^C
  • sec^CAPTON
  • sec^CASH
  • sec^CGS
  • sec^CHANGES
  • sec^CIK
  • sec^CITY
  • sec^COMMON
  • sec^COMPANY-DATA
  • sec^CONFIRMING-COPY
  • sec^CONFORMED-NAME
  • sec^CURRENT-ASSETS
  • sec^CURRENT-LIABILITIES
  • sec^Caption
  • sec^DATE-CHANGED
  • sec^DATE-OF-FILING-DATE-CHANGE
  • sec^DEPRECIATION
  • sec^DISCONTINUED
  • sec^DOCUMENT
  • sec^DOCUMENT-COUNT
  • sec^EFFECTIVENESS-DATE
  • sec^EPS-BASIC
  • sec^EPS-DILUTED
  • sec^EPS-PRIMARY
  • sec^EXTRAORDINARY
  • sec^F1
  • sec^FILE-NUMBER
  • sec^FILED-BY
  • sec^FILENAME
  • sec^FILER
  • sec^FILING-DATE
  • sec^FILING-VALUES
  • sec^FILM-NUMBER
  • sec^FISCAL-YEAR-END
  • sec^FN
  • sec^FORM-TYPE
  • sec^FORMER-COMPANY
  • sec^FORMER-CONFORMED-NAME
  • sec^GROUP-MEMBERS
  • sec^INCOME-CONTINUING
  • sec^INCOME-PRETAX
  • sec^INCOME-TAX
  • sec^INTEREST-EXPENSE
  • sec^INVENTORY
  • sec^IRS-NUMBER
  • sec^ITEMS
  • sec^Ifshipdate30days-rescheduleupto30days
  • sec^IncomebeforetaxesNote
  • sec^InvestmentsatmarketvalueF1
  • sec^LOSS-PROVISION
  • sec^MAIL-ADDRESS
  • sec^NET-INCOME
  • sec^NOTIFY-INTERNET
  • sec^NetincomeNote
  • sec^Note
  • sec^OTHER-EXPENSES
  • sec^OTHER-SE
  • sec^OutlookF1
  • sec^PAPER
  • sec^PDF
  • sec^PERIOD
  • sec^PERIOD-END
  • sec^PERIOD-START
  • sec^PERIOD-TYPE
  • sec^PHONE
  • sec^PP-E
  • sec^PREFERRED
  • sec^PREFERRED-MANDATORY
  • sec^PUBLIC-DOCUMENT-COUNT
  • sec^Page
  • sec^Page2
  • sec^R
  • sec^RECEIVABLES
  • sec^REFERENCES-429
  • sec^REGISTEREDREGISTEREDSHAREF1
  • sec^RELATIONSHIP
  • sec^REPORTING-OWNER
  • sec^Requirement0.50to1.0
  • sec^SALES
  • sec^SCRIPPSCO.E.W.
  • sec^SECURITIES
  • sec^SEQUENCE
  • sec^SERIAL-COMPANY
  • sec^STATE
  • sec^STATE-OF-INCORPORATION
  • sec^STREET1
  • sec^STREET2
  • sec^SUBJECT-COMPANY
  • sec^TEXT
  • sec^TOTAL-ASSETS
  • sec^TOTAL-COSTS
  • sec^TOTAL-LIABILITY-AND-EQUITY
  • sec^TOTAL-REVENUES
  • sec^TYPE
  • sec^Table
  • sec^TotalDebtto3.00
  • sec^XF1
  • sec^ZIP
  • sec^c
  • sec^caption
  • sec^fn
  • sec^pAGE
  • sec^page
  • sec^pound-sterling
  • sec^s
  • sec^table
DATA SOURCE NTriples

Lawyer Run Screen Scrape


Unique elements in lawyerRun.parka
Subjects 1812
Objects 6927
Assertions
Predicates 8
  • 22-rdf-syntax-ns^type
  • running.rdf^age
  • running.rdf^chipTime
  • running.rdf^divisionPlace
  • running.rdf^gunTime
  • running.rdf^hometown
  • running.rdf^pace
  • running.rdf^runnerName
DATA SOURCE NTriples


This page modified Monday, 23-Sep-2002 08:39:34 EDT