I’ve been working to transform some OAI Dublin Core I have into RDF and load it into a local Sesame repository. This has been immensely easier thanks to Jeni Tennison’s Getting Started with RDF and SPARQL Using Sesame and Python.
I acquired the records using another helpful tool, the SIMILE OAI-RDFizer. You can modify the XSLT transformers built into the tool pretty easily to produce whatever kind of output you’d like, but the default transformers take the OAI_DC XML and output RDF/XML.
One of the challenges of working with OAI (well, any linked data, I guess) is that you don’t always know how much is there when you start. To handle the open-ended nature of OAI-PMH, oai2rdf stores all the ListRecords files it gets back in a hashed directory structure. Jeni’s Python example assumed that you’d have a single file of RDF, not hundreds (thousands!). I added an os.walk loop to iterate through all the oai2rdf directories looking for the RDF files.
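Here is a minimal sketch of that directory walk on its own, written for Python 3. The function name `find_rdf_files` and the `.rdf` extension filter are my assumptions for illustration; adjust the extension to match whatever your oai2rdf run actually writes.

```python
import os


def find_rdf_files(root, extensions=(".rdf",)):
    """Yield the path of every file under root whose name ends with one of extensions."""
    extensions = tuple(extensions)
    for dirname, _dirnames, filenames in os.walk(root):
        # Sort so files within a directory come back in a predictable order.
        for filename in sorted(filenames):
            if filename.lower().endswith(extensions):
                yield os.path.join(dirname, filename)
```

Because it is a generator, nothing has to be held in memory even if the harvest produced thousands of files.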
Because Jeni’s example expected to load just a single file, it used the Sesame PUT method. After I got the os.walk working, I realized that PUT loads only the file at hand and replaces any data that was loaded previously. I changed this to use the POST method so each file is appended to the store.
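To make the PUT/POST difference concrete, here is a small Python 3 sketch that builds the statements URL and picks the HTTP method. The helper names `statements_url` and `upload_method` are hypothetical, and the base URL in the usage note is just my localhost setup.

```python
from urllib.parse import urlencode


def statements_url(base, repository, graph):
    """Build the statements endpoint for a repository, scoped to one named graph."""
    # Sesame expects the context as an N-Triples style URI, hence the angle brackets.
    params = {"context": "<" + graph + ">"}
    return "%s/repositories/%s/statements?%s" % (
        base.rstrip("/"), repository, urlencode(params))


def upload_method(replace_existing=False):
    """PUT replaces what is already in the graph; POST adds to it."""
    return "PUT" if replace_existing else "POST"
```

For example, `statements_url("http://localhost:8080/openrdf-sesame", "test", "http://ex.org/g")` gives the URL to send each file to, and `upload_method()` returns `"POST"` so that every file is appended.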
To keep Sesame from crashing, I did need to add a short delay at the end of the loop. I’m still playing around with a localhost version of Sesame and haven’t quite figured out how much I can throw at it on my laptop without causing Sesame to blow up. The localhost version I’m running has been fine for testing things out, but I’m thinking I’ll move this to an AWS instance once I iron out where this is going.
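The pause itself is nothing fancy. A rough Python 3 sketch of the idea, assuming a hypothetical `upload` callable that sends one file and returns its HTTP status:

```python
import time


def load_with_pause(paths, upload, delay=0.5):
    """Call upload(path) for each path, sleeping delay seconds after each one."""
    results = []
    for path in paths:
        status = upload(path)
        results.append((path, status))
        # Give the server a moment before the next file arrives.
        time.sleep(delay)
    return results
```

The half-second default is simply the value that worked on my laptop, not a tuned number.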
Full code after the jump.
# ###############
# oaipenSesame.py
# This script is designed to work with the output of the SIMILE OAI-RDFizer.
# http://simile.mit.edu/wiki/OAI-PMH_RDFizer
# Modified by: Richard J. Urban
# URL: http://www.richardurban.net
# Twitter: @musebrarian
# Based on Jeni Tennison's example
# http://www.jenitennison.com/blog/node/153
import urllib
import httplib2
import os
import string
import time
from datetime import datetime
import re
timestamp = datetime.now()
# Enter a working directory to start from.
wrkDir = raw_input("Enter directory path: ")
if len(wrkDir) != 0:
    wrkDir = string.rstrip(wrkDir)
else:
    wrkDir = 'DEFAULT' # Enter a DEFAULT working directory
os.chdir(wrkDir)
print("\n")
# Enter a Sesame Repository Name.
repository = raw_input("Enter Sesame repository name: ")
if len(repository) != 0:
    repository = string.rstrip(repository)
else:
    repository = 'DEFAULT'
print("\n")
graph = raw_input("Enter graph name: ")
if len(graph) != 0:
    graph = string.rstrip(graph)
else:
    # Enter a DEFAULT graph name. The timestamp distinguishes default graphs loaded at different times.
    graph = 'http://ex.org/default/' + timestamp.isoformat()
# os.chdir() above already moved into wrkDir, so walk the current directory;
# walking wrkDir again would fail if a relative path was entered.
for dirname, dirnames, filenames in os.walk(os.getcwd()):
    for filename in filenames:
        print("\n")
        filepath = os.path.join(dirname, filename)
        print "Loading %s into %s in Sesame" % (filepath, graph)
        # Sesame expects the context (graph) URI wrapped in angle brackets.
        params = { 'context': '<' + graph + '>' }
        endpoint = "http://localhost:8080/openrdf-sesame/repositories/%s/statements?%s" % (repository, urllib.urlencode(params))
        with open(filepath, 'rb') as rdf_file:
            data = rdf_file.read()
        # POST appends to the graph; PUT would replace what was loaded before.
        (response, content) = httplib2.Http().request(endpoint, 'POST', body=data, headers={ 'content-type': 'application/rdf+xml' })
        print "Response %s" % response.status
        print content
        # My Sesame can't seem to keep up. A short pause gives it a moment before the next file.
        time.sleep(.5)
