I’ve been working to transform some OAI Dublin Core I have into RDF and load it into a local Sesame repository. This has been immensely easier thanks to Jeni Tennison’s Getting Started with RDF and SPARQL Using Sesame and Python.
I acquired the records using another helpful tool, the SIMILE OAI-RDFizer. You can modify the XSLT transformers built into the tool pretty easily to produce whatever kind of output you’d like, but the default transformers take the OAI_DC XML and output RDF/XML.
One of the challenges of working with OAI (well, any linked data, I guess) is that you don’t always know how much is there when you start. To handle the open-ended nature of OAI-PMH, oai2rdf hashes a directory structure to store all the ListRecords files it gets back. Jeni’s Python example assumed that you’d have a single file of RDF, not hundreds (or thousands!). I added an os.walk loop to iterate through all the oai2rdf directories looking for the RDF files.
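The directory walk looks roughly like this. It's a minimal sketch, not the exact code from my script: the function name find_rdf_files is made up for illustration, it's written for Python 3, and it assumes the files oai2rdf writes end in .rdf, so adjust the extension if yours differ.

```python
import os


def find_rdf_files(root, extension=".rdf"):
    """Yield the path of every file under root whose name ends with extension.

    The ".rdf" default is an assumption about how the harvested files are named.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        # Sorting in place makes os.walk visit subdirectories in a stable order.
        dirnames.sort()
        for name in sorted(filenames):
            if name.lower().endswith(extension):
                yield os.path.join(dirpath, name)
```

Because it's a generator, you can start loading files right away without first building a list of thousands of paths.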
Because Jeni’s example expected to load just a single file, it used the Sesame PUT method. After I got the os.walk loop working, I realized that PUT loads only the file at hand and replaces any data that was loaded before it, so I ended up with just the last file in the store. I changed this to use the POST method so each file is appended to the store.
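Here's a rough sketch of the PUT versus POST difference using Python 3's urllib.request. The repository URL is an assumption (a default local Sesame install with a repository I've called "oai"), and the function names are only for illustration, so swap in your own.

```python
import urllib.request

# Assumed location of the repository's statements endpoint on a local Sesame
# install; "oai" is a placeholder repository ID.
SESAME_STATEMENTS_URL = (
    "http://localhost:8080/openrdf-sesame/repositories/oai/statements"
)


def build_upload_request(rdf_bytes, url=SESAME_STATEMENTS_URL, append=True):
    """Build (but do not send) a request that uploads RDF/XML to the store.

    POST adds the statements to what is already there; PUT replaces the
    repository's existing statements with the uploaded ones.
    """
    method = "POST" if append else "PUT"
    return urllib.request.Request(
        url,
        data=rdf_bytes,
        headers={"Content-Type": "application/rdf+xml"},
        method=method,
    )


def upload_file(path, url=SESAME_STATEMENTS_URL, append=True):
    """Send one RDF/XML file to the store and return the HTTP status code."""
    with open(path, "rb") as handle:
        request = build_upload_request(handle.read(), url, append)
    with urllib.request.urlopen(request) as response:
        return response.status
```

Keeping the request building separate from the sending makes it easy to check which method you're about to use before anything touches the store.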
In order to prevent crashing Sesame, I did need to add a short delay at the end of the loop. I’m still playing around with a localhost version of Sesame and haven’t quite figured out how much I can throw at it on my laptop without causing Sesame to blow up. The localhost version I’m running has been fine for testing things out, but I’m thinking I’ll move this to an AWS instance once I iron out where this is going.
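The pause itself is nothing fancy. Something along these lines would do it, with the caveat that the half-second default is a guess rather than a tuned value, and load_with_delay and its parameters are names I've invented for this sketch.

```python
import time


def load_with_delay(paths, upload, delay_seconds=0.5, sleep=time.sleep):
    """Call upload(path) for each path, pausing after every upload.

    upload is whatever function sends one file to the store. delay_seconds is
    an untuned guess, and sleep is injectable so the loop can be tested
    without actually waiting.
    """
    loaded = 0
    for path in paths:
        upload(path)
        loaded += 1
        # Give the server a moment to finish before the next file arrives.
        sleep(delay_seconds)
    return loaded
```

Raising or lowering delay_seconds is the one knob to turn while figuring out how much a given machine can take.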
Full code after the jump.