08/11/11

Reconciliation Recap

@jonvoss asked what I’d been up to related to reconciling my data, so here’s a brief account of what I’ve done over the last few weeks. Much of this is proof-of-concept work that will result in recommendations about what IMLS DCC might have to do to move toward Linked Open Data in the future. There are probably more efficient ways to program these tasks, but for the moment I’m using some simple tools that work for me.

In my previous post, I shared an example collection-level record set as RDF. I’ve gone back and simplified this transformation to leave out the representations of institutions and projects. It turns out the URIs that are already present will resolve to a vCard RDF representation; e.g. http://imlsdcc.grainger.uiuc.edu/Registry/Institution/?1316 will return some XML. Maybe not the best representation, but we can work on that as a separate problem. Leaving these out has the benefit of making the CLD instances simpler. I also made a small change that will still associate a project with a funding agency (to demonstrate the contributions they’ve made).

Using the SIMILE Gadget tool, I’ve also extracted unique terms & frequency counts from the CLD records(1). These terms/frequencies are then imported into Google Refine and reconciled against appropriate LOD data:

Using Freebase has been pretty painless. When a column of terms is reconciled, Refine stores the Freebase ID. To get the Freebase URI, simply create a “New Column Based on This One” using the following GREL:

"http://rdf.freebase.com/ns/m/" + cell.recon.match.id

Using this Freebase URL I can replace the literal statement

<dcterms:spatial>Illinois (state) </dcterms:spatial>

with a linked data statement:

<dcterms:spatial rdf:resource="http://rdf.freebase.com/ns/m/03v0t" rdfs:label="Illinois (state)" />
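The same rewrite can be done outside Refine. Here is a minimal Python sketch, not the transformation actually used for the CLD records. The function names are made up for illustration, and it assumes the reconciled ID may arrive either as a bare "03v0t" or in the "/m/03v0t" form, normalizing both to the URI pattern shown above.

```python
from xml.sax.saxutils import quoteattr

# URI prefix taken from the GREL expression in this post.
FREEBASE_PREFIX = "http://rdf.freebase.com/ns/m/"


def freebase_uri(match_id):
    """Turn a reconciled Freebase ID into a URI.

    Assumption: the ID is either "03v0t" or "/m/03v0t"; both are accepted.
    """
    short = match_id.strip().strip("/")
    if short.startswith("m/"):
        short = short[2:]
    return FREEBASE_PREFIX + short


def spatial_statement(label, match_id):
    """Build a dcterms:spatial element that points at a Freebase topic."""
    # quoteattr escapes the value and wraps it in quotes for use as an attribute.
    return "<dcterms:spatial rdf:resource=%s rdfs:label=%s />" % (
        quoteattr(freebase_uri(match_id)),
        quoteattr(label.strip()),
    )


if __name__ == "__main__":
    print(spatial_statement("Illinois (state) ", "/m/03v0t"))
```

The original literal is kept as the rdfs:label, so nothing is lost if the match later turns out to be wrong.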

Reconciling against id.loc.gov has been more difficult. From my literal values I can build a query string that (sometimes) fetches the correct set of triples for a term. This works for most of our terms, though a few uncontrolled terms contributed by participants don’t match. An example query: http://api.talis.com/stores/lcsh-info/items?query=preflabel:photographs&max=1

The lookup is a little sensitive to plural versus singular forms, for example “scrapbooks” versus “scrapbook.” Most terms are plural, but there seems to be some distinction I don’t understand between Painting and Paintings.
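Building the query string, and retrying with the other grammatical number when a term misses, might look something like this in Python. It is only a sketch: the URL pattern is copied from the example query above, the function names are hypothetical, and the variant rule just adds or drops a trailing "s", so it will not cope with irregular plurals.

```python
from urllib.parse import urlencode

# Endpoint copied from the example query in this post.
LCSH_STORE = "http://api.talis.com/stores/lcsh-info/items"


def lcsh_query_url(term, max_results=1):
    """Build a preflabel query URL for a single term."""
    params = {"query": "preflabel:" + term.strip(), "max": max_results}
    # safe=":" keeps the colon in "preflabel:term" unescaped, as in the example.
    return LCSH_STORE + "?" + urlencode(params, safe=":")


def number_variants(term):
    """Return the term plus a naive singular/plural counterpart."""
    term = term.strip()
    other = term[:-1] if term.endswith("s") else term + "s"
    return [term, other]


if __name__ == "__main__":
    for variant in number_variants("scrapbook"):
        print(lcsh_query_url(variant))
```

Trying both forms would at least show which one the vocabulary prefers; it does not explain cases like Painting and Paintings, where both may exist with different meanings.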

In Refine I can pull back the RDF for these terms, but I am still working out how I might extract the canonical concept URI for each term. This looks like it will require parsing the RDF to get the right URI out of it. If anyone has a good cookbook for reconciling terms with id.loc.gov URIs, I’d love to see it. (Something using the Refine ReconciliationServiceAPI would be swell.)
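As a starting point for that parsing step, here is a rough Python sketch using only the standard library. It assumes the response is RDF/XML in which each concept is a skos:Concept node (or an rdf:Description typed as one) carrying an rdf:about URI and a skos:prefLabel. I have not confirmed that this is the shape the service returns, and the sample document and its identifier are placeholders rather than real data.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS = "http://www.w3.org/2004/02/skos/core#"

# Placeholder RDF/XML; the identifier below is invented for illustration.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept rdf:about="http://id.loc.gov/authorities/sh00000000#concept">
    <skos:prefLabel xml:lang="en">Photographs</skos:prefLabel>
  </skos:Concept>
</rdf:RDF>"""


def is_concept(node):
    """True for skos:Concept nodes, typed either by tag or by rdf:type."""
    if node.tag == "{%s}Concept" % SKOS:
        return True
    return any(
        t.get("{%s}resource" % RDF) == SKOS + "Concept"
        for t in node.findall("{%s}type" % RDF)
    )


def concept_uri(rdf_xml, label):
    """Return the rdf:about URI of the concept whose prefLabel matches label."""
    wanted = label.strip().lower()
    for node in ET.fromstring(rdf_xml).iter():
        if not is_concept(node):
            continue
        for pref in node.findall("{%s}prefLabel" % SKOS):
            if (pref.text or "").strip().lower() == wanted:
                return node.get("{%s}about" % RDF)
    return None  # no concept with that preferred label


if __name__ == "__main__":
    print(concept_uri(SAMPLE, "photographs"))
```

Matching on the preferred label, rather than taking the first URI in the document, avoids picking up broader or related concepts that may be included in the same response.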

I may give our subject headings a twirl, but I may need some subject cataloger help there. The published LC authorities include headings like “Cemeteries -- Recording” but not localized forms like “Cemeteries -- Recording -- Illinois.” Since these are all strings in a dc:subject, some way of parsing the subdivisions is needed.
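One way to approach the subdivision problem is to split each heading string on its separators and then try progressively shorter headings until one matches a published authority. The Python sketch below assumes subdivisions are delimited by a double hyphen, or by an en or em dash with spaces around it; real records may use other conventions, and the function names are my own.

```python
import re

# Double hyphen (with or without spaces), or a spaced en/em dash.
# Unspaced dashes are left alone so date ranges are not split.
SEPARATOR = re.compile(r"\s*--\s*|\s+[\u2013\u2014]\s+")


def split_heading(heading):
    """Split a subject heading string into its subdivisions."""
    return [part for part in SEPARATOR.split(heading.strip()) if part]


def candidate_headings(heading):
    """List the full heading, then shorter forms with trailing subdivisions dropped."""
    parts = split_heading(heading)
    return [" -- ".join(parts[:n]) for n in range(len(parts), 0, -1)]


if __name__ == "__main__":
    for candidate in candidate_headings("Cemeteries -- Recording -- Illinois"):
        print(candidate)
```

Each candidate could then be reconciled in turn, keeping whatever subdivisions were dropped (such as the place name) as separate values to reconcile on their own.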

Update: After posting this, I started playing with my subject headings and found that the LCSH triples were loaded into Freebase in May (http://www.freebase.com/view/topic/en/loc_subject_headings_full_load). Refine will pick them up if you create a “namespaced” reconciliation service and point it at the Library of Congress namespace. Now, let’s get those names & other vocabularies loaded!

(1) I’ve tried to replicate this in Google Refine, but on my computer it seems to choke on the complex XML record structure. It’s quite happy with large tabular representations though.