@jonvoss asked what I’d been up to related to reconciling my data, so here’s a brief account of what I’ve done over the last few weeks. Much of this is proof-of-concept that will result in recommendations about what IMLS DCC might have to do to move towards Linked Open Data in the future. There are probably more efficient ways to program these tasks, but for the moment I’m using some simple tools that work for me.
In my previous post, I shared an example collection-level record set as RDF. I’ve gone back and simplified this transformation to leave out the representations of institutions and projects. It turns out the URIs that are present will resolve to a vCard RDF representation, e.g. http://imlsdcc.grainger.uiuc.edu/Registry/Institution/?1316 will return some XML. Maybe not the best representation, but we can work on that as a separate problem. This has the benefit of making the CLD instances simpler. I made a small change that will still associate a project with a funding agency (to demonstrate the contributions they’ve made).
Using the SIMILE Gadget tool, I’ve also extracted unique terms & frequency counts from the CLD records(1). These terms/frequencies are then imported into Google Refine and reconciled against appropriate LOD data:
- cld:itemType: TALIS repository of id.loc.gov for Thesaurus of Graphic Materials terms.
- cld:itemFormat: Freebase Media Type scheme (http://www.freebase.com/schema/type/media_type)
- dcterms:spatial: (http://www.freebase.com/view/location)
Using Freebase has been pretty painless. When a column of terms is reconciled, Refine stores the Freebase ID. To get the Freebase URI, simply create a “New Column Based on This One” using the following GREL
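For anyone doing this step outside Refine, here is a rough Python sketch of the same idea: take the reconciled Freebase ID and turn it into a URL. The ID shape (`/en/photograph`) and the base URL are assumptions on my part, modelled on the Freebase links elsewhere in this post, and `freebase_url` is a made-up helper name.

```python
# Assumption: reconciled Freebase IDs look like "/en/photograph" and the
# topic URL is simply the site base plus that ID. Both are illustrative.
FREEBASE_VIEW_BASE = "http://www.freebase.com/view"


def freebase_url(freebase_id: str) -> str:
    """Build a Freebase topic URL from a reconciled ID such as '/en/photograph'."""
    freebase_id = freebase_id.strip()
    # Tolerate IDs stored without the leading slash.
    if not freebase_id.startswith("/"):
        freebase_id = "/" + freebase_id
    return FREEBASE_VIEW_BASE + freebase_id
```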
Using this Freebase URL, I can replace the literal statement
with a linked data statement:
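To make the literal-to-link swap concrete, here is a minimal Python sketch that writes a statement as N-Triples, using a URI object when the term was reconciled and falling back to the literal when it was not. The subject URI, the sample term and the Freebase URL are invented for illustration; only dcterms:spatial comes from the records described above.

```python
def literal_triple(subject: str, predicate: str, value: str) -> str:
    """Serialize a statement whose object is a plain string literal."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'<{subject}> <{predicate}> "{escaped}" .'


def linked_triple(subject: str, predicate: str, object_uri: str) -> str:
    """Serialize a statement whose object is a URI (a linked data statement)."""
    return f"<{subject}> <{predicate}> <{object_uri}> ."


def link_statement(subject: str, predicate: str, value: str, reconciled: dict) -> str:
    """Prefer the reconciled URI for a term; keep the literal if there is none."""
    uri = reconciled.get(value)
    if uri:
        return linked_triple(subject, predicate, uri)
    return literal_triple(subject, predicate, value)


# Hypothetical data: one reconciled place name.
RECONCILED = {"Illinois": "http://www.freebase.com/view/en/illinois"}
COLLECTION = "http://example.org/collection/1"
SPATIAL = "http://purl.org/dc/terms/spatial"

print(link_statement(COLLECTION, SPATIAL, "Illinois", RECONCILED))
print(link_statement(COLLECTION, SPATIAL, "Somewhere else", RECONCILED))
```

Terms that fail to reconcile stay as literals, so nothing is lost while the vocabulary is still being cleaned up.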
Reconciling against id.loc.gov has been more difficult. From my literal values I can create a query string that (sometimes) fetches the correct set of triples for a term, e.g. http://api.talis.com/stores/lcsh-info/items?query=preflabel:photographs&max=1. This works for most of our terms, though a few uncontrolled terms contributed by participants don’t match.
It is a little sensitive to plural versus singular terms, for example “scrapbooks” versus “scrapbook.” Most terms are plural, but there seems to be some distinction I don’t understand between Painting and Paintings.
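The query string described above can be built from a literal value with a few lines of Python. This is only a sketch: the endpoint and the `preflabel:` and `max` parameters are copied from the example URL, while lowercasing the term is my assumption based on that one example. It does nothing about the singular/plural problem, so “scrapbook” and “scrapbooks” still produce different queries.

```python
from urllib.parse import urlencode

# Endpoint taken from the example URL in this post.
TALIS_ITEMS = "http://api.talis.com/stores/lcsh-info/items"


def talis_query_url(term: str, max_results: int = 1) -> str:
    """Build a preflabel query URL for one literal term."""
    # Assumption: the store matches lowercase labels, as in the example.
    label = term.strip().lower()
    params = {"query": f"preflabel:{label}", "max": max_results}
    # Keep the colon unescaped so the URL matches the form shown above.
    return TALIS_ITEMS + "?" + urlencode(params, safe=":")


print(talis_query_url("Photographs"))
```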
In Refine I can pull back the RDF for these terms, but am still working out how I might extract the canonical concept URI for each term. This looks like it will require parsing the RDF to get the right URI out of it. If anyone has a good cookbook for reconciling terms with id.loc.gov URIs, I’d love to see it (something using the Refine ReconciliationServiceAPI would be swell).
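Here is one way the URI extraction might look in Python, assuming the response is RDF/XML in which each term is a `skos:Concept` element with an `rdf:about` attribute and a `skos:prefLabel`. That structure, the sample document and the example.org URI are assumptions for illustration, not what any particular service is confirmed to return. A real RDF library would cope with more serializations than this element-matching approach does.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS_NS = "http://www.w3.org/2004/02/skos/core#"


def concept_uri(rdf_xml: str, label: str):
    """Return the rdf:about URI of the skos:Concept whose prefLabel matches label."""
    root = ET.fromstring(rdf_xml)
    wanted = label.strip().lower()
    for concept in root.iter(f"{{{SKOS_NS}}}Concept"):
        for pref in concept.findall(f"{{{SKOS_NS}}}prefLabel"):
            if (pref.text or "").strip().lower() == wanted:
                return concept.get(f"{{{RDF_NS}}}about")
    return None  # no concept with that preferred label


# Hypothetical response body, for illustration only.
SAMPLE = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept rdf:about="http://example.org/concept/photographs">
    <skos:prefLabel xml:lang="en">Photographs</skos:prefLabel>
  </skos:Concept>
</rdf:RDF>"""

print(concept_uri(SAMPLE, "photographs"))
```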
I may give our subject headings a twirl, but I may need some subject cataloger help there. The published LC authorities include headings like “Cemeteries – Recording” but not localized forms like “Cemeteries – Recording — Illinois.” Since these are all strings in a dc:subject, some way of parsing the subdivisions is needed.
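A simple way to parse the subdivisions is to split each heading on its dash separators and then try progressively shorter headings until one matches a published authority. The Python sketch below assumes subdivisions are separated by a double hyphen, an en dash or an em dash; the function names and the `--` joiner are my own choices. It would wrongly split a date range written with an en dash, so treat it as a starting point.

```python
import re

# Subdivision separators: "--", en dash (U+2013) or em dash (U+2014),
# with optional surrounding whitespace. Single hyphens are left alone.
_SEPARATOR = re.compile(r"\s*(?:--|[\u2013\u2014])\s*")


def split_heading(heading: str) -> list:
    """Split a subject heading string into its main heading and subdivisions."""
    return [part for part in _SEPARATOR.split(heading.strip()) if part]


def candidate_headings(heading: str, joiner: str = "--") -> list:
    """List the full heading first, then each shorter prefix, for lookup."""
    parts = split_heading(heading)
    return [joiner.join(parts[:n]) for n in range(len(parts), 0, -1)]


print(candidate_headings("Cemeteries \u2013 Recording \u2014 Illinois"))
```

Trying the candidates in order means “Cemeteries – Recording — Illinois” would fall back to “Cemeteries – Recording”, the form that is actually published, with the geographic subdivision left over to reconcile separately.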
Update: After posting this, I started playing with my subject headings and found that the LCSH triples were loaded into Freebase in May (http://www.freebase.com/view/topic/en/loc_subject_headings_full_load). The Refine reconciliation service will pick them up if you create a “namespaced” reconciliation service (point it at the Library of Congress namespace). Now, let’s get those names & other vocabularies loaded!
(1) I’ve tried to replicate this in Google Refine, but on my computer it seems to choke on the complex XML record structure. It’s quite happy with large tabular representations though.

Did you mean LC Subject Headings or, as you wrote, Thesaurus of Graphic Materials terms? I’m not 100% certain, but I think the Talis API serves LCSH.
Regardless, ID has the Known-label service.
You’d have to use a little content-negotiation (see again the ID help pages), but URLs, such as the ones below, would get you to the TGM URIs for the given labels:
http://id.loc.gov/vocabulary/graphicMaterials/label/Photographs
http://id.loc.gov/vocabulary/graphicMaterials/label/Paintings
http://id.loc.gov/vocabulary/graphicMaterials/label/Painting
The same service exists for LCSH:
http://id.loc.gov/vocabulary/subjects/label/Photographs
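As a sketch of what a content-negotiated request to these label URLs could look like in Python: the block below only builds the request, with an `Accept` header asking for RDF/XML. The media type, the `known_label_request` name and the scheme keys are assumptions; the ID help pages are the place to check which formats are actually supported.

```python
from urllib.parse import quote
from urllib.request import Request

# Label URL prefixes copied from the examples above.
LABEL_BASES = {
    "tgm": "http://id.loc.gov/vocabulary/graphicMaterials/label/",
    "lcsh": "http://id.loc.gov/vocabulary/subjects/label/",
}


def known_label_request(label: str, scheme: str = "tgm",
                        accept: str = "application/rdf+xml") -> Request:
    """Build (but do not send) a known-label request with an Accept header."""
    url = LABEL_BASES[scheme] + quote(label.strip())
    # Assumption: the service honours this media type via content negotiation.
    return Request(url, headers={"Accept": accept})


req = known_label_request("Photographs")
print(req.full_url, req.get_header("Accept"))
```

Passing the request to `urllib.request.urlopen` would send it; what comes back (a redirect to the term URI, or the representation itself) is something to verify against the live service.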
BTW, about the difference between the similar TGM terms “Painting” and “Paintings”: “Painting” (without the ‘s’) refers to the act or trade; “Paintings” (with the ‘s’) references the objects created through the act of “Painting.”
Hi Kevin!
I was working with both TGM terms which appear in the cld:itemType elements in our records, and later with LCSH subject terms which are in dc:subject (or dcterms:LCSH).
wrt content negotiation, I see the links to other formats on the HTML page that is returned by the known-label service. Is there a way to force it to return those other formats using content negotiation?
Talis does serve LCSH too, but the nice feature of the Freebase mirror is how it easily reconciles in Google Refine (at least for a n00b like me). With TALIS or the LC provided data, I’d still need to pick through the representation to get the URI. The Freebase reconciliation service can return that using the same GREL function noted above.
Thanks for the clarification on terms. Part of the confusion is also in the use of either singular or plural variants in the metadata; I couldn’t tell if it was a typo or a legitimate distinction.
p.s. the full results of this work are now posted as part of our DPLA sprint site.
http://dpla.grainger.illinois.edu/resources/dlf-dcc_collections.rdf
Just stumbled across this. Thanks for writing up your results.
You mentioned problems with a “CLD” XML file. Have you reported this to the Google Refine team? If you can provide an example of one of the problematic files, we can take a look at it. I fixed a couple of XML related bugs for the upcoming release, so it’s possible it’ll work now, but it’d be good to check.
Thanks Tom,
I did not report the problems I experienced with XML files – I guess I chalked it up to their complexity. But I also haven’t looked at this since the last update to Google Refine. I’ll take a look at the latest release in the next few months to see if it handles this type of XML better.