09/21/12

Publishing and Using Linked Data at DHWI

In January I will be conducting a week-long workshop on Publishing and Using Linked Data as part of the Digital Humanities Winter Institute at the Maryland Institute for Technology in the Humanities.  Space is still available, so register today!

The publication of structured knowledge representations and open data on the Web opens new possibilities for collaboration among humanities researchers and cultural heritage organizations. This course will introduce participants to the core principles of Linked Open Data (LOD), techniques for building and understanding LOD models, how to locate LOD sources for research, tools for manipulating, visualizing, and integrating available data, and best practice methodologies for publicizing and sharing datasets.

For this course I will be drawing from initial work done by the Learning Linked Data project at the University of Washington iSchool, which has laid out a core inventory of learning topics.  The LOD for Libraries, Archives, and Museums community has also been actively promoting access to increasing amounts of cultural heritage information via Linked Data approaches.  Some of the questions we’ll be exploring in the workshop are:

  • What does the digital humanities community need from linked data?
  • What use can we make of these large data sets?
  • How can we synchronize scholarly work with the larger linked data community?

To help gain momentum for the workshop, I’ve created a wiki called Linked Data for Humanities, where I will be sharing drafts of the syllabus, resources, and example humanities projects. (A big hat tip to Mia Ridge and the Museums and the Machine Processable Web wiki, which has been an important resource for the LODLAM community.) If you have a humanities-based Linked Data project, questions, comments, or recommendations for things the course should cover, please join in the conversation.

08/11/11

Reconciliation Recap

@jonvoss asked what I’d been up to related to reconciling my data, so here’s a brief account of what I’ve done over the last few weeks.   Much of this is proof-of-concept that will result in recommendations about what IMLS DCC might have to do to move towards Linked Open Data in the future. There are probably more efficient ways to program these tasks, but for the moment I’m using some simple tools that work for me.

In my previous post, I shared an example collection-level record set as RDF. I’ve gone back and simplified this transformation to leave out the representations of institutions and projects. It turns out the URIs that are present will resolve to a vCard RDF representation, e.g. http://imlsdcc.grainger.uiuc.edu/Registry/Institution/?1316 will return some XML. Maybe not the best representation, but we can work on that as a separate problem. This has the benefit of making the CLD instances simpler. I made a small change that will still associate a project with a funding agency (to demonstrate the contributions they’ve made).

Using the SIMILE Gadget tool, I’ve also extracted unique terms & frequency counts from the CLD records(1). These terms/frequencies are then imported into Google Refine and reconciled against appropriate LOD data:

Using Freebase has been pretty painless.  When a column of terms is reconciled,  Refine stores the Freebase ID.   To get the Freebase URI,  simply create a “New Column Based on This One” using the following GREL

"http://rdf.freebase.com/ns/m/"+cell.recon.match.id

Using this Freebase URL I can replace the literal statement

<dcterms:spatial>Illinois (state) </dcterms:spatial>

with a linked data statement:

<dcterms:spatial rdf:resource="http://rdf.freebase.com/ns/m/03v0t" rdfs:label="Illinois (state)" />
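The same substitution can be scripted outside Refine. What follows is only a minimal sketch using Python's standard library: the RECONCILED lookup table and the link_spatial helper are hypothetical names of my own, and the only mapping in it is the Illinois example from this post.

```python
import xml.etree.ElementTree as ET

DCTERMS = "http://purl.org/dc/terms/"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"

# Hypothetical lookup table: literal value -> reconciled Freebase URI.
RECONCILED = {"Illinois (state)": "http://rdf.freebase.com/ns/m/03v0t"}


def link_spatial(element, lookup):
    """Turn a literal dcterms:spatial element into a resource reference, in place."""
    label = (element.text or "").strip()
    uri = lookup.get(label)
    if uri is None:
        return False  # leave unreconciled literals untouched
    element.text = None
    element.set(f"{{{RDF}}}resource", uri)
    element.set(f"{{{RDFS}}}label", label)
    return True


spatial = ET.fromstring(
    f'<dcterms:spatial xmlns:dcterms="{DCTERMS}">Illinois (state) </dcterms:spatial>'
)
changed = link_spatial(spatial, RECONCILED)
print(ET.tostring(spatial, encoding="unicode"))
```

Terms that are missing from the lookup stay as plain literals, so a partially reconciled record set is still usable.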

Reconciling against id.loc.gov has been more difficult. From my literal values I can create a query string that (sometimes) fetches the correct set of triples for a term. This works for most of our terms, though participants have contributed a few uncontrolled terms that don’t match. e.g. http://api.talis.com/stores/lcsh-info/items?query=preflabel:photographs&max=1

It is a little sensitive to plural/singular terms. For example the difference between “scrapbooks” and “scrapbook.” Most terms are plural, but there seems to be some distinction I don’t understand between Painting and Paintings.
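For reference, the query URL above can be assembled from a literal value with a few lines of Python. This is a sketch, not a client for the service: the endpoint and the preflabel and max parameters are copied from the example URL, while the preflabel_query function name is my own. Note that urlencode percent-encodes the colon, which decodes back to the same query.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Endpoint copied from the example URL in this post.
BASE = "http://api.talis.com/stores/lcsh-info/items"


def preflabel_query(term, max_results=1):
    """Build a query URL that searches the preflabel field for one term."""
    params = {"query": f"preflabel:{term.strip()}", "max": max_results}
    return f"{BASE}?{urlencode(params)}"


url = preflabel_query("photographs")
# Decode the query string again to confirm the parameters survive encoding.
decoded = parse_qs(urlsplit(url).query)
print(url)
print(decoded)
```

Since the service is sensitive to singular and plural forms, it may be worth building one URL per variant and checking which one returns triples.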

In Refine I can pull back the RDF for these terms, but I am still working out how I might extract the canonical concept URI for each term. This looks like it will require parsing the RDF to get the right URI out of it. If anyone has a good cookbook for reconciling terms with id.loc.gov URIs, I’d love to see it. (Something using the Refine ReconciliationServiceAPI would be swell.)
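As a starting point for that parsing step, here is a rough sketch with Python's standard library. It rests on assumptions I have not checked against the real responses: that the RDF/XML uses skos:Concept elements with an rdf:about attribute and skos:prefLabel children. The SAMPLE document and its example.org URIs are invented for illustration.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS = "http://www.w3.org/2004/02/skos/core#"

# Invented sample; real responses may be shaped differently.
SAMPLE = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:skos="{SKOS}">
  <skos:Concept rdf:about="http://example.org/subjects/sh-0001#concept">
    <skos:prefLabel xml:lang="en">Photographs</skos:prefLabel>
  </skos:Concept>
  <skos:Concept rdf:about="http://example.org/subjects/sh-0002#concept">
    <skos:prefLabel xml:lang="en">Scrapbooks</skos:prefLabel>
  </skos:Concept>
</rdf:RDF>"""


def concept_uri(rdf_xml, term):
    """Return the rdf:about URI of the concept whose prefLabel matches term."""
    root = ET.fromstring(rdf_xml)
    wanted = term.strip().casefold()
    for concept in root.iter(f"{{{SKOS}}}Concept"):
        for label in concept.findall(f"{{{SKOS}}}prefLabel"):
            if (label.text or "").strip().casefold() == wanted:
                return concept.get(f"{{{RDF}}}about")
    return None  # no exact (case-insensitive) label match


print(concept_uri(SAMPLE, "scrapbooks"))
```

The match is exact apart from case, so a singular term will not find a plural heading; that mirrors the sensitivity I ran into with the query.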

I may give our subject headings a twirl, but I may need some subject cataloger help there.  The published LC authorities include headings like “Cemeteries – Recording” but not localized forms “Cemeteries – Recording — Illinois”  Since these are all strings in a dc:subject, some way of parsing the subdivisions is needed.
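One way to approach the subdivision problem is to split each heading on its dash separators and then try progressively shorter forms until one matches a published heading. The sketch below assumes the separators are an en dash, an em dash, or a double hyphen, and it rejoins parts with " -- "; both choices are my assumptions, and the function names are hypothetical.

```python
import re

# Assumed separators: double hyphen, en dash (U+2013), or em dash (U+2014).
SEPARATOR = re.compile(r"\s*(?:--|\u2013|\u2014)\s*")


def split_heading(heading):
    """Split a subject string into its main heading and subdivisions."""
    return [part for part in SEPARATOR.split(heading.strip()) if part]


def candidate_headings(heading):
    """List progressively shorter headings, longest first, for lookup."""
    parts = split_heading(heading)
    return [" -- ".join(parts[:n]) for n in range(len(parts), 0, -1)]


for candidate in candidate_headings("Cemeteries \u2013 Recording \u2014 Illinois"):
    print(candidate)
```

A single hyphen inside a word is left alone, so hyphenated terms are not split by mistake.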

Update: After posting this, I started playing with my subject headings and found that the LCSH triples were loaded into Freebase in May (http://www.freebase.com/view/topic/en/loc_subject_headings_full_load). The Refine reconciliation service will pick them up if you create a “namespaced” reconciliation service (point it at the Library of Congress Namespace). Now, let’s get those names & other vocabularies loaded!

(1) I’ve tried to replicate this in Google Refine, but on my computer it seems to choke on the complex XML record structure. It’s quite happy with large tabular representations though.

07/4/11

Piloting collection-level LOD for IMLS DCC

On Metadata
Since the #lodlam conference, I haven’t had much chance to play around with my shipyard LOD — the dissertation calls. Plus I’m spending about half my time this summer as part of the team working on the CLIR/DLF/IMLS DCC Beta Sprint for the Digital Public Library of America (DPLA).

What follows is a bit of skunkworkery that I’m doing for self-edification & also to help suggest ways we can make IMLS DCC data more LOD friendly.**  Currently people can browse the site at http://imlsdcc.grainger.illinois.edu/history or as XML via OAI-PMH for collection-level and item-level metadata.   As part of the Collection/Item Metadata Working Group (CIMR),  I helped build an RDF testbed that was oriented towards our research problems.

Using some of the stylesheets developed for CIMR, I’ve generated LOD representations for the currently available collection-level records.  When the rubber hits the road like this, there are lots of design choices you can make – in terms of encodings,  which vocabularies to use, etc., etc.  Here is a sample set of records and the XSLT used to generate them from the OAI-PMH, imlsdcc listRecords format.  Some questions:

  • this looks rather complicated.  Maybe that’s OK, as it seems to represent much of the information currently shared publicly by the project. I’d welcome any suggestions for simplifications or better approaches to representing this as LOD.
  • are there best practices for representing organizations as organizations?  FOAF/vCard seem very oriented towards people (who have associations with an organization).  I also picked up the Organization Ontology from Describing Libraries, Their Collections and Services in RDF.
  • Many of the URIs here are just made up for demonstration purposes.
  • There are lots of organizations we have minimal information for. It would be nice to reconcile our URIs with other published URIs for these institutions. What would be the most authoritative source for that LOD?
  • Many organizations aren’t publishing their own “authorized” graphs for themselves.  Is this something a project like IMLS DCC should consider?   I added a stub description of IMLS DCC to this file to demonstrate the relationships between the project and the aggregated collections.
  • Right now this RDF mostly contains the strings found in the original XML.  I would like to reconcile controlled terms where possible to existing LOD vocabularies (like id.loc.gov,  language terms, formats, etc.).  I think that would make this data more “linked.”
  • In theory the XSLT above should still work with the SIMILE OAI-PMH RDFizer

Thanks if you have a chance to take a look and offer comments on this.  And do let me know if you’d like to see more of this kind of data!

** Disclaimer: this is some work I’m doing on the side, on my own. Neither the rdf nor the XSLT should be considered an “official” release by the project. Any mistakes here are mine.

05/17/11

On the ways (Part II)

"Galen L. Stone" Interior view, ribbing of tug under construction.  Delaware Public Archives

Tonight I decided to go back to my large table of >1,000 ships and continue doing some clean-up.   However, instead of trying to edit the values I had, I gave the Google Refine/Freebase reconciliation service a try.  Boy-howdy, I really should have taken @jonvoss’s advice and done this sooner.   I pretty quickly whipped through my Ship Type column and matched to the /boats/ship/ship_type vocabulary in Freebase.   Like any classification task, I think some of the subtleties of my data get lost. For the moment I think that’s OK, but if you care about the difference between a sidewheel paddle steamer and a sternwheel paddle steamer, they’ve been lumped together under the same class.

The reconciliation tool made quick work of matching to the companies that make up the majority of the shipyards in my database, which are mostly the late 19th and 20th century yards. There’s not much of a record about individual vessels from the earlier yards. In the few cases where there wasn’t a match, I asked Freebase to create a new topic. (I’ll go back later and see how this populates Freebase itself.) I also did this for the Owners column, which was able to match a smaller number of organizations and people. I don’t know whether I’m being a little cavalier about making new topics on Freebase, but this seems like the easiest thing to do. (I should probably be keeping better notes about what new things I’m creating – it would be nice to get some sort of report/e-mail with all those things listed.) The latter part has been slow going due to a bug in Refine that takes you back to the first row after reconciling a row that may be deep in your data. It helps to select the (none) facet, which removes rows for which judgements have been assigned, and use additional facets to narrow things down. While I’ve cut this list of owners down significantly, I’m still looking at a long tail of about 400 unmatched entities. (Many are individuals whose first names are abbreviated – with a little googling I can find many of them and expand the names.)

A resource that is proving useful for double-checking my work is Shipbuilding History, which includes lists of vessels from the Wilmington yards. Tim seems to have collected some information I’m missing, so I’m thinking about the best way to reconcile his information with mine.  There are other lists of vessels that are currently not linked data, but are large tables on the web.  Perhaps a screenscraper might make quick work of turning those into linked data graphs that can be merged with my graphs.
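To give a sense of what such a screenscraper might involve, here is a rough sketch that pulls the rows out of an HTML table with Python's standard library. It only gets as far as one dictionary per row, keyed by the header cells; turning those into graphs is a separate step. The SAMPLE markup is invented (the two vessel names are the tugs pictured in these posts), and the class and function names are my own.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every table row as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            # Collapse internal whitespace so cells compare cleanly.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_to_records(html):
    """Treat the first row as the header and return one dict per later row."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Invented sample markup for illustration.
SAMPLE = """<table>
<tr><th>Name</th><th>Type</th></tr>
<tr><td>Neptune</td><td>Tug</td></tr>
<tr><td>Galen L. Stone</td><td>Tug</td></tr>
</table>"""
records = table_to_records(SAMPLE)
print(records)
```

Real pages are rarely this tidy (nested tables, merged cells), so I would expect to adjust this for each source.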

But I think I’ve hit my limit for tonight. (I’ve been grading all day, so more than 14 hrs of staring at a screen is probably enough – time to hit my bunk).

On the Ways (Part I)

05/9/11

On the ways

Tugboat "Neptune" under construction at Jackson and Sharp. Jackson and Sharp Collection. Courtesy Delaware Public Archives

This will be a short post this week as I’ve decided to use my “study night” to dip my toes into LaTeX tonight (exploring actual dissertation production workflows, weee!).

ways (n.): df. structure consisting of a sloping way down to the water from the place where ships are built or repaired

After creating records for various agents involved in my data, there remains the data about the vessels themselves. Again sticking to a pragmatic exploration, I’ll be using the Freebase Ship schema for this data. Thankfully many of the properties listed here are properties that I already have in my database. I’m still working with Google Refine to clean up my data, but here is how the properties will map.

My Data    | Freebase Property
ShipName   | type.object.name
HullNo     | I don’t see a Freebase property for this field (e.g. DE 107 and/or a particular ID assigned by a shipyard; see the Hagley photos for examples at P&J)
ShipType   | boats/ship_class/ship_type
HullType   | (boats/hull_configuration?) there are no instances of this property in Freebase
ShipPower  | boats/ship/means_of_propulsion
LOA        | boats/ship/length_overall
Beam       | boats/ship/beam
Displace   | boats/ship/displacement
Tonnage    | (same as displacement?) hmm..
Draft      | boats/ship/draught
LaunchDate | boat/ship/launched
Fate       | boat/boat_fate
Yard       | boats/ship/ship_builder
Designer   | boats/ship/designer
Owner      | boats/ship/owners
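
The mapping above can be written down as a simple lookup so that each row of my table is renamed mechanically. This is a sketch only: the property paths are copied as they appear in the mapping, the columns I have no confirmed property for (HullNo, HullType, Tonnage) are deliberately left out, and the sample row values are made up.

```python
# Column-to-property mapping copied from the list above. Columns without a
# confirmed Freebase property (HullNo, HullType, Tonnage) are left out.
COLUMN_TO_PROPERTY = {
    "ShipName": "type.object.name",
    "ShipType": "boats/ship_class/ship_type",
    "ShipPower": "boats/ship/means_of_propulsion",
    "LOA": "boats/ship/length_overall",
    "Beam": "boats/ship/beam",
    "Displace": "boats/ship/displacement",
    "Draft": "boats/ship/draught",
    "LaunchDate": "boat/ship/launched",
    "Fate": "boat/boat_fate",
    "Yard": "boats/ship/ship_builder",
    "Designer": "boats/ship/designer",
    "Owner": "boats/ship/owners",
}


def map_row(row):
    """Split one table row into mapped properties and leftover columns."""
    mapped, unmapped = {}, {}
    for column, value in row.items():
        if value in (None, ""):
            continue  # skip empty cells
        target = COLUMN_TO_PROPERTY.get(column)
        if target:
            mapped[target] = value
        else:
            unmapped[column] = value
    return mapped, unmapped


# Hypothetical sample row; the values are made up for illustration.
mapped, unmapped = map_row(
    {"ShipName": "Neptune", "ShipType": "Tugboat", "HullNo": "", "Tonnage": "100"}
)
print(mapped)
print(unmapped)
```

Keeping the leftover columns in a separate dictionary makes it easy to see what still needs a home once the schema questions are sorted out.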

There seem to be some properties that are part of different but related Freebase schemas (e.g. boats/ship/displacement and boats/ship_class/displacement_tons) that I need to sort out. There are also other Freebase properties that don’t map directly to columns in my table (e.g. notableFor) but may be useful for adding some of the stories around vessels found in my book, in a comments field with general notes, and in the banker’s box of research notes that haven’t seen the light of day.

I see another chicken/egg problem looming.  While I have most of the shipyards in my previously created corporate.rdf file and individual shipwrights will follow in people.rdf, I also need to add the individuals/corporations who are owners/agents.  In an early post @jonvoss pointed me towards a Google Refine plugin to rectify linked data.  This seems like a good place to try and deploy this at scale. (hand-crafting a few records for the yards wasn’t too hard,  but there are hundreds of owners/buyers, etc.).  I remain a little skeptical that this will work well.  From what I’ve seen so far of Freebase,  big popular things are represented  but things in the long tail are not. (there’s a research question in there somewhere).

I am also considering using an opaque URI for these vessels, perhaps based on an auto-generated ID number.  The ship names in my database overlap quite a bit, making using names directly in URIs a little dangerous (the same can probably be said for people/corporate names at a global scale – not so much for my limited set).   It may also be possible to use a combination of fields to generate opaque URIs (for example, see Styles, Ayers & Shabir (2008) Semantic MARC, MARC21 and the Semantic Web. LDOW2008.)
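
Both options can be sketched in a few lines of Python. The base URI is a placeholder, the zero-padded ID format and the 12-character SHA-1 digest are arbitrary choices of mine, and the field values in the example are hypothetical. The second function is only an illustration of hashing a combination of fields; it is not the method from the paper cited above.

```python
import hashlib

BASE = "http://example.org/resource/vessel/"  # placeholder base URI


def opaque_uri_from_id(record_id):
    """Mint a URI from an auto-generated record number."""
    return f"{BASE}{int(record_id):06d}"


def opaque_uri_from_fields(*fields):
    """Mint a URI from a hash of several normalized field values."""
    normalized = "|".join(str(field).strip().casefold() for field in fields)
    digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()[:12]
    return f"{BASE}{digest}"


# Hypothetical field values: same ship name, different builder.
first = opaque_uri_from_fields("Neptune", "Jackson and Sharp", "1900")
second = opaque_uri_from_fields("Neptune", "Pusey and Jones", "1900")
print(opaque_uri_from_id(42))
print(first)
print(second)
```

Two vessels that share a name get different URIs as long as another field differs, which is the point of combining fields. The drawback is that correcting any of those fields later would change the hash, so the URI should be minted once and stored.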

05/2/11

American Clyde


The concentration of shipyards that stretched along the Delaware River (from Wilmington north through Philadelphia, PA, Camden and Trenton, NJ) earned it the nickname “American Clyde”  after the  River Clyde in Scotland.   (read a contemporary account “American Clyde” from Harpers 1878).

Early shipwrights tended to be a few well known people, but as shipbuilding became more capital intensive it gave rise to companies and corporations that could organize finances, labor forces and materials for larger efforts.

I’m already noticing that my RDF creation will be an iterative process – I can’t associate a person with a company until I’ve figured out what that company’s URI will be.  Here’s where working with a system would come in handy – though I’m not sure how they solve the chicken-and-egg problem of referring to an entity before a URI is minted.   Having a clear convention for URIs may be one way around that problem.  At the moment I’m just minting hash URIs based on names. Personal (firstname_lastname) URIs are a little easier,  so I can link corporate records to future people records.
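
A convention like this is easy to capture in a small function, so that a corporate record can point at a person before the person record exists. This is just a sketch: the example.org document URI is a placeholder, and lowercasing and stripping punctuation are my assumptions about how to normalize names.

```python
import re
import unicodedata

BASE = "http://example.org/resource/people.rdf"  # placeholder document URI


def name_fragment(*name_parts):
    """Build a firstname_lastname style fragment from the parts of a name."""
    joined = " ".join(name_parts)
    # Drop accents, then replace every run of other characters with "_".
    ascii_text = (
        unicodedata.normalize("NFKD", joined).encode("ascii", "ignore").decode("ascii")
    )
    return re.sub(r"[^a-z0-9]+", "_", ascii_text.lower()).strip("_")


def hash_uri(document_uri, *name_parts):
    """Mint a hash URI for a named entity described in document_uri."""
    return f"{document_uri}#{name_fragment(*name_parts)}"


print(hash_uri(BASE, "William", "Woodcock"))
print(hash_uri(BASE, "William", "Woodcock, Jr."))
```

Because the function is deterministic, two files written at different times will agree on the URI for the same name, which takes care of the ordering problem, though not the problem of two people sharing a name.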

URIs based on names obviously create a problem for companies that merge, divide, incorporate, etc. This problem seems to lend credence to creating opaque URIs at some point.  For now, I’ll stick with the convention of names in my URIs, but will try to link my records with existing Freebase URIs if they exist.

How to do that well may be a bit of a problem. This example graph for Pusey and Jones from Freebase exhibits many of the problems outlined in Halpin and Hayes (2010) When owl:sameAs isn’t the Same: An Analysis of Identity Links on the Semantic Web.   While there is a continuation from the earlier Pusey and Jones Company to the later Pusey and Jones Corporation,  these two entities are separated in time and legal status.  Chasing down these differences is one of the fun parts of archival research (says the masochist in me).  While you might see the difference only in a preferred form of a name, changes in name may also be involved with change of location, change of business type, etc.   I’ve taken the easy route here – minor changes of name and leadership have been left as a single entity (using the previous names property to record changes).  Major acquisitions require a new record (with the two entities linked).  Later I’d like to come back to this question through the eyes of EAC-CPF, which may be better tuned for these kinds of subtle changes. Also, how complicated you want to make this probably depends on what you’d like to do with the RDF. Freebase/Wikipedia/dbPedia take a pretty high-level approach, which may mean that it will be of limited use for certain kinds of analysis.

Despite these problems,  the Freebase properties still seem like a place to start since they include properties that will link legal/conceptual entities together.  Many of the available properties are listed in Freebase, but they haven’t been completed for Pusey & Jones.  I played around a little with the Freebase editing and even added a few values,  but in the end created my own RDF graphs for most of the major Wilmington shipyards.  These are pretty simple stubs, with names, start and end dates and references to individual founders (they still need to be added to the people file).

Oh, one last thing.  Validate, validate, validate.  I caught a few minor errors by running this through the W3C RDF Validation service.
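
A cheap local pre-check can catch the most basic slips before a file goes to the validator. The sketch below is not a substitute for the W3C service: it only checks that the file is well-formed XML, that the root element is rdf:RDF, and that no rdf:ID value is repeated (assuming a single base URI). The precheck function and the GOOD sample with its example.org URI are my own inventions.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def precheck(rdf_xml):
    """Return a list of problems; an empty list means the cheap checks passed."""
    try:
        root = ET.fromstring(rdf_xml)
    except ET.ParseError as err:
        return [f"not well-formed XML: {err}"]
    problems = []
    if root.tag != f"{{{RDF}}}RDF":
        problems.append(f"root element is {root.tag}, expected rdf:RDF")
    seen = set()
    for node in root.iter():
        node_id = node.get(f"{{{RDF}}}ID")
        if node_id is None:
            continue
        if node_id in seen:
            problems.append(f"duplicate rdf:ID: {node_id}")
        seen.add(node_id)
    return problems


# Invented sample graph for illustration.
GOOD = (
    f'<rdf:RDF xmlns:rdf="{RDF}">'
    '<rdf:Description rdf:about="http://example.org/resource/corporate.rdf#pusey_and_jones"/>'
    "</rdf:RDF>"
)
print(precheck(GOOD))
print(precheck("<rdf:RDF>"))
```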

04/25/11

Of Ships and Men (Part 2)

Receipt, E. I. du Pont de Nemours and Company to William Woodcock, 1806-05-22

DuPont Collection. Hagley Museum and Library

I started out tonight with “How to Publish Linked Data on the Web” and learned that it has been superseded by a new book that promises updated information:  Linked Data: Evolving the Web into a Global Data Space.

Since the last time, I’ve decided to publish data on my own website – at least until I’ve gotten the feel for all of this and how it will fit together across all of my data.  Once that’s done I’ll consider contributing it to a resource like Freebase since they seem to have a simple import feature. Previously I’d set up a subdomain on my sandbox server (http://wilmingtonships.richardjurban.net).

For the moment I’m going to keep things simple by just using a /resource subdirectory to store my static RDF.  As this project grows, I’ll see whether this works (it is a relatively small data set) or whether a more robust solution is needed.

Here are the first two linked data graphs for this project, representing two of the earliest shipwrights in Wilmington:  William Woodcock, and his son William Woodcock, Jr.

Off to a good start, but already a lot of questions.  I’m using Freebase schemas and properties – working a little bit from examples of existing people.  Naturally dbPedia representations are different (metadata standards are like toothbrushes after all), but presumably there is some RDFS somewhere that connects Freebase and dbPedia properties.  Shelved until later.

While this was a quick way to whip up some examples,  I was struggling to grok the RDF for Henry Ford by just looking at it.  Silly wabbit,  triples are for computers.   Loading it into something that gives a more human-friendly presentation is really helpful.  For example, just using the W3C RDF Validation service made the RDF for Henry Ford more understandable.

I did mix in an RDFS Comment with a longer textual description based on some RDF I retrieved from dbPedia.  These don’t seem to be in Freebase output,  so I’m not sure what the general principles of mixing and matching like this will be.  (namespaces, sure, no problem – but is there an affordance to sticking with one schema/format?)

Of Ships and Men:  Part 1 | Part 2 |