05/17/11

On the ways (Part II)

"Galen L. Stone" Interior view, ribbing of tug under construction.  Delaware Public Archives

Tonight I decided to go back to my large table of >1,000 ships and continue doing some clean-up. However, instead of trying to edit the values I had, I gave the Google Refine/Freebase reconciliation service a try. Boy-howdy, I really should have taken @jonvoss’s advice and done this sooner. I pretty quickly whipped through my Ship Type column and matched it to the /boats/ship/ship_type vocabulary in Freebase. As with any classification task, I think some of the subtleties of my data get lost. For the moment I think that’s OK, but if you care about the difference between a sidewheel paddle steamer and a sternwheel paddle steamer, be aware that they’ve been lumped together under the same class.

The reconciliation tool made quick work of matching the companies that make up the majority of the shipyards in my database, which are mostly the late 19th- and 20th-century yards. There’s not much of a record about individual vessels from the earlier yards. In the few cases where there wasn’t a match, I asked Freebase to create a new topic. (I’ll go back later and see how this populates Freebase itself.) I also did this for the Owners column, where the tool was able to match a smaller number of organizations and people. I don’t know whether I’m being a little cavalier about making new topics on Freebase, but this seems like the easiest thing to do. (I should probably be keeping better notes about what new things I’m creating – it would be nice to get some sort of report/e-mail with all those things listed.) The Owners reconciliation has been slow going due to a bug in Refine that takes you back to the first row after reconciling a row that may be deep in your data. It helps to select the (none) facet, which removes rows for which judgements have already been assigned, and to use additional facets to narrow things down. While I’ve cut this list of owners down significantly, I’m still looking at a long tail of about 400 unmatched entities. (Many are individuals whose first names are abbreviated – with a little googling I can find many of them and expand the names.)

A resource that is proving useful for double-checking my work is Shipbuilding History, which includes lists of vessels from the Wilmington yards. Tim seems to have collected some information I’m missing, so I’m thinking about the best way to reconcile his information with mine. There are other lists of vessels that are currently not linked data, but are large tables on the web. Perhaps a screen scraper might make quick work of turning those into linked data graphs that can be merged with my graphs.
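Here is a minimal sketch of what that screen-scraper idea could look like, using only the Python standard library. It reads an HTML table and emits one N-Triples line per non-empty cell. Everything specific in it is an assumption for illustration: the `example.org` namespaces, the column names, the sample row, and the choice to treat the first column as the vessel's identifier. A real table would need its own column mapping and proper property URIs.

```python
from html.parser import HTMLParser
from urllib.parse import quote


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical namespaces, for illustration only.
BASE = "http://example.org/resource/"
PROP = "http://example.org/property/"


def slug(text):
    """Lower-case, underscore-separated, percent-encoded identifier."""
    return quote("_".join(text.lower().split()))


def literal(text):
    """Quote a plain string as an N-Triples literal."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def table_to_ntriples(html):
    """Treat the first row as headers and the first column as the subject."""
    parser = TableRows()
    parser.feed(html)
    if not parser.rows:
        return []
    header, *body = parser.rows
    lines = []
    for row in body:
        if not row or not row[0]:
            continue  # no identifier, so no subject to hang triples on
        subject = f"<{BASE}{slug(row[0])}>"
        for name, value in zip(header, row):
            if value:
                lines.append(f"{subject} <{PROP}{slug(name)}> {literal(value)} .")
    return lines


# Made-up sample table; not data from any real vessel list.
SAMPLE = """
<table>
  <tr><th>Name</th><th>Ship Type</th></tr>
  <tr><td>Example Vessel</td><td>Tug</td></tr>
</table>
"""

for line in table_to_ntriples(SAMPLE):
    print(line)
```

Every value comes out as a plain string literal, so owners, yards, and ship types would still need a reconciliation pass before they point at shared topics.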

But I think I’ve hit my limit for tonight. (I’ve been grading all day, so more than 14 hrs of staring at a screen is probably enough – time to hit my bunk).

On the Ways (Part I)

04/25/11

Of Ships and Men (Part 2)

Receipt, E. I. du Pont de Nemours and Company to William Woodcock, 1806-05-22

DuPont Collection. Hagley Museum and Library

I started out tonight with “How to Publish Linked Data on the Web” and learned that it has been superseded by a new book that promises updated information: Linked Data: Evolving the Web into a Global Data Space.

Since the last time, I’ve decided to publish the data on my own website – at least until I’ve gotten a feel for all of this and how it will fit together across all of my data. Once that’s done I’ll consider contributing it to a resource like Freebase, since they seem to have a simple import feature. Previously I’d set up a subdomain on my sandbox server (http://wilmingtonships.richardjurban.net).

For the moment I’m going to keep things simple by just using a /resource subdirectory to store my static RDF.  As this project grows, I’ll see whether this works (it is a relatively small data set) or whether a more robust solution is needed.

Here are the first two linked data graphs for this project, representing two of the earliest shipwrights in Wilmington:  William Woodcock, and his son William Woodcock, Jr.
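As a rough illustration of the static-file approach, the sketch below builds a tiny graph for each of the two shipwrights and writes it as an N-Triples file under a `resource` directory. It is not the RDF actually published for this project: it swaps in FOAF and RDFS terms in place of the Freebase properties, and the `example.org` base URI, the slugs, the `.nt` file naming, and the helper functions are all hypothetical.

```python
import tempfile
from pathlib import Path

# Well-known vocabulary terms (RDF, RDFS, FOAF).
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_COMMENT = "http://www.w3.org/2000/01/rdf-schema#comment"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"
FOAF_PERSON = "http://xmlns.com/foaf/0.1/Person"

# Hypothetical base URI standing in for the site's /resource path.
BASE = "http://example.org/resource/"


def literal(text):
    """Quote a plain string as an N-Triples literal."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def person_ntriples(slug, name, comment):
    """Return the N-Triples lines describing one person."""
    subject = f"<{BASE}{slug}>"
    return [
        f"{subject} <{RDF_TYPE}> <{FOAF_PERSON}> .",
        f"{subject} <{FOAF_NAME}> {literal(name)} .",
        f"{subject} <{RDFS_COMMENT}> {literal(comment)} .",
    ]


def write_resource(site_root, slug, triples):
    """Write one static file per resource under <site_root>/resource/."""
    path = Path(site_root) / "resource" / f"{slug}.nt"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(triples) + "\n", encoding="utf-8")
    return path


# Slugs are made up; names and the description come from the post itself.
PEOPLE = [
    ("william_woodcock", "William Woodcock"),
    ("william_woodcock_jr", "William Woodcock, Jr."),
]

if __name__ == "__main__":
    # A temporary directory stands in for the web server's document root.
    with tempfile.TemporaryDirectory() as site_root:
        for slug, name in PEOPLE:
            triples = person_ntriples(
                slug, name, "One of the earliest shipwrights in Wilmington."
            )
            path = write_resource(site_root, slug, triples)
            print(path.name)
            print(path.read_text(encoding="utf-8"))
```

One file per resource keeps the URI-to-file mapping obvious for a small data set. Serving those files with the right content type is a server configuration question that this sketch does not address.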

Off to a good start, but already a lot of questions. I’m using Freebase schemas and properties – working a little bit from examples of existing people. Naturally DBpedia representations are different (metadata standards are like toothbrushes after all), but presumably there is some RDFS somewhere that connects Freebase and DBpedia properties. Shelved until later.

While this was a quick way to whip up some examples,  I was struggling to grok the RDF for Henry Ford by just looking at it.  Silly wabbit,  triples are for computers.   Loading it into something that gives a more human-friendly presentation is really helpful.  For example, just using the W3C RDF Validation service made the RDF for Henry Ford more understandable.
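To give a sense of what a "more human-friendly presentation" can mean, here is a small sketch that flattens RDF/XML into short subject / property / value rows using Python's standard `xml.etree.ElementTree`. It is deliberately naive and is not how the W3C validator works: it assumes the simplest RDF/XML shape (top-level `rdf:Description` elements with flat property children) and would miss typed nodes, nested descriptions, and blank-node structures. The sample document and its `example.org` URI are made up.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def short(name):
    """Reduce '{namespace}tag' or a full URI to its last segment."""
    name = name.rsplit("}", 1)[-1]
    return name.rstrip("/").replace("#", "/").rsplit("/", 1)[-1]


def readable_triples(rdf_xml):
    """Return (subject, property, value) tuples with shortened names."""
    root = ET.fromstring(rdf_xml.strip())
    rows = []
    for node in root.findall(f"{{{RDF}}}Description"):
        subject = short(node.get(f"{{{RDF}}}about", "(blank node)"))
        for prop in node:
            # A property either points at another resource or holds text.
            target = prop.get(f"{{{RDF}}}resource")
            value = short(target) if target else (prop.text or "").strip()
            rows.append((subject, short(prop.tag), value))
    return rows


# Made-up sample; the resource URI is hypothetical.
SAMPLE = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <rdf:Description rdf:about="http://example.org/resource/william_woodcock">
    <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Person"/>
    <foaf:name>William Woodcock</foaf:name>
  </rdf:Description>
</rdf:RDF>
"""

for subject, prop, value in readable_triples(SAMPLE):
    print(f"{subject:20} {prop:10} {value}")
```

Shortening URIs to their last segment throws away the namespace, so two properties called `name` from different vocabularies would look identical. That is fine for eyeballing a graph but not for anything a computer should consume.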

I did mix in an RDFS comment (rdfs:comment) with a longer textual description based on some RDF I retrieved from DBpedia. These don’t seem to be in the Freebase output, so I’m not sure what the general principles for mixing and matching like this will be. (Namespaces, sure, no problem – but is there an affordance for sticking with one schema/format?)

Of Ships and Men:  Part 1 | Part 2 |