Tonight I decided to go back to my large table of >1,000 ships and continue doing some clean-up. However, instead of trying to edit the values I had, I gave the Google Refine/Freebase reconciliation service a try. Boy-howdy, I really should have taken @jonvoss’s advice and done this sooner. I pretty quickly whipped through my Ship Type column and matched it to the /boats/ship/ship_type vocabulary in Freebase. As with any classification task, I think some of the subtleties of my data get lost. For the moment I think that’s OK, but if you care about the difference between a sidewheel paddle steamer and a sternwheel paddle steamer, be aware that they’ve been lumped together under the same class.
The reconciliation tool made quick work of matching the companies that make up the majority of the shipyards in my database, which are mostly the late 19th and 20th century yards. There’s not much of a record about individual vessels from the earlier yards. In the few cases where there wasn’t a match, I asked Freebase to create a new topic. (I’ll go back later and see how this populates Freebase itself.) I also did this for the Owners column, where the tool was able to match a smaller number of organizations and people. I don’t know whether I’m being a little cavalier about making new topics on Freebase, but this seems like the easiest thing to do. (I should probably be keeping better notes about what new things I’m creating – it would be nice to get some sort of report/e-mail with all those things listed.) Reconciling the owners has been slow going due to a bug in Refine that takes you back to the first row after reconciling a row that may be deep in your data. It helps to select the (none) facet, which hides the rows that already have a judgement assigned, and then use additional facets to narrow things down. While I’ve cut this list of owners down significantly, I’m still looking at a long tail of about 400 unmatched entities. (Many are individuals whose first names are abbreviated – with a little googling I can find many of them and expand the names.)
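As a rough sketch of how that long tail of unmatched owners might be triaged outside of Refine, the snippet below sorts names into those that start with a bare initial, those that start with a period-style abbreviation, and everything else. The function names and the small abbreviation table are my own assumptions for illustration, not anything Refine or Freebase provides, and the expansions would still need checking against a real source before I trusted them.

```python
# Hypothetical lookup of period-style first-name abbreviations (an assumption,
# deliberately small; extend it with whatever actually shows up in the data).
COMMON_ABBREVIATIONS = {
    "wm": "William",
    "geo": "George",
    "chas": "Charles",
    "jas": "James",
    "thos": "Thomas",
}


def classify_owner(name):
    """Return 'initial', 'abbreviation', or 'other' based on the first token."""
    tokens = name.split()
    if len(tokens) < 2:
        return "other"
    first = tokens[0].rstrip(".")
    if len(first) == 1 and first.isalpha():
        return "initial"
    if first.lower() in COMMON_ABBREVIATIONS:
        return "abbreviation"
    return "other"


def expand_first_name(name):
    """Expand a known abbreviation; leave every other name unchanged."""
    if classify_owner(name) != "abbreviation":
        return name
    tokens = name.split()
    tokens[0] = COMMON_ABBREVIATIONS[tokens[0].rstrip(".").lower()]
    return " ".join(tokens)


# Made-up owner names, only to show the three cases.
for owner in ["J. Smith", "Wm. Jones", "Example Steamboat Co."]:
    print(owner, "->", classify_owner(owner), "->", expand_first_name(owner))
```

Names that begin with a bare initial can't be expanded mechanically, so those are the ones that would still need the googling.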
A resource that is proving useful for double-checking my work is Shipbuilding History, which includes lists of vessels from the Wilmington yards. Tim seems to have collected some information I’m missing, so I’m thinking about the best way to reconcile his information with mine. There are other lists of vessels on the web that are not currently linked data, just large tables. Perhaps a screen scraper could make quick work of turning those into linked data graphs that can be merged with my graphs.
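To get a feel for the screen-scraping step, here is a minimal sketch that uses only Python's standard-library `html.parser` to turn an HTML table into a list of records keyed by the header row. The `TableScraper` class, the `table_to_records` function, and the sample table are all assumptions made up for illustration; a real page would likely need more care (nested tables, merged cells), and turning the records into actual linked data would be a separate step.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def _flush_cell(self):
        # Close the open cell (if any), collapsing runs of whitespace.
        if self._cell is not None and self._row is not None:
            self._row.append(" ".join("".join(self._cell).split()))
        self._cell = None

    def _flush_row(self):
        # Close the open row (if any), skipping rows with no cells.
        self._flush_cell()
        if self._row:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._flush_row()
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._flush_cell()
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._flush_cell()
        elif tag in ("tr", "table"):
            self._flush_row()


def table_to_records(html):
    """Return one dict per body row, keyed by the first row's cell text."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Made-up sample table, standing in for a vessel list on the web.
SAMPLE = """
<table>
  <tr><th>Hull</th><th>Name</th><th>Type</th></tr>
  <tr><td>1</td><td>Example One</td><td>Steamer</td></tr>
  <tr><td>2</td><td>Example Two</td><td>Tug</td></tr>
</table>
"""

records = table_to_records(SAMPLE)
for record in records:
    print(record)
```

Each record is a plain dictionary, so it could be loaded into Refine or compared against my own table before any reconciliation.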
But I think I’ve hit my limit for tonight. (I’ve been grading all day, so more than 14 hrs of staring at a screen is probably enough – time to hit my bunk).