05/17/11

On the ways (Part II)

"Galen L. Stone" Interior view, ribbing of tug under construction. Delaware Public Archives

Tonight I decided to go back to my large table of more than 1,000 ships and continue doing some clean-up. However, instead of trying to edit the values I had by hand, I gave the Google Refine/Freebase reconciliation service a try. Boy-howdy, I really should have taken @jonvoss’s advice and done this sooner. I pretty quickly whipped through my Ship Type column and matched it to the /boats/ship/ship_type vocabulary in Freebase. Like any classification task, I think some of the subtleties of my data get lost. For the moment I think that’s OK, but if you care about the difference between a sidewheel paddle steamer and a sternwheel paddle steamer, be aware that they’ve been lumped together under the same class.

The reconciliation tool made quick work of matching the companies that make up the majority of the shipyards in my database, which are mostly the late 19th and 20th century yards. There’s not much of a record about individual vessels from the earlier yards. In the few cases where there wasn’t a match, I asked Freebase to create a new topic. (I’ll go back later and see how this populates Freebase itself.) I also did this for the Owners column, where the tool was able to match a smaller number of organizations and people. I don’t know whether I’m being a little cavalier about making new topics on Freebase, but this seems like the easiest thing to do. (I should probably be keeping better notes about what new things I’m creating – it would be nice to get some sort of report/e-mail with all those things listed.) The Owners matching has been slow going due to a bug in Refine that takes you back to the first row after reconciling a row that may be deep in your data. It helps to select the (none) facet, which removes rows for which judgements have already been assigned, and to use additional facets to narrow things down. While I’ve cut this list of owners down significantly, I’m still looking at a long tail of about 400 unmatched entities. (Many are individuals whose first names are abbreviated – with a little googling I can find many of them and expand the names.)

A resource that is proving useful for double-checking my work is Shipbuilding History, which includes lists of vessels from the Wilmington yards. Tim seems to have collected some information I’m missing, so I’m thinking about the best way to reconcile his information with mine. There are other lists of vessels that are not currently linked data, but are large tables on the web. Perhaps a screenscraper might make quick work of turning those into linked data graphs that can be merged with my graphs.
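As a rough sketch of the screenscraping idea, the snippet below pulls the rows out of an HTML table using only Python's standard library. It assumes a single, well-formed table whose first row holds the column headers; the sample markup and column names are made up for illustration and are not taken from any of the sites mentioned. It only gets as far as plain records, so turning those into RDF would be a separate step.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_records(html):
    """Return one dict per data row, keyed by the header row's cell text."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Hypothetical sample markup, standing in for a real vessel list.
sample = (
    "<table><tr><th>Name</th><th>Yard</th></tr>"
    "<tr><td>Neptune</td><td>Jackson  and Sharp</td></tr></table>"
)
records = table_to_records(sample)
```

Real pages are rarely this tidy (nested tables, missing closing tags, merged cells), so a dedicated HTML library would probably be a better fit for anything beyond a quick experiment.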

But I think I’ve hit my limit for tonight. (I’ve been grading all day, so more than 14 hrs of staring at a screen is probably enough – time to hit my bunk).

On the Ways (Part I)

05/9/11

On the ways

Tugboat "Neptune" under construction at Jackson and Sharp. Jackson and Sharp Collection. Courtesy Delaware Public Archives

This will be a short post this week as I’ve decided to use my “study night” to dip my toes into LaTeX tonight (exploring actual dissertation production workflows, weee!).

ways (n.): df. structure consisting of a sloping way down to the water from the place where ships are built or repaired

After creating records for various agents involved in my data, there remains the data about the vessels themselves. Again sticking to a pragmatic exploration, I’ll be using the Freebase Ship schema for this data. Thankfully, many of the properties listed there are ones that I already have in my database. I’m still working with Google Refine to clean up my data, but here is how the properties will map.

My Data       Freebase Property
ShipName      type.object.name
HullNo        e.g. DE 107 and/or a particular ID assigned by a shipyard (see the Hagley photos for examples at P&J). I don’t see a Freebase property for this field in my data.
ShipType      boats/ship_class/ship_type
HullType      (boats/hull_configuration?) there are no instances of this property in Freebase
ShipPower     boats/ship/means_of_propulsion
LOA           boats/ship/length_overall
Beam          boats/ship/beam
Displace      boats/ship/displacement
Tonnage       (same as displacement?) hmm..
Draft         boats/ship/draught
LaunchDate    boat/ship/launched
Fate          boat/boat_fate
Yard          boats/ship/ship_builder
Designer      boats/ship/designer
Owner         boats/ship/owners
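
To make the mapping concrete, here is a minimal sketch that applies the table above to one row of the ship database. The property strings are copied as written in the table; the `map_row` helper and the sample row values are hypothetical, and columns without a clear Freebase property (HullNo, HullType, Tonnage) are deliberately left out of the mapping so they surface as unmapped.

```python
# Column-to-property mapping copied from the table above. Columns with no
# clear Freebase property (HullNo, HullType, Tonnage) are left out on purpose.
COLUMN_TO_PROPERTY = {
    "ShipName": "type.object.name",
    "ShipType": "boats/ship_class/ship_type",
    "ShipPower": "boats/ship/means_of_propulsion",
    "LOA": "boats/ship/length_overall",
    "Beam": "boats/ship/beam",
    "Displace": "boats/ship/displacement",
    "Draft": "boats/ship/draught",
    "LaunchDate": "boat/ship/launched",
    "Fate": "boat/boat_fate",
    "Yard": "boats/ship/ship_builder",
    "Designer": "boats/ship/designer",
    "Owner": "boats/ship/owners",
}


def map_row(row):
    """Split one database row into (mapped, unmapped) dictionaries.

    `mapped` is keyed by Freebase property, `unmapped` by the original
    column name. Empty cells are skipped entirely.
    """
    mapped, unmapped = {}, {}
    for column, value in row.items():
        if value in (None, ""):
            continue  # nothing to carry over for an empty cell
        prop = COLUMN_TO_PROPERTY.get(column)
        if prop is None:
            unmapped[column] = value
        else:
            mapped[prop] = value
    return mapped, unmapped


# Illustrative row; the values are placeholders, not real records.
mapped, unmapped = map_row({"ShipName": "Neptune", "HullNo": "X-1", "Beam": ""})
```

Keeping the unmapped columns in their own dictionary makes it easy to see which fields still need a home once the open questions about the schema are settled.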

There seem to be some properties that are part of different, but related, Freebase schemas (e.g. boats/ship/displacement and boats/ship_class/displacement_tons) that I need to sort out. There are also other Freebase properties that don’t map directly to columns in my table (e.g. notableFor), but that may be useful for adding some of the stories around vessels found in my book, in a comments field with general notes, and in the banker’s box of research notes that haven’t seen the light of day.

I see another chicken/egg problem looming. While I have most of the shipyards in my previously created corporate.rdf file and individual shipwrights will follow in people.rdf, I also need to add the individuals/corporations who are owners/agents. In an earlier post @jonvoss pointed me towards a Google Refine plugin to rectify linked data. This seems like a good place to try to deploy it at scale. (Hand-crafting a few records for the yards wasn’t too hard, but there are hundreds of owners, buyers, etc.) I remain a little skeptical that this will work well. From what I’ve seen so far of Freebase, big popular things are represented but things in the long tail are not. (There’s a research question in there somewhere.)

I am also considering using an opaque URI for these vessels, perhaps based on an auto-generated ID number. The ship names in my database overlap quite a bit, which makes using names directly in URIs a little dangerous (the same can probably be said for personal and corporate names at a global scale, though not so much for my limited set). It may also be possible to use a combination of fields to generate opaque URIs (for example, see Styles, Ayers & Shabir (2008) Semantic MARC, MARC21 and the Semantic Web. LDOW2008.)
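
One possible way to build an opaque URI from a combination of fields is to hash them, as in the sketch below. This is an illustration only and not necessarily the scheme described in the cited paper: the choice of fields (ship name, yard, launch date), the `example.org` namespace, and the 12-character truncation are all assumptions.

```python
import hashlib

# Hypothetical namespace; substitute whatever base the project settles on.
BASE = "http://example.org/vessel/"


def opaque_vessel_uri(ship_name, yard, launch_date, base=BASE):
    """Derive a stable, name-free URI from a combination of fields.

    The fields are normalised (trimmed, lower-cased) and joined with a
    separator so that the same vessel always yields the same URI.
    """
    parts = [ship_name, yard, launch_date]
    key = "|".join(p.strip().lower() for p in parts)
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()[:12]
    return base + digest


# Two vessels sharing a name but built at different yards get different URIs.
# The launch dates here are placeholders, not real records.
uri_a = opaque_vessel_uri("Neptune", "Jackson and Sharp", "1900-01-01")
uri_b = opaque_vessel_uri("Neptune", "Pusey and Jones", "1900-01-01")
```

The trade-off is that a URI derived this way changes if any of the source fields are later corrected, which is one argument for an auto-generated ID number instead.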

05/2/11

American Clyde


The concentration of shipyards that stretched along the Delaware River (from Wilmington north through Philadelphia, PA, Camden and Trenton, NJ) earned it the nickname “American Clyde”  after the  River Clyde in Scotland.   (read a contemporary account “American Clyde” from Harpers 1878).

Early shipwrights tended to be a few well known people, but as shipbuilding became more capital intensive it gave rise to companies and corporations that could organize finances, labor forces and materials for larger efforts.

I’m already noticing that my RDF creation will be an iterative process – I can’t associate a person with a company until I’ve figured out what that company’s URI will be. Here’s where working with a system would come in handy, though I’m not sure how such systems solve the chicken-and-egg problem of referring to an entity before a URI is minted. Having a clear convention for URIs may be one way around that problem. At the moment I’m just minting hash URIs based on names. Personal (firstname_lastname) URIs are a little easier, so I can link corporate records to future people records.

URIs based on names obviously create a problem for companies that merge, divide, incorporate, etc. This problem seems to lend credence to creating opaque URIs at some point. For now, I’ll stick with the convention of names in my URIs, but will try to link my records with existing Freebase URIs if they exist.
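
A minimal sketch of a name-based hash URI convention follows. The point is that a predictable rule lets a corporate record refer to a person's URI before the person's record has been written. The `example.org` base and the exact cleaning rule are assumptions; only the file names corporate.rdf and people.rdf and the underscore-joined name pattern come from these posts.

```python
import re

# Hypothetical base; the file names follow the corporate.rdf / people.rdf split.
BASE = "http://example.org/shipyards"


def name_fragment(name):
    """Turn a display name into a URI fragment, e.g. 'A B' -> 'A_B'.

    Any run of characters other than letters and digits becomes a single
    underscore, and leading/trailing underscores are dropped.
    """
    cleaned = re.sub(r"[^A-Za-z0-9]+", "_", name.strip())
    return cleaned.strip("_")


def hash_uri(document, name, base=BASE):
    """Mint a hash URI of the form <base>/<document>#<fragment>."""
    return f"{base}/{document}#{name_fragment(name)}"


company = hash_uri("corporate.rdf", "Pusey and Jones Company")
successor = hash_uri("corporate.rdf", "Pusey and Jones Corporation")
```

Because the fragment is derived entirely from the name, two different entities with the same name would collide, which is exactly the weakness that opaque URIs avoid.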

How to do that well may be a bit of a problem. This example graph for Pusey and Jones from Freebase exhibits many of the problems outlined in Halpin and Hayes (2010) When owl:sameAs isn’t the Same: An Analysis of Identity Links on the Semantic Web. While there is a continuation from the earlier Pusey and Jones Company to the later Pusey and Jones Corporation, these two entities are separated in time and legal status. Chasing down these differences is one of the fun parts of archival research (says the masochist in me). While you might see the difference only in a preferred form of a name, a change in name may also come with a change of location, change of business type, etc. I’ve taken the easy route here: minor changes of name and leadership have been left as a single entity (using the previous names property to record changes), while major acquisitions require a new record (with the two entities linked). Later I’d like to come back to this question through the eyes of EAC-CPF, which may be better tuned for these kinds of subtle changes. Also, how complicated you want to make this probably depends on what you’d like to do with the RDF. Freebase/Wikipedia/dbPedia take a pretty high-level approach, which may mean that their data will be of limited use for certain kinds of analysis.

Despite these problems, the Freebase schema still seems like a place to start, since it has properties that will link legal/conceptual entities together. Many of the available properties are listed in Freebase, but they haven’t been completed for Pusey & Jones. I played around a little with the Freebase editing and even added a few values, but in the end created my own RDF graphs for most of the major Wilmington shipyards. These are pretty simple stubs, with names, start and end dates and references to individual founders (who still need to be added to the people file).

Oh, one last thing.  Validate, validate, validate.  I caught a few minor errors by running this through the W3C RDF Validation service.

03/28/11

Scanning the Horizon

Before I get started building my own data for this project it seems like it would be useful to see what linked data is already available and what kinds of properties are being assigned to each of the entities I’ve identified. To start I’ve only looked at dbPedia, although it also includes some links to Freebase. I would be interested to hear if there are other common ways to do some due diligence in the growing LOD cloud before creating new contributions.

People

I didn’t find any of the key players in Wikipedia or dbPedia,  although they may be mentioned in the articles for shipyards below. In addition to FOAF, dbPedia has additional properties for relationships between people and companies. (e.g. see Andrew Carnegie)

Companies

Most of the big yards are represented, but all of the smaller, earlier shipyards are absent.  I’ll probably start with these existing records, making sure the companies are equally represented and work on some of the other yards later.

Bethlehem, Dravo and ACF were very large corporations with many divisions and multiple shipyards.  These descriptions point to the larger entity, but not to the subdivision. Perhaps a little archival context needed here?

Vessels

This is just a small sample (for more, see Ships Built in Delaware), but a useful example of the properties represented in dbPedia (vessels are in the Ships class, but the properties don’t seem to be explicitly associated with that class. I’ll need to do some more digging into how the classes/properties are defined here). Fortunately, my database has many of the same properties, which should make mapping it easy. I’m currently working with Google Refine to clean up what I have.

I am confused about how these graphs link to the graphs for companies above.  For example,  the description of the U.S.S. Louisiana includes a link to Harlan & Hollingsworth in its Wikipedia Infobox but the dbPedia entry just has a literal.

Events

There are properties for both companies and vessels that represent events (like founding, launching, decommissioning, etc.). There are also Wikipedia categories such as Companies Founded in 1899, which seem a little redundant if you have formatted information (though “Ships Built in Delaware” is useful, it seems like there should be another way to do this).

Locations

While the records above indicate the companies were located in Wilmington, DE, none of them have a specific geolocation. I whipped up a quick Google Map based on my original fly-leaf illustration. This was a quick start; I’ll need to translate these into latitude & longitude to add to the descriptions. I don’t know whether it’s possible to use other shapes to indicate the full extent of some of the yards that stretched along the waterfront.
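
On the question of shapes: formats such as GeoJSON do allow polygons as well as points, so one option would be to trace each yard's outline. The sketch below builds a GeoJSON polygon feature from a list of corner points. The `yard_feature` helper is hypothetical, and the coordinates are rough placeholders in the general vicinity of Wilmington, not surveyed yard boundaries.

```python
import json


def yard_feature(name, outline):
    """Build a GeoJSON Feature with a Polygon geometry for one yard.

    `outline` is a list of (longitude, latitude) pairs tracing the yard.
    GeoJSON puts longitude first and requires the ring to be closed.
    """
    ring = [list(point) for point in outline]
    if ring[0] != ring[-1]:
        ring.append(ring[0])  # close the ring by repeating the first point
    return {
        "type": "Feature",
        "properties": {"name": name},
        "geometry": {"type": "Polygon", "coordinates": [ring]},
    }


# Placeholder outline only: a small box, NOT a real yard boundary.
feature = yard_feature(
    "Example yard",
    [(-75.55, 39.73), (-75.54, 39.73), (-75.54, 39.74), (-75.55, 39.74)],
)
geojson_text = json.dumps(feature, indent=2)
```

Whether a given mapping tool or RDF vocabulary will accept such shapes is a separate question that would need checking against its documentation.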

