06/17/13

Introducing LODLAM Patterns

Crossposted at LODLAM Summit 2013:  Introducing LODLAM Patterns

Linked Data provides us with an incredible opportunity to re-think how we approach sharing information about LAM collections. However, these opportunities are also fraught with danger and important challenges that we must face. Translating existing standards into compliant Linked Data will take more than just cross-walking terms with similar meanings; it also means mapping between conceptual models and ontologies. Linked Data also provides us new opportunities to mix models and vocabularies in ways that we haven’t been able to do before. How can we take better advantage of these opportunities?

Ultimately, creating Linked Data standards and practices is a set of design problems that we are all engaged in. Elizabeth Churchill has called for “Data Aware Design” and the need to bring human-computer interaction methods to bear on these problems. At the Summit I will be presenting a Dork Short about a new site that I’m launching to do just this. LODLAM Patterns will identify Linked Data design patterns (which I’m calling representation patterns) for cultural heritage resources. The idea is to identify common problems that we are trying to solve and link them to the solutions that are available across the many, many standards for describing LAM resources. My goal is to create a resource that will spur discussions focused on problems/solutions, give newcomers a way to navigate the LOD standards universe, and serve as a pedagogical tool to teach “design-thinking” for Linked Data.

Participate by signing up at http://lodlampatterns.org or follow along @lodlamp or #lodlamp.

06/5/13

Reconciling Museums Count

From the tweets, it sounds like there were several interesting projects working with the Museums Count data.  Dylan Barrtlet combined the IMLS data with IRS data to clean up some of the address info.   Michael Girarldo mashed up the data with the public library data for a proof-of-concept mobile app that would help users locate museums and libraries in their vicinity.

I’ve continued to play around with this data using OpenRefine and the DBpedia SPARQL endpoint.  Attempting to reconcile against no type, I found approximately 19% of the museums in the IMLS data.

Doing a spot check of the unmatched entities:

  • if it is a simply named entity,  it’s not in Wikipedia/DBpedia
  • it’s an organization that operates a museum of a different name or has a different legal name than the one it’s known by in DBpedia. e.g:
  • if it is a complex name (i.e. dirty IMLS data),  it’s not matched
  • abbreviations in the name that cannot be matched (e.g. Ntnl (national), Ctr (center), Hist (history/historical), Inc., etc.) or conjunctions/punctuation (e.g. & for and).
Some of these problems might be fixed by cleaning up the names,  but some of the disambiguation may require human intervention.   I haven’t tried looking up entities in DBpedia by address, but that might also help identify things that are uncertain.
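A minimal sketch of the kind of name clean-up described above, assuming a small hand-built abbreviation table. The table entries, the suffix list, and the function name normalize_name are illustrative assumptions, not part of the IMLS data or of OpenRefine:

```python
import re

# Hypothetical abbreviation table; only the examples named in the post are included.
ABBREVIATIONS = {
    "ntnl": "national",
    "ctr": "center",
    "hist": "historical",  # could equally mean "history"; this choice is a guess
    "&": "and",
}

# Legal suffixes dropped before matching (an assumption, not an IMLS rule).
SUFFIXES = {"inc"}


def normalize_name(name):
    """Lower-case a name, expand known abbreviations, and drop legal suffixes."""
    tokens = re.findall(r"[a-z0-9]+\.?|&", name.lower())
    cleaned = []
    for token in tokens:
        bare = token.rstrip(".")
        if bare in SUFFIXES:
            continue
        cleaned.append(ABBREVIATIONS.get(bare, bare))
    return " ".join(cleaned)
```

Running the normalized names through reconciliation instead of the raw ones would only help with the mechanical mismatches; ambiguous abbreviations such as Hist still need a human decision.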
Some of these mismatches do raise interesting ontological questions and get back to the issue of a Museum (organization) vs. Museum (a place). It looks like there are lots of unreconciled houses, historic sites, etc. with different legal names than the places they are associated with. What will be the best way to represent/associate these entities?

 Mismatches

Looking at the “matched” items, I found that lots of generically named museums were matched to a specific museum. For example, there are many “Art Gallery” things (usually at this or that college) that were all matched to the same DBpedia entity. Likewise, there are about 15 different “Pioneer Museum,” “Museum of Natural History,” “Museum of Anthropology,” “Museum of Art,” “University Museum,” “Cowboy Hall of Fame.” Another area of mismatch is county/city historical societies where the locality has the same name (i.e. 12 different “Douglas County Historical Society” entries all matched to Douglas County Historical Society in Nebraska).

There are also multiple sites that are maintained by a single entity, like a state museum network or a city. For example, The Peale Museum (aka Municipal Museum of Baltimore) and Edgar Allan Poe House and Museum are simply listed here as “City of Baltimore.” It was necessary to look up the addresses to see what museum entity was there.

So clearly, a more robust approach to reconciliation is needed, perhaps including the city/state of an entity in order to disambiguate similar names.
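One way the city/state idea might be sketched: build the match key from name, city, and state together, and accept a match only when exactly one candidate shares that key. The record layout (dicts with id, name, city, state, uri) and the function names are assumptions for illustration; this is not how OpenRefine reconciles internally.

```python
from collections import defaultdict


def match_key(name, city, state):
    """Combine name and locality so same-named institutions stay distinct."""
    return (name.strip().lower(), city.strip().lower(), state.strip().upper())


def reconcile(records, candidates):
    """Map each record id to a candidate URI, or None if absent or ambiguous.

    records:    dicts with id, name, city, state (hypothetical layout)
    candidates: dicts with name, city, state, uri (hypothetical layout)
    """
    index = defaultdict(list)
    for cand in candidates:
        key = match_key(cand["name"], cand["city"], cand["state"])
        index[key].append(cand["uri"])

    results = {}
    for rec in records:
        uris = index.get(match_key(rec["name"], rec["city"], rec["state"]), [])
        # Only a single unambiguous hit counts as a match.
        results[rec["id"]] = uris[0] if len(uris) == 1 else None
    return results
```

With a key like this, a “Douglas County Historical Society” in one state would no longer be matched to the Nebraska entity just because the names agree.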

Lots of challenges here, but also seems to be lots of opportunities to add to and enrich museum representation in Linked Data/wiki resources.

05/31/13

Quick Museum Counts update

oh well, best laid plans…

Didn’t quite get as far as I’d hoped this week. Following Justin’s comments in the previous post, I did a quick mapping to the Organization Ontology, vCard, WGS84 and schema.org (seems to be the only one with a DUNS property).

A sample in Turtle looks like this (you can also download the full set as Turtle):

@prefix schema: <http://schema.org/> .
@prefix v: <http://www.w3.org/2006/vcard/ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix org: <http://www.w3.org/ns/org#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
<http://chi.cci.fsu.edu/resource/museum/imls_0> a org:Organization ;
 skos:prefLabel "Carmel Valley Historical Society" ;
 org:hasSite <http://chi.cci.fsu.edu/resource/site/imls_0> .
<http://chi.cci.fsu.edu/resource/site/imls_0> a org:Site ;
 geo:long "-121.726912" ;
 geo:lat "36.47231" ;
 v:street-address "PO BOX 1427" ;
 v:locality "Carmel Valley" ;
 v:region "CA" ;
 v:postal-code "93924" .

Since I’m not aware of an authorized URI for IMLS resources, I just minted one in my own domain. However, at this time these don’t resolve to anything. Ideally I might be able to reconcile many of the city/zip codes to a previously published URI for a place.
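A rough sketch of how rows of the IMLS CSV might be turned into Turtle like the sample, using only the Python standard library. The column names (id, name, address, city, state, zip, longitude, latitude) follow the IMLS field list; the helper names are hypothetical, the @prefix declarations are assumed to be written once at the top of the output file, and this is not necessarily how the published file was produced.

```python
import csv
import io

# Base path copied from the sample; these URIs do not resolve to anything.
BASE = "http://chi.cci.fsu.edu/resource"


def turtle_literal(value):
    """Quote a string as a Turtle literal, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def row_to_turtle(row):
    """Render one CSV row as an org:Organization plus its org:Site."""
    museum = f"<{BASE}/museum/imls_{row['id']}>"
    site = f"<{BASE}/site/imls_{row['id']}>"
    lines = [
        f"{museum} a org:Organization ;",
        f" skos:prefLabel {turtle_literal(row['name'])} ;",
        f" org:hasSite {site} .",
        f"{site} a org:Site ;",
        f" geo:long {turtle_literal(row['longitude'])} ;",
        f" geo:lat {turtle_literal(row['latitude'])} ;",
        f" v:street-address {turtle_literal(row['address'])} ;",
        f" v:locality {turtle_literal(row['city'])} ;",
        f" v:region {turtle_literal(row['state'])} ;",
        f" v:postal-code {turtle_literal(row['zip'])} .",
    ]
    return "\n".join(lines)


def csv_to_turtle(csv_text):
    """Convert CSV text (with a header row) into a block of Turtle statements."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return "\n".join(row_to_turtle(row) for row in reader)
```

Note that org:hasSite points at the site URI itself rather than at a quoted string, so the organization and the site end up connected in the graph.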

I’ll be offline for most of tomorrow, but will try to check in the evening.

05/30/13

What is a Library/Archive/Museum According to Linked Data?

In my previous post,  Hacking Museums Count,  I introduced the data that IMLS has released for the National Civic Day of Hacking and took a first stab at assigning LOD properties to the data they provided.     This was based on some of my previous work mapping IMLS DCC data.  Following my initial observations,  I decided to take a closer look at how libraries, archives, and museums are currently modeled in Linked Data resources such as Freebase, DBpedia, and Schema.org.

What is a Library?

“A library…is an organized collection of information resources made accessible to a defined community for reference or borrowing.” (Wikipedia)
  • /organization/local business/library (Schema.org)
  • /organization/educational institution/library (DBpedia)
  • /place/Architectural Structure/Building/library (DBpedia)
  • /architecture/building function/ (Freebase/Wikipedia)
  • /library (Freebase)

What is an Archive?

“An archive is an accumulation of historical records, or the physical place they are located.” (Wikipedia)

  • notably, Schema.org does not have an explicit class for archives
  • /organization/government agency/ (DBpedia)
  • /archive (DBpedia), although this class doesn’t seem to be associated with other top-level classes. Wikipedia/DBpedia resources are given as instances of this class.
  • /organization/organization_type (Freebase)
  • /organization/organization_sector (Freebase)
Best one yet….  /fictional organization type! (Freebase)

What is a Museum?

“A museum is a building or institution dedicated to the acquisition, conservation, study, exhibition, and educational interpretation of objects having scientific, historical, cultural or artistic value.” (Wikipedia).

  • /place/civicStructure/Museum (Schema.org)
  • /place/Architectural Structure/Building/Museum (DBpedia)
  • /architecture/museum (Freebase)

This is only a partial listing of the kinds of things a LAM can be according to these linked data sources. It raises some interesting questions about how to properly model the IMLS data. What is IMLS’s view on this information? The emphasis of the total dataset is on location (address, lat/long, Census Block, etc.), so perhaps this is reflected in how LAMs are represented in Linked Data. I’ll need to go back and revisit my assumptions about LAMs being organizations that are associated with a structure of some kind.

So what?

As we talk about the convergence of LAMs,  the different ways that each sector has been represented as top-level Linked Data classes raises some interesting questions.

  • What does this tell us about public (well…LD public) perception of LAMs?  The narrative definitions on Wikipedia pages seem incongruent with the ontological classes. (is a library a collection? an organization? a building?)
  • Much of our attention as professionals is directed at representing our collections.  What is our responsibility to ensure that LD concepts reflect our understanding of what we do?
  • What is the impact of these specifications on our other Linked Data work?  Given some of the choices that have been made so far,  it could lead to some unexpected inferences:
    • <A Building> <hasCopyright> <An Image>
    • <A Painting> <isPhysicallyLocatedIn> <An Organization>
    • <A Person> <employedBy> <A Place>
The danger here (as with much Linked Data) is that we are talking about a few different entities that overlap.   The Museum (Legal Entity);  The Museum (a building); The Museum (a functional role as an organization that collects stuff).  Perhaps by looking at other kinds of entities that are similar/different to museums (i.e. non-profits, performing arts, businesses, etc.) we can see some alternatives that neatly address this problem.
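To make the overlap concrete, here is a toy sketch of RDFS-style domain and range inference over triples like the ones listed above. All property and class names (ex:hasCopyright, ex:Place, and so on) are made up for illustration and are not taken from Schema.org, DBpedia, or Freebase:

```python
def infer_types(triples, domains, ranges):
    """Collect the classes implied for each node by property domains and ranges.

    triples: iterable of (subject, property, object) strings
    domains: property -> class implied for the subject
    ranges:  property -> class implied for the object
    """
    types = {}
    for subject, prop, obj in triples:
        if prop in domains:
            types.setdefault(subject, set()).add(domains[prop])
        if prop in ranges:
            types.setdefault(obj, set()).add(ranges[prop])
    return types


# Hypothetical schema and data: one URI standing in for "the museum".
DOMAINS = {"ex:hasCopyright": "ex:LegalEntity"}
RANGES = {
    "ex:isPhysicallyLocatedIn": "ex:Place",
    "ex:employedBy": "ex:Organization",
}
TRIPLES = [
    ("ex:museum", "ex:hasCopyright", "ex:image"),
    ("ex:painting", "ex:isPhysicallyLocatedIn", "ex:museum"),
    ("ex:person", "ex:employedBy", "ex:museum"),
]

museum_types = infer_types(TRIPLES, DOMAINS, RANGES)["ex:museum"]
```

Under these assumed definitions the single museum URI ends up typed as a legal entity, a place, and an organization all at once, which is the conflation described above.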

 

05/28/13

Hacking Museums Count

A few years ago (has it been that long already??) I wrote about the DPLA Beta Sprint we created for the IMLS Digital Collections and Content Project (see: 1, 2, 3). As part of the sprint, I created Linked Data representations for the IMLS DCC Collection-level records. A portion of those records included basic information about contributing IMLS DCC partners. Behind the scenes this data was used to maintain relationships with partners, but we also started using this information to build browse features and visualizations of what the collection looked like (see the current IMLS DCC interface, my paper on Collections Dashboards).

In the course of this project I discovered that there is a fundamental ontological difference between how museums and libraries are represented in the current Linked Data cloud. It was pretty easy to reconcile library entities, because the Public Library Survey data had been ingested into Freebase. Museums were much more hit-or-miss. Looking closer at the data, I realized that libraries were usually represented as a kind of organization, but museums were considered a kind of building. This may be because much of the information about museums in the U.S. is derived from the National Register of Historic Places dataset that was also ingested into DBpedia/Freebase.

As part of the National Civic Day of Hacking, IMLS has issued the Museum Data Challenge.   Included in the challenge is a minimal set of data on 35,000 museums.   I don’t think I’ll be able to participate directly this weekend, so I thought I’d take a look at the data that IMLS has released and see what I can do to make someone else’s hacking easier this weekend. Also included in the IMLS challenge is the Public Library Service Data (and data from the work of my colleague Christie Koonz,  imaplibraries.org).  Also check out the DPLA Challenge  and the Pocket Archivist Mobile Challenge from NARA.

Goals for the week:

  1. Do any clean up needed. (right now the data *looks* pretty clean, but the challenge suggests that their names and geolocation may be faulty).  The fact that much of the LOD about museums is from NRHP might allow me to identify inconsistencies in the IMLS dataset.
  2. Convert the CSV data into Linked Data.
    1. Identify appropriate Linked Data properties for this data (see below).
    2. Transform the CSV into JSON-LD
      1. publish on GitHub
  3. Associate these representations of museums as organizations with representations of museums as buildings in the current Linked Data cloud.
    1. Submit data to dbPedia/Freebase

Here’s a start on identifying LOD properties for the IMLS data release:

IMLS Field | Description | LOD Property | LOD Comment
id | unique identifier | | This is just an autogenerated ID number. Unclear whether this has any meaning to IMLS.
name | institution name | skos:prefLabel | per Organization ontology. Alternate: v:organisation-name
address | institution street address | v:street-address | vCard
city | institution city | v:locality | vCard
state | institution state | v:region | vCard
zip | institution zip code | v:postal-code | vCard
zip4 | institution zip+4 | v:postal-code | vCard
longitude | longitude, decimal degree format, World Geodetic System Datum 1984 | wgs84_pos#long | wgs84
latitude | latitude, decimal degree format, World Geodetic System Datum 1984 | wgs84_pos#lat | wgs84
phone | phone number | v:tel | vCard
duns | DUNS number, Dun & Bradstreet Numeric Identifier | org:identifier | There doesn’t seem to be an RDF property for DUNS numbers yet. Is there a better way to differentiate DUNS from EIN?
ein | EIN number, Federal Employer Identification Number | org:identifier | There doesn’t seem to be an RDF property for EIN numbers yet.
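As a sketch of the CSV-to-JSON-LD goal, the field mapping can be written down as a dictionary and applied row by row. The namespace URIs are the standard SKOS, vCard, and WGS84 ones; the base URI, the function name row_to_jsonld, and the decision to leave out duns and ein (no settled property yet) are assumptions made for illustration. Longitude is paired with geo:long and latitude with geo:lat.

```python
import json

CONTEXT = {
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "v": "http://www.w3.org/2006/vcard/ns#",
    "geo": "http://www.w3.org/2003/01/geo/wgs84_pos#",
}

# IMLS column name -> compact property name.
FIELD_TO_PROPERTY = {
    "name": "skos:prefLabel",
    "address": "v:street-address",
    "city": "v:locality",
    "state": "v:region",
    "zip": "v:postal-code",
    "longitude": "geo:long",
    "latitude": "geo:lat",
    "phone": "v:tel",
}


def row_to_jsonld(row, base="http://example.org/museum/"):
    """Build a JSON-LD document (as a dict) for one CSV row, skipping empty cells.

    The base URI is a placeholder, not an authorized IMLS namespace.
    """
    doc = {"@context": CONTEXT, "@id": f"{base}{row['id']}"}
    for field, prop in FIELD_TO_PROPERTY.items():
        value = (row.get(field) or "").strip()
        if value:
            doc[prop] = value
    return doc


def to_json(doc):
    """Serialize the document for publishing, e.g. to a file on GitHub."""
    return json.dumps(doc, indent=2, ensure_ascii=False)
```

Keeping the mapping in one dictionary makes it easy to revise as better properties turn up for the remaining fields.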

The IMLS data also includes the following fields,  though I haven’t been able to identify any LOD properties for these yet.  This is actually a bit surprising, since you’d think that U.S. Census data (or at least the properties of Census data) would be a solved problem by now.  For the moment, the information above seems like enough of a start, so I’ll leave these aside.


fipst FIPS State code
fipsco FIPS county code
centract seven character Census tract number
cenblock four character Census block number
fipsplc five-digit place FIPS code
fipsmcd five-digit MCD (Minor Civil Division) FIPS code
fipsmsa four-digit MSA (Metropolitan Statistical Area) FIPS code
cbsa five-digit CBSA code that identifies a CBSA area.
metrod five-digit Metropolitan Division Code
microf micropolitan flag: … Metropolitan Area or a “1” indicating a Micropolitan area
mattype geocoding match type

Next up, I’ll discuss in more depth how museums have been modeled in the current Linked Data environment and suggest some possible models for the IMLS dataset.

Next: What is a Library/Archive/Museum According to Linked Data? 

05/17/11

On the ways (Part II)

"Galen L. Stone" Interior view, ribbing of tug under construction. Delaware Public Archives

Tonight I decided to go back to my large table of >1,000 ships and continue doing some clean-up. However, instead of trying to edit the values I had, I gave the Google Refine/Freebase reconciliation service a try. Boy-howdy I really should have taken @jonvoss’s advice and done this sooner. I pretty quickly whipped through my Ship Type column and matched to the /boats/ship/ship_type vocabulary in Freebase. Like any classification task, I think some of the subtleties of my data get lost. For the moment I think that’s OK, but if you care about the difference between a sidewheel paddle steamer and a sternwheel paddle steamer, be aware that they’ve been lumped together under the same class.

The reconciliation tool made quick work of matching to the companies that make up the majority of the shipyards in my database, which are mostly the late 19th and 20th century yards. There’s not much of a record about individual vessels from the earlier yards. In the few cases where there wasn’t a match, I asked Freebase to create a new topic. (I’ll go back later and see how this populates Freebase itself.) I also did this for the Owners column, which was able to match a smaller number of organizations and people. I don’t know whether I’m being a little cavalier about making new topics on Freebase, but this seems like the easiest thing to do. (I should probably be keeping better notes about what new things I’m creating – it would be nice to get some sort of report/e-mail with all those things listed). The latter part has been slow going due to a bug in Refine that takes you back to the first row after reconciling a row that may be deep in your data. It helps to select the (none) facet that removes rows for which judgements have been assigned and use additional facets to narrow things down. While I’ve cut this list of owners down significantly I’m still looking at a long-tail of about 400 unmatched entities. (many are individuals whose first names are abbreviated – with a little googling I can find many of them and expand the names).

A resource that is proving useful to double-check my work is Shipbuilding History, which includes lists of vessels from the Wilmington yards. Tim seems to have collected some information I’m missing, so I’m thinking about the best way to reconcile his information with mine. There are other lists of vessels that are currently not linked data, but are large tables on the web. Perhaps a screenscraper might make quick work of turning those into linked data graphs that can be merged with my graphs.
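A minimal screenscraper sketch, using only Python’s standard html.parser, that turns a simple HTML table into a list of row dictionaries. It assumes a single flat table whose first row holds the column headers; real pages would likely need more handling (nested tables, colspans, encoding), and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def scrape_table(html):
    """Return one dict per data row, keyed by the header row's cell text."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]
```

The resulting dictionaries could then go through the same clean-up and reconciliation steps as the rest of the table before being turned into graphs.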

But I think I’ve hit my limit for tonight. (I’ve been grading all day, so more than 14 hrs of staring at a screen is probably enough – time to hit my bunk).

On the Ways (Part I)

05/9/11

On the ways

Tugboat "Neptune" under construction at Jackson and Sharp. Jackson and Sharp Collection. Courtesy Delaware Public Archives

This will be a short post this week as I’ve decided to use my “study night” to dip my toes into LaTeX tonight (exploring actual dissertation production workflows, weee!).

ways (n.): df. structure consisting of a sloping way down to the water from the place where ships are built or repaired

After creating records for various agents involved in my data, there remains the data about the vessels themselves. Again sticking to a pragmatic exploration, I’ll be using the Freebase Ship schema for this data. Thankfully many of the properties listed here are properties that I already have in my database. I’m still working with Google Refine to clean up my data, but here is how the properties will map.

My Data | Freebase Property
ShipName | type.object.name
HullNo | I don’t see a Freebase property for this field in my data (e.g. DE 107 and/or a particular ID assigned by a shipyard; see the Hagley photos for examples at P&J).
ShipType | boats/ship_class/ship_type
HullType | (boats/hull_configuration?) there are no instances of this property in Freebase
ShipPower | boats/ship/means_of_propulsion
LOA | boats/ship/length_overall
Beam | boats/ship/beam
Displace | boats/ship/displacement
Tonnage | (same as displacement?) hmm..
Draft | boats/ship/draught
LaunchDate | boat/ship/launched
Fate | boat/boat_fate
Yard | boats/ship/ship_builder
Designer | boats/ship/designer
Owner | boats/ship/owners

There seem to be some properties that are part of different, but related Freebase schemas (e.g. boats/ship/displacement and boats/ship_class/displacement_tons) that I need to sort out. There are also other Freebase properties that don’t map directly to columns in my table (e.g. notableFor), but may be useful for adding some of the stories around vessels found in my book, in a comments field with general notes, and in the banker’s box of research notes that haven’t seen the light of day.

I see another chicken/egg problem looming.  While I have most of the shipyards in my previously created corporate.rdf file and individual shipwrights will follow in people.rdf, I also need to add the individuals/corporations who are owners/agents.  In an early post @jonvoss pointed me towards a Google Refine plugin to rectify linked data.  This seems like a good place to try and deploy this at scale. (hand-crafting a few records for the yards wasn’t too hard,  but there are hundreds of owners/buyers, etc.).  I remain a little skeptical that this will work well.  From what I’ve seen so far of Freebase,  big popular things are represented  but things in the long tail are not. (there’s a research question in there somewhere).

I am also considering using an opaque URI for these vessels, perhaps based on an auto-generated ID number.  The ship names in my database overlap quite a bit, making using names directly in URIs a little dangerous (the same can probably be said for people/corporate names at a global scale – not so much for my limited set).   It may also be possible to use a combination of fields to generate opaque URIs (for example, see Styles, Ayers & Shabir (2008) Semantic MARC, MARC21 and the Semantic Web. LDOW2008.)
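One possible way to derive an opaque URI from a combination of fields is to hash a normalized form of those fields and use a slice of the digest as the identifier. This is only a sketch under assumptions: the base URI, the choice of SHA-1, the 12-character length, and the function name opaque_uri are all illustrative, and this is not necessarily the scheme described in the paper cited above.

```python
import hashlib


def opaque_uri(fields, base="http://example.org/vessel/"):
    """Mint a URI from several field values without exposing the ship name.

    fields: values that together identify a vessel (e.g. name, yard, hull number)
    base:   placeholder namespace; replace with a domain you control
    """
    # Normalize case and whitespace so trivial variations give the same URI.
    normalized = "|".join(" ".join(str(f).lower().split()) for f in fields)
    digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()[:12]
    return f"{base}{digest}"
```

Two vessels that share a name but differ in yard or hull number would get different URIs, while the same record always maps to the same URI. The trade-off is that correcting any of the hashed fields changes the identifier, which an auto-generated ID number would avoid.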

05/2/11

American Clyde


The concentration of shipyards that stretched along the Delaware River (from Wilmington north through Philadelphia, PA, Camden and Trenton, NJ) earned it the nickname “American Clyde”  after the  River Clyde in Scotland.   (read a contemporary account “American Clyde” from Harpers 1878).

Early shipwrights tended to be a few well known people, but as shipbuilding became more capital intensive it gave rise to companies and corporations that could organize finances, labor forces and materials for larger efforts.

I’m already noticing that my RDF creation will be an iterative process – I can’t associate a person with a company until I’ve figured out what that company’s URI will be. Here’s where working with a system would come in handy – though I’m not sure how they solve the chicken-and-egg problem of referring to an entity until a URI is minted. Having a clear convention for URIs may be one way around that problem. At the moment I’m just minting hash URIs based on names. Personal (firstname_lastname) URIs are a little easier, so I can link corporate records to future people records.

URIs based on names obviously create a problem for companies that merge, divide, incorporate, etc. This problem seems to lend credence to creating opaque URIs at some point. For now, I’ll stick with the convention of names in my URIs, but will try to link my records with existing Freebase URIs if they exist.

How to do that well may be a bit of a problem. This example graph for Pusey and Jones from Freebase exhibits many of the problems outlined in Halpin and Hayes (2010) When owl:sameAs isn’t the Same: An Analysis of Identity Links on the Semantic Web. While there is a continuation from the earlier Pusey and Jones Company to the later Pusey and Jones Corporation, these two entities are separated in time and legal status. Chasing down these differences is one of the fun parts of archival research (says the masochist in me). While you might see the difference only in a preferred form of a name, changes in name may also be involved with change of location, change of business type, etc. I’ve taken the easy route here – minor changes of name and leadership have been left as a single entity (using the previous names property to record changes). Major acquisitions require a new record (with the two entities linked). Later I’d like to come back to this question through the eyes of EAC-CPF, which may be better tuned for these kinds of subtle changes. Also, how complicated you want to make this probably depends on what you’d like to do with the RDF. Freebase/Wikipedia/DBpedia take a pretty high-level approach, which may mean that it will be of limited use for certain kinds of analysis.

Despite these problems, the Freebase properties still seem like a place to start since they have properties that will link legal/conceptual entities together. Many of the available properties are listed in Freebase, but they haven’t been completed for Pusey & Jones. I played around a little with the Freebase editing and even added a few values, but in the end created my own RDF graphs for most of the major Wilmington shipyards. These are pretty simple stubs, with names, start and end dates and references to individual founders (they still need to be added to the people file).

Oh, one last thing.  Validate, validate, validate.  I caught a few minor errors by running this through the W3C RDF Validation service.

03/28/11

Scanning the Horizon

Before I get started building my own data for this project it seems like it would be useful to see what linked data is already available and what kinds of properties are being assigned to each of the entities I’ve identified. To start I’ve only looked at dbPedia, although it also includes some links to Freebase. I would be interested to hear if there are other common ways to do some due diligence in the growing LOD cloud before creating new contributions.

People

I didn’t find any of the key players in Wikipedia or dbPedia,  although they may be mentioned in the articles for shipyards below. In addition to FOAF, dbPedia has additional properties for relationships between people and companies. (e.g. see Andrew Carnegie)

Companies

Most of the big yards are represented, but all of the smaller, earlier shipyards are absent.  I’ll probably start with these existing records, making sure the companies are equally represented and work on some of the other yards later.

Bethlehem, Dravo and ACF were very large corporations with many divisions and multiple shipyards.  These descriptions point to the larger entity, but not to the subdivision. Perhaps a little archival context needed here?

Vessels

This is just a small sample (for more, see Ships Built in Delaware), but a useful example of the properties represented in DBpedia (vessels are in the Ships class, but the properties don’t seem to be explicitly associated with that class. I’ll need to do some more digging into how the classes/properties are defined here). Fortunately, my database has many of the same properties, which should make mapping it easy. I’m currently working with Google Refine to clean up what I have.

I am confused about how these graphs link to the graphs for companies above.  For example,  the description of the U.S.S. Louisiana includes a link to Harlan & Hollingsworth in its Wikipedia Infobox but the dbPedia entry just has a literal.

Events

There are properties for both companies and vessels that represent events (like founding, launching, decommissioning, etc.). There are also Wikipedia categories such as Companies Founded in 1899 which seem a little redundant if you have formatted information (though the “Ships Built in Delaware” category is useful, it seems like there should be another way to do this).

Locations

While the records above indicate the companies were located in Wilmington, DE, none of them have a specific geolocation. I whipped up a quick Google Map based on my original fly-leaf illustration. This was a quick start; I’ll need to translate these into latitude & longitude to add to the descriptions. I don’t know whether it’s possible to use other shapes to indicate the full extent of some of the yards that stretched along the waterfront.

