06/17/13

Introducing LODLAM Patterns

Crossposted at LODLAM Summit 2013:  Introducing LODLAM Patterns

Linked Data provides us with an incredible opportunity to re-think how we approach sharing information about LAM collections. However, these opportunities are also fraught with danger and important challenges that we must face. Translating existing standards into compliant Linked Data will take more than just cross-walking terms with similar meanings; it also means mapping between conceptual models and ontologies. Linked Data also provides us new opportunities to mix models and vocabularies in ways that we haven’t been able to do before. How can we take better advantage of these opportunities?

Ultimately, creating Linked Data standards and practices is a set of design problems that we are all engaged in. Elizabeth Churchill has called for “Data Aware Design” and the need to bring human-computer interaction methods to bear on these problems. At the Summit I will be presenting a Dork Short about a new site that I’m launching to do just this. LODLAM Patterns will identify Linked Data design patterns (which I’m calling representation patterns) for cultural heritage resources. The idea is to identify common problems that we are trying to solve and link them to the solutions that are available across the many, many standards for describing LAM resources. My goal is to create a resource that will spur discussions focused on problems/solutions, give newcomers a way to navigate the LOD standards universe, and serve as a pedagogical tool to teach “design-thinking” for Linked Data.

Participate by signing up at http://lodlampatterns.org or follow along @lodlamp or #lodlamp.

06/5/13

Reconciling Museums Count

From the tweets, it sounds like there were several interesting projects working with the Museums Count data.  Dylan Barrtlet combined the IMLS data with IRS data to clean up some of the address info.   Michael Girarldo mashed up the data with the public library data for a proof-of-concept mobile app that would help users locate museums and libraries in their vicinity.

I’ve continued to play around with this data using OpenRefine and the DBpedia SPARQL endpoint.  Attempting to reconcile against no type, I found approximately 19% of the museums in the IMLS data.

Doing a spot check of the unmatched entities:

  • if it is a simply named entity,  it’s not in Wikipedia/DBpedia
  • it’s an organization that operates a museum of a different name or has a different legal name than the one it’s known by in DBpedia. e.g:
  • if it is a complex name (i.e. dirty IMLS data),  it’s not matched
  • abbreviations in the name that cannot be matched (e.g. Ntnl (national),  Ctr (center), Hist (history/historical), Inc., etc.) or conjunctions/punctuation e.g. ( & for and).
Some of these problems might be fixed by cleaning up the names,  but some of the disambiguation may require human intervention.   I haven’t tried looking up entities in DBpedia by address, but that might also help identify things that are uncertain.
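
The name clean-up mentioned above could be sketched roughly as follows. This is only a minimal illustration in Python, not something I ran against the IMLS data: the abbreviation table is a hypothetical one built from the examples in the list, and dropping "Inc" is my own assumption.

```python
import re

# Hypothetical abbreviation table taken from the examples in the list above;
# a real table would be compiled by inspecting the IMLS names.
ABBREVIATIONS = {
    "ntnl": "national",
    "ctr": "center",
    "hist": "historical",  # could also mean "history"; needs human review
}
# Legal suffixes dropped before matching (an assumption, not IMLS guidance).
DROP_TOKENS = {"inc"}


def normalize_name(name: str) -> str:
    """Return a lowercased, abbreviation-expanded form of a museum name."""
    text = name.lower().replace("&", " and ")
    # Keep only ASCII letters, digits and spaces so "Inc." and "Inc" compare equal.
    text = re.sub(r"[^a-z0-9 ]+", " ", text)
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in text.split()]
    return " ".join(tok for tok in tokens if tok not in DROP_TOKENS)


print(normalize_name("Ntnl Hist Ctr & Museum, Inc."))
```

The normalized form would only be used as a matching key; the original name should be kept for display, and ambiguous expansions such as "Hist" still need a human eye.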
Some of these mismatches do raise interesting ontological questions and get back to the issue of a Museum (organization) vs. Museum (a place). It looks like there are lots of unreconciled houses, historic sites, etc. with different legal names than the places they are associated with. What will be the best way to represent/associate these entities?

 Mismatches

Looking at the “matched” items, I found that lots of generically named museums were matched to a specific museum. For example, there are many “Art Gallery” things (usually at this or that college) that were all matched to the same DBpedia entity. Likewise, there are about 15 different “Pioneer Museum,” “Museum of Natural History,” “Museum of Anthropology,” “Museum of Art,” “University Museum,” “Cowboy Hall of Fame.” Another area of mismatch is county/city historical societies where the locality has the same name (e.g. 12 different “Douglas County Historical Society” entries all matched to the Douglas County Historical Society in Nebraska).

There are also multiple sites that are maintained by a single entity, like a state museum network or a city. For example, The Peale Museum (aka Municipal Museum of Baltimore) and the Edgar Allan Poe House and Museum are simply listed here as “City of Baltimore.” It was necessary to look up the addresses to see what museum entity was there.

So clearly, a more robust approach to reconciliation is needed, perhaps including the city/state of an entity in order to disambiguate similar names.
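
As a rough sketch of what including city/state might look like, the Python below matches on a composite (name, city, state) key and refuses to guess when the key is missing or ambiguous. The rows, the candidate list, and the example.org URI are all made up for illustration; this is not the OpenRefine reconciliation I actually ran.

```python
from collections import defaultdict

# Illustrative rows standing in for the IMLS CSV (values are placeholders).
imls_rows = [
    {"name": "Douglas County Historical Society", "city": "City A", "state": "NE"},
    {"name": "Douglas County Historical Society", "city": "City B", "state": "KS"},
]
# Illustrative candidates, as might be gathered from DBpedia; the URI is hypothetical.
candidates = [
    {
        "uri": "http://example.org/dchs-nebraska",
        "name": "Douglas County Historical Society",
        "city": "City A",
        "state": "NE",
    },
]


def key(record):
    """Composite matching key: name plus city and state."""
    return (
        record["name"].strip().lower(),
        record["city"].strip().lower(),
        record["state"].strip().upper(),
    )


def reconcile(rows, candidates):
    """Pair each row with a candidate URI, or None if there is no single match."""
    index = defaultdict(list)
    for candidate in candidates:
        index[key(candidate)].append(candidate["uri"])
    results = []
    for row in rows:
        uris = index.get(key(row), [])
        # Accept only unambiguous matches; everything else goes to human review.
        results.append((row, uris[0] if len(uris) == 1 else None))
    return results


for row, uri in reconcile(imls_rows, candidates):
    print(row["state"], uri)
```

With a name-only key both rows would have collapsed onto the Nebraska entity, which is the mismatch described above.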

Lots of challenges here, but there also seem to be lots of opportunities to add to and enrich museum representation in Linked Data/wiki resources.

05/31/13

Quick Museum Counts update

oh well, best laid plans…

Didn’t quite get as far as I’d hoped this week. Following Justin’s comments in the previous post, I did a quick mapping to the Organization Ontology, vCard, WGS84 and schema.org (seems to be the only one with a DUNS property).

A sample in Turtle looks like this (you can also download the full set as Turtle):

@prefix schema: <http://schema.org/> .
@prefix v: <http://www.w3.org/2006/vcard/ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix org: <http://www.w3.org/ns/org#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
<http://chi.cci.fsu.edu/resource/museum/imls_0> a org:Organization ;
 skos:prefLabel "Carmel Valley Historical Society" ;
 org:hasSite <http://chi.cci.fsu.edu/resource/site/imls_0> .
<http://chi.cci.fsu.edu/resource/site/imls_0> a org:Site ;
 geo:long "-121.726912" ;
 geo:lat "36.47231" ;
 v:street-address "PO BOX 1427" ;
 v:locality "Carmel Valley" ;
 v:region "CA" ;
 v:postal-code "93924" .

Since I’m not aware of an authorized URI for IMLS resources, I just minted one on my own domain. However, at this time these don’t resolve to anything. Ideally I might be able to reconcile many of the city/zip codes to a previously published URI for a place.

I’ll be offline for most of tomorrow, but will try to check in the evening.

05/30/13

What is a Library/Archive/Museum According to Linked Data?

In my previous post,  Hacking Museums Count,  I introduced the data that IMLS has released for the National Civic Day of Hacking and took a first stab at assigning LOD properties to the data they provided.     This was based on some of my previous work mapping IMLS DCC data.  Following my initial observations,  I decided to take a closer look at how libraries, archives, and museums are currently modeled in Linked Data resources such as Freebase, DBpedia, and Schema.org.

What is a Library?

“A library…is an organized collection of information resources made accessible to a defined community for reference or borrowing.” (Wikipedia)
  • /organization/local business/library (Schema.org)
  • /organization/educational institution/library (DBpedia)
  • /place/Architectural Structure/Building/library (DBpedia)
  • /architecture/building function/ (Freebase/Wikipedia)
  • /library (Freebase)

What is an Archive?

“An archive is an accumulation of historical records, or the physical place they are located.” (Wikipedia)

  • notably, Schema.org does not have an explicit class for archives
  • /organization/government agency/ (DBpedia)
  • /archive (DBpedia), although this class doesn’t seem to be associated with other top-level classes. Wikipedia/DBpedia resources are given as instances of this class.
  • /organization/organization_type (Freebase)
  • /organization/organization_sector (Freebase)
Best one yet….  /fictional organization type! (Freebase)

What is a Museum?

“A museum is a building or institution dedicated to the acquisition, conservation, study, exhibition, and educational interpretation of objects having scientific, historical, cultural or artistic value.” (Wikipedia).

  • /place/civicStructure/Museum (Schema.org)
  • /place/Architectural Structure/Building/Museum (DBpedia)
  • /architecture/museum (Freebase)

This is only a partial listing of the kinds of things a LAM can be according to these linked data sources. It raises some interesting questions about how to properly model the IMLS data. What is IMLS’s view on this information? The emphasis of the total dataset is on location (address, lat/long, Census Block, etc.), so perhaps this is reflected in how LAMs are represented in Linked Data. I’ll need to go back and revisit my assumptions about LAMs being organizations that are associated with a structure of some kind.

So what?

As we talk about the convergence of LAMs,  the different ways that each sector has been represented as top-level Linked Data classes raises some interesting questions.

  • What does this tell us about public (well…LD public) perception of LAMs?  The narrative definitions on Wikipedia pages seem incongruent with the ontological classes. (is a library a collection? an organization? a building?)
  • Much of our attention as professionals is directed at representing our collections.  What is our responsibility to ensure that LD concepts reflect our understanding of what we do?
  • What is the impact of these specifications on our other Linked Data work? Given some of the choices that have been made so far, it could lead to some unexpected inferences:
    • <A Building> <hasCopyright> <An Image>
    • <A Painting> <isPhysicallyLocatedIn> <An Organization>
    • <A Person> <employedBy> <A Place>
The danger here (as with much Linked Data) is that we are talking about a few different entities that overlap.   The Museum (Legal Entity);  The Museum (a building); The Museum (a functional role as an organization that collects stuff).  Perhaps by looking at other kinds of entities that are similar/different to museums (i.e. non-profits, performing arts, businesses, etc.) we can see some alternatives that neatly address this problem.

 

05/28/13

Hacking Museums Count

A few years ago (has it been that long already??) I wrote about the DPLA Beta Sprint we created for the IMLS Digital Collections and Content Project (see: 1, 2, 3). As part of the sprint, I created Linked Data representations for the IMLS DCC Collection-level records. A portion of those records included basic information about contributing IMLS DCC partners. Behind the scenes this data was used to maintain relationships with partners, but we also started using this information to build browse features and visualizations of what the collection looked like (see the current IMLS DCC interface, my paper on Collections Dashboards).

In the course of this project I discovered that there is a fundamental ontological difference between how museums and libraries are represented in the current Linked Data cloud. It was pretty easy to reconcile library entities, because the Public Library Survey data had been ingested into Freebase. Museums were much more hit-or-miss. Looking closer at the data, I realized that libraries were usually represented as a kind of organization, but museums were considered a kind of building. This may be because much of the information about museums in the U.S. is derived from the National Register of Historic Places dataset that was also ingested into DBpedia/Freebase.

As part of the National Civic Day of Hacking, IMLS has issued the Museum Data Challenge. Included in the challenge is a minimal set of data on 35,000 museums. I don’t think I’ll be able to participate directly this weekend, so I thought I’d take a look at the data that IMLS has released and see what I can do to make someone else’s hacking easier this weekend. Also included in the IMLS challenge is the Public Library Survey data (and data from the work of my colleague Christie Koonz, imaplibraries.org). Also check out the DPLA Challenge and the Pocket Archivist Mobile Challenge from NARA.

Goals for the week:

  1. Do any clean up needed. (right now the data *looks* pretty clean, but the challenge suggests that their names and geolocation may be faulty).  The fact that much of the LOD about museums is from NRHP might allow me to identify inconsistencies in the IMLS dataset.
  2. Convert the CSV data into Linked Data.
    1. Identify appropriate Linked Data properties for this data (see below).
    2. Transform the CSV into JSON-LD
      1. publish on GitHub
  3. Associate these representations of museums as organizations with representations of museums as buildings in the current Linked Data cloud.
    1. Submit data to DBpedia/Freebase

Here’s a start on identifying LOD properties for the IMLS data release:

IMLS Field | Description | LOD Property | LOD Comment
id | unique identifier | | this is just an autogenerated ID number. Unclear whether this has any meaning to IMLS.
name | institution name | skos:prefLabel | per Organization ontology. Alternate v:organisation-name
address | institution street address | v:street-address | vCard
city | institution city | v:locality | vCard
state | institution state | v:region | vCard
zip | institution zip code | v:postal-code | vCard
zip4 | institution zip+4 | v:postal-code | vCard
longitude | longitude, decimal degree format, World Geodetic System Datum 1984 | wgs84_pos#long | wgs84
latitude | latitude, decimal degree format, World Geodetic System Datum 1984 | wgs84_pos#lat | wgs84
phone | phone number | v:tel | vCard
duns | DUNS number, Dun & Bradstreet Numeric Identifier | org:identifier | there doesn’t seem to be an RDF property for DUNS numbers yet. Is there a better way to differentiate DUNS from EIN?
ein | EIN number, Federal Employer Identification Number | org:identifier | there doesn’t seem to be an RDF property for EIN numbers yet.
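
To make the mapping above concrete, here is a minimal Python sketch of the CSV-to-JSON-LD step. It assumes the CSV headers match the field names in the table, uses a placeholder example.org base URI, and puts everything on a single node for simplicity; the DUNS/EIN fields are left out because their modeling is still an open question.

```python
import csv
import io
import json

# Prefixes for the vocabularies named in the table above.
CONTEXT = {
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "v": "http://www.w3.org/2006/vcard/ns#",
    "geo": "http://www.w3.org/2003/01/geo/wgs84_pos#",
    "org": "http://www.w3.org/ns/org#",
}

# CSV column -> property, following the table above.
FIELD_TO_PROPERTY = {
    "name": "skos:prefLabel",
    "address": "v:street-address",
    "city": "v:locality",
    "state": "v:region",
    "zip": "v:postal-code",
    "longitude": "geo:long",
    "latitude": "geo:lat",
    "phone": "v:tel",
}

BASE = "http://example.org/museum/imls_"  # placeholder, not an authorized URI


def row_to_jsonld(row):
    """Turn one CSV row (a dict) into a JSON-LD document, skipping empty fields."""
    doc = {"@context": CONTEXT, "@id": BASE + row["id"], "@type": "org:Organization"}
    for field, prop in FIELD_TO_PROPERTY.items():
        value = (row.get(field) or "").strip()
        if value:
            doc[prop] = value
    return doc


# A tiny in-memory CSV; the column names are assumed to match the IMLS release.
sample = io.StringIO(
    "id,name,address,city,state,zip,longitude,latitude\n"
    "0,Carmel Valley Historical Society,PO BOX 1427,Carmel Valley,CA,93924,-121.726912,36.47231\n"
)
for row in csv.DictReader(sample):
    print(json.dumps(row_to_jsonld(row), indent=2))
```

Whether the address and coordinates belong on the organization itself or on a separate site resource is a modeling decision this sketch deliberately avoids.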

The IMLS data also includes the following fields,  though I haven’t been able to identify any LOD properties for these yet.  This is actually a bit surprising, since you’d think that U.S. Census data (or at least the properties of Census data) would be a solved problem by now.  For the moment, the information above seems like enough of a start, so I’ll leave these aside.


IMLS Field | Description
fipst | FIPS State code
fipsco | FIPS county code
centract | seven character Census tract number
cenblock | four character Census block number
fipsplc | five-digit place FIPS code
fipsmcd | five-digit MCD (Minor Civil Division) FIPS code
fipsmsa | four-digit MSA (Metropolitan Statistical Area) FIPS code
cbsa | five-digit CBSA code that identifies a CBSA area.
metrod | five-digit Metropolitan Division Code
microf | micropolitan flag [...] Metropolitan Area or a “1” indicating a Micropolitan area
mattype | geocoding match type

Next up, I’ll discuss in more depth how museums have been modeled in the current Linked Data environment and suggest some possible models for the IMLS dataset.

Next: What is a Library/Archive/Museum According to Linked Data? 

09/21/12

Publishing and Using Linked Data at DHWI

In January I will be conducting a week-long workshop on Publishing and Using Linked Data as part of the Digital Humanities Winter Institute at the Maryland Institute for Technology in the Humanities. Space is still available, so register today!

The publication of structured knowledge representations and open data on the Web opens new possibilities for collaboration among humanities researchers and cultural heritage organizations. This course will introduce participants to the core principles of Linked Open Data (LOD), techniques for building and understanding LOD models, how to locate LOD sources for research, tools for manipulating, visualizing, and integrating available data, and best practice methodologies for publicizing and sharing datasets.

For this course I will be drawing from initial work done by the Learning Linked Data project at the University of Washington iSchool, which has laid out a core inventory of learning topics.  The LOD for Libraries, Archives, and Museums community has also been actively promoting access to increasing amounts of cultural heritage information via Linked Data approaches.  Some of the questions we’ll be exploring in the workshop are:

  • what does the digital humanities community need from linked data
  • what use can we make of these large data sets
  • how can we synchronize scholarly work with the larger linked data community.

To help gain momentum for the workshop, I’ve created a wiki called Linked Data for Humanities, where I will be sharing drafts of the syllabus, resources, and example humanities projects. (A big hat tip to Mia Ridge and the Museums and the Machine Processable Web wiki, which has been an important resource for the LODLAM community.) If you have a humanities-based Linked Data project, questions, comments, or recommendations for things the course should cover, please join in the conversation.

08/11/11

Reconciliation Recap

@jonvoss asked what I’d been up to related to reconciling my data, so here’s a brief account of what I’ve done over the last few weeks.   Much of this is proof-of-concept that will result in recommendations about what IMLS DCC might have to do to move towards Linked Open Data in the future. There are probably more efficient ways to program these tasks, but for the moment I’m using some simple tools that work for me.

In my previous post, I shared an example collection-level record set as RDF. I’ve gone back and simplified this transformation to leave out the representations of institutions and projects. Turns out the URIs that are present will resolve to a vCard RDF representation. e.g. http://imlsdcc.grainger.uiuc.edu/Registry/Institution/?1316 will return some XML. Maybe not the best representation, but we can work on that as a separate problem. This has the benefit of making the CLD instances simpler. I made a small change that will still associate a project with a funding agency (to demonstrate the contributions they’ve made).

Using the SIMILE Gadget tool, I’ve also extracted unique terms & frequency counts from the CLD records(1). These terms/frequencies are then imported into Google Refine and reconciled against appropriate LOD data:

Using Freebase has been pretty painless.  When a column of terms is reconciled,  Refine stores the Freebase ID.   To get the Freebase URI,  simply create a “New Column Based on This One” using the following GREL

"http://rdf.freebase.com/ns/m/"+cell.recon.match.id

Using this Freebase URL I can replace the literal statement

<dcterms:spatial>Illinois (state) </dcterms:spatial>

with a linked data statement:

<dcterms:spatial rdf:resource="http://rdf.freebase.com/ns/m/03v0t" rdfs:label="Illinois (state)" />

Reconciling against id.loc.gov has been more difficult. From my literal values I can create a query string that (sometimes) fetches the correct set of triples for a term. This works for most of our terms, though a few uncontrolled terms contributed by participants don’t match. e.g. http://api.talis.com/stores/lcsh-info/items?query=preflabel:photographs&max=1

It is a little sensitive to plural/singular terms. For example the difference between “scrapbooks” and “scrapbook.” Most terms are plural, but there seems to be some distinction I don’t understand between Painting and Paintings.

In Refine I can pull back the RDF for these terms, but am still working out how I might extract the canonical concept URI for each term. This looks like it will require parsing the RDF to get the right URI out of it. If anyone has a good cookbook for reconciling terms with id.loc.gov URIs, I’d love to see it. (Something using the Refine ReconciliationServiceAPI would be swell.)
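
For what it’s worth, the parsing step might look something like the Python sketch below. It assumes the returned RDF/XML describes each term as a typed skos:Concept node with a skos:prefLabel, which I have not verified against the actual service output; the sample document and its example.org URIs are invented.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS = "http://www.w3.org/2004/02/skos/core#"

# Invented sample; real responses may be shaped differently
# (e.g. rdf:Description nodes with an rdf:type instead of skos:Concept).
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept rdf:about="http://example.org/authorities/placeholder-1">
    <skos:prefLabel xml:lang="en">Photographs</skos:prefLabel>
  </skos:Concept>
  <skos:Concept rdf:about="http://example.org/authorities/placeholder-2">
    <skos:prefLabel xml:lang="en">Photography</skos:prefLabel>
  </skos:Concept>
</rdf:RDF>"""


def concept_uri(rdf_xml, term):
    """Return the rdf:about URI of the concept whose prefLabel equals term, or None."""
    root = ET.fromstring(rdf_xml)
    wanted = term.strip().lower()
    for concept in root.iter(f"{{{SKOS}}}Concept"):
        for label in concept.findall(f"{{{SKOS}}}prefLabel"):
            if (label.text or "").strip().lower() == wanted:
                return concept.get(f"{{{RDF}}}about")
    return None


print(concept_uri(SAMPLE, "photographs"))
```

Matching on the exact preferred label keeps “scrapbook” from silently matching “scrapbooks,” at the cost of leaving those terms unreconciled.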

I may give our subject headings a twirl, but I may need some subject cataloger help there. The published LC authorities include headings like “Cemeteries – Recording” but not localized forms like “Cemeteries – Recording — Illinois.” Since these are all strings in a dc:subject, some way of parsing the subdivisions is needed.
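
One possible way to parse the subdivisions, sketched in Python: split the string on the dash variants that show up in our records, then try progressively shorter headings until one matches a published authority. The set of separators and the fallback strategy are my own assumptions, not cataloging practice.

```python
import re

# Separators seen in heading strings: "--", en dash, em dash, or a spaced hyphen.
# A bare hyphen inside a word (e.g. "hit-or-miss") is deliberately not a separator.
SEPARATOR = re.compile(r"\s*(?:--|\u2013|\u2014)\s*|\s+-\s+")


def split_heading(heading):
    """Split a subject string into its main heading and subdivisions."""
    return [part for part in SEPARATOR.split(heading.strip()) if part]


def candidate_headings(heading):
    """Longest form first, dropping trailing subdivisions one at a time."""
    parts = split_heading(heading)
    return [" -- ".join(parts[:n]) for n in range(len(parts), 0, -1)]


print(candidate_headings("Cemeteries – Recording — Illinois"))
```

Each candidate could then be run through reconciliation in order, with the dropped subdivision (here, the place name) reconciled separately against a geographic vocabulary.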

Update: After posting this, I started playing with my subject headings and found that the LCSH triples were loaded into Freebase in May (http://www.freebase.com/view/topic/en/loc_subject_headings_full_load). Refine will pick them up if you create a “namespaced” reconciliation service (point it at the Library of Congress Namespace). Now, let’s get those names & other vocabularies loaded!

(1) I’ve tried to replicate this in Google Refine, but on my computer it seems to choke on the complex XML record structure. It’s quite happy with large tabular representations though.

07/4/11

Piloting collection-level LOD for IMLS DCC

On Metadata
Since the #lodlam conference, I haven’t had much chance to play around with my shipyard LOD — the dissertation calls. Plus I’m spending about half my time this summer as part of the team working on the CLIR/DLF/IMLS DCC Beta Sprint for the Digital Public Library of America (DPLA).

What follows is a bit of skunkworkery that I’m doing for self-edification & also to help suggest ways we can make IMLS DCC data more LOD friendly.**  Currently people can browse the site at http://imlsdcc.grainger.illinois.edu/history or as XML via OAI-PMH for collection-level and item-level metadata.   As part of the Collection/Item Metadata Working Group (CIMR),  I helped build an RDF testbed that was oriented towards our research problems.

Using some of the stylesheets developed for CIMR, I’ve generated LOD representations for the currently available collection-level records.  When the rubber hits the road like this, there are lots of design choices you can make – in terms of encodings,  which vocabularies to use, etc., etc.  Here is a sample set of records and the XSLT used to generate them from the OAI-PMH, imlsdcc listRecords format.  Some questions:

  • this looks rather complicated.  Maybe that’s OK, as it seems to represent much of the information currently shared publicly by the project. I’d welcome any suggestions for simplifications or better approaches to representing this as LOD.
  • are there best practices for representing organizations as organizations?  FOAF/vCard seem very oriented towards people (who have associations with an organization).  I also picked up the Organization Ontology from Describing Libraries, Their Collections and Services in RDF.
  • Many of the URIs here are just made up for demonstration purposes.
  • There are lots of organizations we have minimal information for. It would be nice to reconcile our URIs with other published URIs for these institutions. What would be the most authoritative source for that LOD?
  • Many organizations aren’t publishing their own “authorized” graphs for themselves.  Is this something a project like IMLS DCC should consider?   I added a stub description of IMLS DCC to this file to demonstrate the relationships between the project and the aggregated collections.
  • Right now this RDF mostly contains the strings found in the original XML.  I would like to reconcile controlled terms where possible to existing LOD vocabularies (like id.loc.gov,  language terms, formats, etc.).  I think that would make this data more “linked.”
  • In theory the XSLT above should still work with the SIMILE OAI-PMH RDFizer

Thanks if you have a chance to take a look and offer comments on this.  And do let me know if you’d like to see more of this kind of data!

** Disclaimer: this is some work I’m doing on the side, on my own. Neither the rdf nor the XSLT should be considered an “official” release by the project. Any mistakes here are mine.