Inherent Vice
inherent vice: n. ~ The tendency of material to deteriorate due to the essential instability of the components or interaction among components.
SAA Glossary of Archival and Records Terminology

Archive for the 'metadata' Category

Archival Research Catalog on Data.gov

Friday, January 29th, 2010

Description Peddlers and Data.gov: Two Peas In a Pod
As you may have heard, the National Archives issued a press release today announcing the release of three data sets on Data.gov:

The first milestone of the Open Government Directive was met on January 22 with the release of new datasets on Data.gov. Each major government agency has uploaded at least three datasets in this initial action. The National Archives released the 2007–2009 Code of Federal Regulations and two datasets from its Archival Research Catalog. This is the first time this material is available as raw data in XML format.

Read more on The Secret Mirror

    and ArchivesNext:

    or Fred2.0

    Original ARC data is available at data.gov

    More fun with Pipes – Champaign Urbana Historic Built Environment

    Sunday, May 3rd, 2009

    Earlier this year I tried to start a “365” project on Flickr. The basic idea is that you take a new photo every day and contribute it to a pool. I’ve been a dismal failure at this so far this year, even after trying to re-start my project by beginning a “Then and Now” project based on the Champaign-Urbana Historic Built Environment collection.

    The Champaign-Urbana Historic Built Environment Photograph Collection offers a selection from the holdings of the Champaign County Historical Archives, which was established as a department of The Urbana Free Library in 1956. Among its holdings of books, manuscripts, and maps, the archives has preserved over 50,000 photographs of local people and locations. This collection provides a sampling of the rich visual history of Champaign-Urbana’s historic built environment in the 19th and 20th century, including images of residential, commercial, governmental, educational, medical, and religious structures, and thus reflects the notion that historic buildings serve as an entryway into the community’s collective memory.

    The Champaign-Urbana Historic Built Environment Photograph Collection is a joint project of the Champaign County Historical Archives at Urbana Free Library and the Library of the University of Illinois at Urbana-Champaign.

    This was as far as I got on this project:

    www.flickr.com

    One thing that was becoming clear is that I needed some easier way to locate the next historic building for me to shoot. Since I was trying to replicate the view in the original photo I’d also need to be able to see it. Champaign has yet to be blessed with 3G, so it was painfully slow to browse to the ContentDM site and try to search for something, scroll through a list, etc. The CUHBE collection DOES include the address of the site when known, but the address has been broken up into two separate fields, neither of which appears in the short display. There had to be an easier way to get to these records.

    Piotr was able to build a Pipe that parsed the OAI_DC output from ContentDM (more coming soon from him on this) into various PIPE formats. This was a good step forward, but I still couldn’t see the addresses of the historic buildings. By adding a string builder module to Piotr’s pipe, I now get the name of the building along with its address. Now, what I’d really like to do is put these locations on a map, but the location builder doesn’t seem to like the addresses in here – I’m sure with a little more poking I can get it to work, stay tuned!
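The string builder step can be approximated outside of Pipes. Here is a minimal sketch in Python, assuming the OAI_DC record carries the building name in `dc:title` and the two address parts in two `dc:coverage` elements. The actual CUHBE field mapping isn't spelled out here, so those element choices, the sample record, and the function name `label_with_address` are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace for simple Dublin Core elements inside an oai_dc record.
DC = "{http://purl.org/dc/elements/1.1/}"

# Hypothetical record: the real CUHBE field mapping may differ.
SAMPLE = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Example Building</dc:title>
  <dc:coverage>100 Example St.</dc:coverage>
  <dc:coverage>Urbana, Illinois</dc:coverage>
</oai_dc:dc>"""


def label_with_address(record_xml):
    """Join the building name with whatever address parts the record has."""
    root = ET.fromstring(record_xml)
    title = root.findtext(DC + "title", default="").strip()
    # Assumption: the split address lives in repeated dc:coverage elements.
    parts = [
        el.text.strip()
        for el in root.findall(DC + "coverage")
        if el.text and el.text.strip()
    ]
    address = ", ".join(parts)
    return f"{title} ({address})" if address else title


print(label_with_address(SAMPLE))
```

Joining the parts into one string is also what a geocoder would want as input, which is probably the next thing to try for the map.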

    Champaign Urbana Historic Built Environment Pipe

    Patchwork Prototyping a Collection Dashboard

    Tuesday, April 14th, 2009

    This is a re-post of a new series of discussions that will be taking place on the IMLS DCC Project Blog. If you’d like to comment, please see the original post.

    The IMLS Digital Collections and Content Interface research group is kicking off a new line of inquiry this week that will explore how we might build a “Collections Dashboard” for the DCC.

    The Problem

    According to user studies that we’ve conducted, users rarely find the full-text collection descriptions that we provide very helpful. The long screens of text scare them away and don’t really help them find what they are looking for. In the current iteration of the interface, if I stumble across an interesting item, it can be difficult to even find my way back to a collection-level description. The problem here seems to be that the notion of how and why collection-level descriptions are created is based on an old model that looks like this:

    A Traditional Path to Items

    But increasingly, the way we find things, particularly in online environments, looks more like this:

    A Digital Path to Items

    Nina Simon takes this notion one step further, by suggesting that we increasingly come at things indirectly through our social network.

    In both of the latter cases a user may lack any understanding of institutional or collection context and may be left wondering just where they’ve ended up. For an aggregation of other people’s metadata like ours, orienting the user of an item towards these contexts can be even more difficult. At present the IMLS DCC contains records from more than 500 collections from 240 different repositories, for a total of more than 900,000 item-level metadata records. Simply flattening this out into a large blob of item-level metadata separates items from their contexts. (Even Google has its PageRank, which organizes what appears at the top of your results list according to each page’s place in the networked world.)

    For certain kinds of users, this kind of context isn’t really what they are interested in. They’ll be happy to find an item and move on to their next search. But for the students and scholars that are our primary focus in this part of the grant, context can be a very important part of their research process. A recent study of scholars who use physical object collections, conducted by the UK’s Research Information Network (RIN), illustrates the problem nicely. Collection-level descriptions, such as those offered by the Cornucopia project, offered insufficient information to meet the scholars’ needs. But interestingly, this same set of scholars said that item-level descriptions lacked information about the contexts that make these items meaningful and valuable for their research. How can we provide item-level granularity while preserving the rich contexts that these items come from?

    A Solution

    One of the main goals of the current phase of the IMLS DCC project (and particularly for the Collection-Item Metadata Relationships research group) has been to take advantage of collection-level and item-level metadata when used together as mutually supportive forms of description. For the interface group, we’ve been asking ourselves what this might mean in light of our usability studies that suggest the long textual descriptions scare people off.

    What if we could provide users of the system a quick, easy way to get a 10,000 foot view of a collection? From this vantage point, individual items fall back to reveal the larger contours of a collection landscape. What are the high points? Where are there gaps? Does this look like a promising place to dig deeper for the kinds of items that will answer my research questions? What kind of landscape does this item come from? Will this collection lead me to find other things like it?

    When we visit a physical collection all these kinds of information contexts come for free. We know that we’re under the dome of the Library of Congress or foraging in a tightly packed storeroom at the Early American Museum. I can walk down the ranges of my library and count off how many shelves the E 302 Collected Works of American Statesmen takes up. I can gauge how much work it will be to browse through 6 linear feet of archival materials or 600. I know it would take me days, if not weeks to tour the Louvre, but only a few hours to visit my university gallery. In our digital collections it can be hard to tell how vast, how diverse or how cohesive any one collection might be – let alone an aggregation of more than 500.

    In order to do this we’ve borrowed the idea of “information dashboards” that are commonly found in enterprise settings where executives need a high-level overview of underlying processes (see Stephen Few’s book Information Dashboard Design). The Indianapolis Museum of Art was the first to apply this idea in a cultural heritage setting, but like its forebears the IMA dashboard focuses on some of the dynamic processes at work in a museum setting. For the IMLS DCC Collection Dashboard, we’d like to extend this metaphor to represent the key features of a collection in a visualization that is quick and easy to understand.
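As a rough illustration of the kind of numbers such a dashboard might draw on, here is a sketch in Python that reduces item-level records to a few collection-level figures: size, type mix, date range, and decades with no items at all. The record fields (`type`, `year`), the sample values, and the function name are hypothetical and are not the DCC’s actual schema.

```python
from collections import Counter


def collection_profile(items):
    """Reduce item-level records to a handful of collection-level figures."""
    years = [it["year"] for it in items if it.get("year") is not None]
    by_decade = Counter((y // 10) * 10 for y in years)
    # Decades inside the collection's span that hold no items: the "gaps".
    gaps = []
    if years:
        first, last = min(by_decade), max(by_decade)
        gaps = [d for d in range(first, last + 1, 10) if d not in by_decade]
    return {
        "size": len(items),
        "types": dict(Counter(it.get("type", "unknown") for it in items)),
        "date_range": (min(years), max(years)) if years else None,
        "undated": len(items) - len(years),
        "empty_decades": gaps,
    }


# Hypothetical item-level records.
items = [
    {"type": "photograph", "year": 1895},
    {"type": "photograph", "year": 1902},
    {"type": "map", "year": 1931},
    {"type": "photograph", "year": None},
]
profile = collection_profile(items)
print(profile)
```

Each figure maps onto one of the questions above: the type counts show the high points, the empty decades show the gaps, and the undated count hints at how far the item-level metadata can be trusted.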

    Prof. Mike Twidale and I have set up a temporary demonstration space here where our evolving prototypes will be posted. Watch this blog space for more information and for opportunities to participate virtually in the design. We would particularly like feedback and comments from scholars who use historical collections about what high-level collection features are most useful for assessing a collection’s value for your research.

    You are also invited to participate at the following upcoming conference venues:

    Next Post: I’ll talk about the “patchwork prototyping” method we’re using to attack this problem.

    Modelling CDWA Lite as an OWL-DL Ontology

    Thursday, March 26th, 2009

    Oops… after the iSchools 2009 conference, I updated a page on my website that contained my poster “Modelling CDWA Lite as an OWL-DL Ontology” but never posted anything here at Inherent Vice. You can also download the full poster from the IDEALS repository.

    I’ve also just posted the beta version of the OWL file on my website. I do this with some trepidation, since this is probably the first full OWL model that I’ve created from top to bottom. As I note in the paper, the current structure of the CDWA Lite XML schema forces ontology developers to make some choices about how certain parts of the schema are modelled in an ontology.

    This was a useful learning exercise, but I’m not sure if I will take this particular OWL model forward. I had intentionally avoided using the CIDOC-CRM and the improvements suggested by the MuseumDAT project. CDWA and CDWA Lite have enough of a toehold here in the United States, and have shaped other influential standards such as the VRA Core and Cataloging Cultural Objects, that I felt they deserved a fair shake to stand on their own. But some of the problems I encountered in trying to create an OWL model suggest that modeling CDWA using CRM would be a worthwhile next step.

    If you’re working on a similar project I would be interested in hearing from you and would appreciate any comments or feedback on the ontology itself.

    The URI Gap

    Tuesday, October 7th, 2008

    Two weeks ago I attended the 2008 International Conference on Dublin Core and Metadata Applications. Dr. Allen Renear, Karen Wickett and I were there presenting our paper (well, Allen did all the presenting) Collection/Item Metadata Relationships.



    A Semantic Web Layer Cake. Modified from the original at http://semtext.org/2004-02/slides/img4.gif (thanks to Karen for pointing this out!)

    There was a fair amount of Twitter activity during the conference and during Ed Summers’ talk about “LCSH, SKOS and Linked Data” I started an exchange about URIs and their role in the Semantic Web. Actually the increasing “semanticness” of the Dublin Core specifications has been on my mind for a while. When I first started encountering it several years ago it was impenetrable to me as someone whose technical skills were mostly acquired on the job. I’d mastered relational databases and was becoming proficient in XML, but the emergence of the Abstract Model presented more of a challenge. My mind would drift back to the days where I’d be standing in front of thirty or so librarians, archivists and museum professionals at a CDP workshop – how would I explain the Abstract Model to them? And more importantly how would they actually participate in a “semantically” enabled CDP?

    At one point Ed quotes Andy Powell:

    …by treating values as non-literal resources and assigning URIs to them we give ourselves (and others) the hooks on which to hang further descriptions.

    (I’m not going to rehash existing discussions about 1) what is a URI and 2) what are literals and non-literals. Also see Pete Johnston’s “Dublin Core Key Concepts” tutorial slides).

    This idea of replacing literals with non-literals in our metadata is certainly attractive, especially in a robust networked environment. What I haven’t yet heard is what happens when the network is brittle and things start breaking. It seems possible that the neat web of relationships that we’ve identified could quickly start unraveling. This seems especially true in an environment where metadata gets aggregated away from its original creator. Sure, in your shop you may know that you’ve “minted” URIs for new properties or replaced old URIs with new ones, but the metadata that you’ve released into the wild may not know about these changes. In these discussions about replacing literals with non-literals there always seems to be some assumption that the non-literals will a) be globally unique and b) be persistent. As Andy Powell suggested via Twitter, the scenarios where this isn’t true are not a technical failure of the semantic web, but a social/political/commitment failure on the part of the people implementing systems. No doubt this is true, but in my book the people problems are always harder to solve than the technical ones.

    Take the CDP’s aggregation of Dublin Core metadata as an example. When I was there I’d made a private commitment to keep the percentage of bad URLs below 10%. You might think this was easy, but in fact it was quite a lot of work – largely because many of our partners (and their partners) hadn’t bought into the belief that URLs needed to be persistent. Sometimes a simple change on their end that was automatic didn’t make its way to us and required manually updating every record. This problem cascades beyond CDP to the IMLS DCC item-level repository which also now contains records with bad URLs. Even though the DCC repository could potentially revise its records through OAI-PMH, CDP’s OAI data provider disappeared about a year ago when a server was replaced. We now have several layers of social/political/commitment between us and the resource that we are describing or wanting to retrieve.
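For what it’s worth, the measuring side of that 10% commitment is the easy part. The following is a sketch in Python of how a bad-URL percentage could be computed over a set of records; it is not the process CDP actually used, and the function names are my own. It assumes a URL is “bad” when a HEAD request fails or returns an error status, which will misjudge servers that refuse HEAD requests.

```python
import http.client
import urllib.request


def is_alive(url, timeout=10):
    """Return True if a HEAD request for the URL succeeds."""
    try:
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req, timeout=timeout):
            return True
    except (OSError, ValueError, http.client.HTTPException):
        # URLError and HTTPError (raised for 4xx/5xx) are OSError subclasses;
        # ValueError covers malformed URLs.
        return False


def percent_bad(urls, check=is_alive):
    """Percentage of URLs for which the check fails."""
    urls = list(urls)
    if not urls:
        return 0.0
    bad = sum(1 for u in urls if not check(u))
    return 100.0 * bad / len(urls)


# Usage with a stand-in checker, so no network access is needed here.
sample = ["a", "b", "c", "d"]
print(percent_bad(sample, check=lambda u: u != "b"))
```

Passing the checker in as a parameter keeps the arithmetic separate from the network call. Repairing the bad records once you have found them is still the manual, people-dependent part.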

    Several studies have been conducted that show various rates for “linkrot” in URLs, but I have yet to find any references to the expectations/reality of “URI rot.” With millions upon billions of URIs being “minted” (they are the coin of the semantic realm after all), having even a small portion of them fail seems like it could wreak havoc on the neat and tidy graphs that are the basis of the semantic web. This also would seem to be a concern for long-term digital preservation in the case where the services, etc. that you rely on today may have long since disappeared. Recommendations like “coolURIs” help address the technical issues but they don’t seem to address the “people” problem.

    And what of the resource-strapped (as in cash, manpower, etc. not as in “things” being described) cultural heritage institutions? Will they really be able to mint robust and long-lived URIs? Or will they be relegated to the backwaters of the un-semantic web? Just as there has been a gap between institutions that are able to get their collections online and those that are not, we now could have a growing divide between those who are able to provide semantically enhanced metadata and those who are not. Again, a political/social problem as much as it’s a technical one.

    Perhaps “semantifying” metadata could be a new job for metadata aggregators like IMLS DCC. I could imagine a service provider adding a process to their workflow that would append URIs for known controlled vocabulary terms to aggregated records or provide new URIs for things that didn’t have one already. This seems to point towards the top layer of the semantic layer cake – that of trust. Is it necessary to know who has the “authoritative” URI for a resource or property? What are the politics/social issues involved in taking responsibility for URIs for someone else’s “things?” If there are multiple URIs, how do I know that they point towards the same “thing?” Should I mint a new URI for one that has failed?
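To make the first half of that idea concrete, here is a sketch in Python of an aggregator-side step that appends URIs for known controlled vocabulary terms. Everything in it is hypothetical: the lookup table, the `example.org` URIs, the record shape, and the function names. A real service would load its terms from a published vocabulary rather than a hard-coded dictionary.

```python
# Hypothetical lookup table of normalized term -> URI.
VOCAB = {
    "railroads": "http://example.org/vocab/railroads",
    "courthouses": "http://example.org/vocab/courthouses",
}


def normalize(term):
    """Lower-case and collapse whitespace so near-identical literals match."""
    return " ".join(term.lower().split())


def semantify(record, vocab=VOCAB):
    """Return a copy of the record with a URI attached to each known subject."""
    enriched = dict(record)
    subjects = []
    for literal in record.get("subject", []):
        # Keep the original literal next to the URI; unknown terms get None.
        subjects.append({"literal": literal, "uri": vocab.get(normalize(literal))})
    enriched["subject"] = subjects
    return enriched


record = {"title": "Depot, looking east", "subject": ["Railroads ", "Barns"]}
print(semantify(record))
```

Keeping the literal alongside the URI is a deliberate choice in this sketch: if the URI rots, the record still carries a human-readable value. It does nothing to answer the harder questions about who is responsible for the URI.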

    At times I feel like the “Semantic Web” buzz is just swapping in a new technical platform without really addressing the social problems that prevented us from achieving similar goals with older technologies like XML. Jerry McDonough discusses his concerns with regards to XML in his recent Balisage article, “Structural Metadata and the Social Limitation of Interoperability: A Sociotechnical View of XML and Digital Library Standards Development”:

    Like a rope, [XML] is extraordinarily flexible; unfortunately, just as with rope, that flexibility makes it all too easy to hang yourself.

    In the case of the semantic web, I may be less worried about hanging myself and more worried that the rope I’m hanging onto might be cut by someone up above at any time – sending me and my metadata into the abyss. It also seems that addressing some of these concerns could encourage more uptake of semantic web technologies, especially where social/political/financial commitments are required to make it happen. Looking back to the lessons we’ve learned (or have yet to learn) from our experiences with XML, metadata interoperability, and shareability would make me feel more comfortable relying on the “cloud.”

    Papa’s got a brand new bag (of metadata)

    Wednesday, August 23rd, 2006

    This fall will mark another series of changes for me. It will be my first semester as a doctoral student, and I’m pleased to announce that it means a change of job as well. Starting September 1st, I will be working as the graduate assistant for a new IMLS funded grant “Metadata for You & Me: A Training Program for Shareable Metadata.” This grant builds on the experience of the OAI community and the development of the Best Practices for OAI Data Provider Implementations and Shareable Metadata. For me, it will also build on the experience I collected at CDP manually manhandling a varied collection of metadata into Dublin Core and the many hours spent working on the CDP Dublin Core Metadata Best Practices under the guidance of a great working group. This time around I’ll get to work with my good friend Sarah Shreeves and the Inquiring Librarian herself, Jenn Riley. Expect to see some of my thinking here, particularly as it relates to the implications for museums and other cultural heritage organizations!

    Oh, and speaking of shareable metadata – I promise to get Technorati tags working here soon. The “categories” list is a little uncontrolled at the moment and needs some tending.
