Editor's note: The charts that accompanied the print version of this article are available in the PDF version (see sidebar).
Recently there has been a shift in popular approaches to large-scale metadata management and interoperability. Approaches rooted in Semantic Web technologies, particularly in the Resource Description Framework (RDF) and related data modeling efforts, are gaining favor.
In the library community, this trend has accelerated since the World Wide Web Consortium (W3C) re-framed many of the Semantic Web’s enabling technologies in terms of Linked Open Data (LOD), a lightweight practice of using web-friendly identifiers, explicit domain models, and related ontologies to design graph-based metadata. Since that shift, the library metadata community has become an increasingly major contributor to the “global graph” of linked data. The emergence of linked data for libraries began with the Library of Congress publication of LCSH (Library of Congress Subject Headings) in SKOS (Simple Knowledge Organization System) and the Swedish National Library’s publication of the LIBRIS Union Catalog as linked data. Since then, major publishing efforts have come from the German and French national libraries, the British Library, and initiatives like Europeana, which include museum and archival data as well as data from libraries. Already, the summer of 2012 has seen OCLC launch major linked data initiatives and the Library of Congress begin work on a Bibliographic Framework Transition Initiative based on linked data.
As more and more RDF-based metadata become available, a lack of established best practices for vocabulary development and management in a Semantic Web world is leading to a certain level of vocabulary chaos. The situation is aggravated by a dearth of tools for discovering and selecting existing vocabularies. This “embarrassment of riches” could be viewed as troubling proliferation or as welcome activity expanding the availability of viable approaches to description. Either way, strategies for vocabulary publishing, discovery, evaluation, and mapping have the potential to change the conversation significantly.
For the purpose of this article, “vocabulary” refers to metadata element set vocabularies (ontologies): collections of classes and properties used to describe resources in a particular domain. While many of the infrastructure components are also relevant to the management of “value vocabularies” (also called controlled vocabularies), the examples herein will be about metadata element sets. Such clarification is necessary to establish basic contexts for data expressed in the one-size-fits-all simplicity of RDF. Metadata element sets and value vocabularies, along with datasets, are contexts recently defined and scoped for archive, library, and museum linked data.
Until recently, vocabularies were considered to be tied tightly to particular domains and applications. In the library world, most vocabulary development was in the context of MARC 21, and similar development trajectories occurred within other domains of practice. The first public glimmer of a less siloed approach appeared in 2000, when Heery and Patel published their seminal article on Application Profiles, a notion taken up with enthusiasm by the DC (Dublin Core) Community. The idea that vocabularies could be “mixed and matched” to improve both usefulness and interoperability was a potent one, and from that idea grew greater interest in what might be “out there” that could be reused without additional vocabulary proliferation, or the overhead of vocabulary development by every project or domain.
Even before that article, as early as 1999, metadata practitioners had begun to experiment with the idea of Application Profiles. For those innovators, the need for an infrastructure to manage discovery of and documentation for the various schemas from which terms are drawn became very clear. Early examples of work in this area include the UKOLN DESIRE Metadata Registry, the European Commission funded Schemas Project, and its successor CORES.
These tools became known as registries, and in 2002, the Dublin Core Metadata Initiative (DCMI) launched its own Metadata Registry. According to Heery and Wagner (the DCMI registry’s initial developers):
Metadata schema registries are, in effect, databases of schemas that can trace an historical line back to shared data dictionaries and the registration process encouraged by the ISO/IEC 11179 community.
This work has inspired a number of other registries, including the Open Metadata Registry (OMR); the current version of the DCMI Registry, which has provided the basis for a national Japanese Metadata Infrastructure Registry; and the JISC Information Environment Metadata Schema Registry.
The OMR, currently among the most active of this group, began as the NSDL Registry, a National Science Foundation-funded project within the U.S. National Science Digital Library (NSDL) program. It was built as a free, open service, and among its most important functions is the ability to provide detailed versioning of changes at every level. It has been used extensively in the library community, now hosting the vocabularies of RDA (Resource Description and Access), ISBD (International Standard Bibliographic Description), and the FR family of models (Functional Requirements for Bibliographic Records/Authority Data/Subject Authority Data) developed by IFLA (International Federation of Library Associations and Institutions), as well as the experimental version of MARC 21 in RDF discussed below. The OMR is now engaged in a significant redevelopment effort, focused on vocabulary mapping.
The DCMI Registry Community, established in 1999, became a central place for the discussion of the development, management, and functional requirements for metadata registries. In 2009, UKOLN, working with the DCMI Registry Community, produced a survey of Metadata Registry users and owners to identify current practice of the systems and functional requirements for vocabulary management and inter-registry interoperability. The survey, still unpublished, was completed by 12 registry owners, including most of the major active registries above, 10 self-identified application developers looking to programmatically consume registry content, and a number of other end users, for a total of 35 respondents.
Discrepancies between end users’ needs and system functionality were seen in responses relating to types of content registered, services provided, and the data formats and methodologies used for access to content.
The chart in Figure 1, with application developers marked in dark blue and labeled “yes,” shows a clear desire for machine-readable, API-based access to version history. Contrasted with Figure 2, showing that over half of the registries had no version control or did not expose that information to users, the discrepancy between the needs of registry users and the state of registry software development is evident.
The results showed that the focus of registries was becoming less about discovery of relevant vocabulary terms for mixing and matching, and more about infrastructure for managing those vocabularies, vocabulary version control, and mapping between vocabularies. Bill de hÓra, in a 2007 blog post, stated the issues succinctly:
There are two schools of thought on vocabulary design. The first says you should always reuse terms from existing vocabularies if you have them. The second says you should always create your own terms when given the chance.
The problem with the first is your [sic] are beholden to someone else’s sensibilities should they change the meaning of terms from under you (if you think the meaning of terms are fixed, there are safer games for you to play than vocabulary design). The problem with the second is term proliferation, which leads to a requirement for data integration between systems (if you think defining the meaning of terms is not coveted, there are again safer games for you to play than vocabulary design).
What’s good about the first approach is macroscopic–there are less terms on the whole. What’s good about the second approach is microscopic–terms have local stability and coherency. Both of these approaches are wrong insofar as neither represents a complete solution. They also transcend technology issues, such as arguments over RDF versus XML. And at differing rates, they will produce a need to integrate vocabularies.
Bibliographic Standards Communities
IFLA and JSC/COP (Joint Steering Committee for Development of RDA and Co-Publishers) are using the OMR to develop and administer RDF namespaces representing de-facto international bibliographic standards. These include the FR family, ISBD, and RDA. While technical advice and support for all of these namespaces has been provided by a small team, which includes three of the authors of this paper, the development of each set of namespaces has been largely autonomous between the standards’ management infrastructure. This has identified a range of management issues to be considered.
RDA was the first of these standards to use a registry, to meet the goals of the DCMI/RDA Task Group. The development of element sets and value vocabularies for RDA has taken place in an open environment, with benefits for maintainers and consumers. Version control has allowed the long development path to be monitored by external applications. The RDA namespace was created in 2008; as of July 2012 the element sets and many of the value vocabularies remain in a mutable state. Yet the visibility of status and development history has allowed experimental applications—such as those discussed below—to use RDA classes and properties in appropriate contexts. Access control allows multiple agents to work at their own pace and to develop flexible agendas for tasks such as language translations and synchronization with other documentation. Progress of, and feedback on, such work is easily monitored by colleagues and other interested parties.
The development of the RDA namespace immediately stimulated the IFLA communities to consider the potential use of their own standards in the Semantic Web, as RDA is based on the FR family. The FR element sets have followed the same development sequence as the standards, and the semantic analysis involved is informing a current process of consolidation into a single model. ISBD is developing a DC Application Profile to state requirements for a well-formed ISBD record, including mandatory and repeatable status of elements, aggregations of elements into higher-level statements, and sources of value vocabularies. IFLA is also considering best practices for the translation of its element sets and value vocabularies, as it operates in a multilingual environment and recognizes seven official languages for its activities. Parts of the ISBD and FR family namespaces have been translated from English into Spanish and Croatian; translations of the underlying documentation are available in multiple languages, which might eventually be applied to the namespaces.
Reuse of RDA elements was rejected because the natural flow is to refine the application from the model. In turn, ISBD did not reuse FR elements because there was, and remains, no complete agreement on the semantic relationship between the two standards. A discussion on unconstrained namespaces for mapping between IFLA and other community metadata element sets is emerging, stimulated by work on alignment of ISBD and RDA elements to improve interoperability.
This formalized and more comprehensive approach to bibliographic data is a marked contrast to earlier efforts to reuse more domain-neutral vocabularies, such as Dublin Core, Bibliographic Ontology (BIBO), and Friend of a Friend (FOAF), in many of the European national libraries’ efforts to publish RDF representations of catalog data. Though early efforts at publishing linked library data varied in the complexity of their data model, all relied heavily on reuse of vocabularies already in wide use on the Web. Some, such as LIBRIS’s trailblazing efforts, the British Library, and Cambridge University, applied existing vocabularies like BIBO and FOAF. Such projects often feature simple modeling of a few FRBR classes; associated entities representing agency, such as authorship and publication; and other entities representing aboutness, including people, places, time-periods, and topics. Others, such as the British Library’s efforts, were heavily specified, with classes for information related to series, subjects, publication events, and agents. The German National Library reused DC, FOAF and SKOS along with the RDA Vocabularies described above.
The Cambridge Open METadata (COMET) project, in particular, set another powerful precedent toward best practice by making all of their conversion utilities, tools, code and processes available under an open source license. There is a tremendous amount of value to all of these approaches. Both the comprehensive efforts to model the rich depth of MARC 21, RDA, and ISBD and the more selective exposure of key information from that data using more common web vocabularies are important aspects of current experimentation in linked bibliographic data.
This is evidence, indeed, of the shifting balances of the macroscopic and microscopic approaches discussed by de hÓra. This has set the stage for a shift of focus in registries to the management of maps and mappings, as well as application profiles.
The Case for Mapping
The mapping of a semantic relationship between one RDF property and another RDF property or class can be associated with an inference rule that enables the processing of data expressed using the origin property. Processing results in the generation of a new RDF statement that can be used in the environment of the target property or class. Best practice results in many bibliographic schema attributes and relationships being expressed as RDF properties that can be included in a map (a set of mappings) as an RDF graph or ontology.
Figure 3 shows an RDF graph that maps properties with overlapping semantics for the concept “extent of a bibliographic resource.” The properties are taken from the namespaces of Bibliographic Ontology (bibo), Dublin Core terms (dct), FRBR entity-relationship model (frbrer), ISBD, MARC 21, RDA, and a proposed community-shared high-level “commons.” All links in the graph are the RDF Schema property rdfs:subPropertyOf, indicating a broadening of meaning in the direction of the arrow. Data using any of these namespace properties can be propagated in that direction, losing detail but preserving coherency in a “dumb down” process that provides interoperability from local to global levels.
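The “dumb down” propagation described above corresponds to the RDF Schema entailment rule for rdfs:subPropertyOf: if property p is a subproperty of q, any statement made with p also holds with q. The following is a minimal sketch in plain Python, not the mapping in Figure 3; the prefixed property names (ex-marc:extent and so on) and the chain between them are hypothetical placeholders chosen for illustration.

```python
# Hypothetical map: each property points to its broader superproperties.
# These names are placeholders, not the actual links shown in Figure 3.
SUB_PROPERTY_OF = {
    "ex-marc:extent": ["ex-isbd:extent"],
    "ex-isbd:extent": ["ex-dct:extent"],
    "ex-rda:extent": ["ex-dct:extent"],
    "ex-dct:extent": ["ex-commons:extent"],
}


def dumb_down(triples, sub_property_of):
    """Return the triples plus every statement entailed by rdfs:subPropertyOf."""
    inferred = set(triples)
    queue = list(inferred)
    while queue:
        subject, prop, obj = queue.pop()
        for broader in sub_property_of.get(prop, ()):
            candidate = (subject, broader, obj)
            if candidate not in inferred:  # also guards against cycles in the map
                inferred.add(candidate)
                queue.append(candidate)
    return inferred


# One locally detailed statement is propagated up to the shared level.
local_data = {("ex:book1", "ex-marc:extent", "xii, 240 p.")}
for triple in sorted(dumb_down(local_data, SUB_PROPERTY_OF)):
    print(triple)
```

The value is copied unchanged at each step; only the property becomes more general, which is why detail is lost while coherency is preserved.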
Similar RDF graphs can be constructed for value vocabularies using the SKOS property skos:broader. It is a trivial technical task to incorporate vocabularies into such maps, although the information and expertise required to determine the target of each mapping should not be underestimated.
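For value vocabularies, the equivalent roll-up follows chains of skos:broader links. Note that SKOS does not declare skos:broader itself to be transitive; the closure computed here corresponds to skos:broaderTransitive. The sketch below assumes a hypothetical three-level vocabulary whose concept names are invented for illustration.

```python
# Hypothetical value vocabulary: concept -> directly broader concepts.
BROADER = {
    "ex-local:hardback": ["ex-shared:volume"],
    "ex-shared:volume": ["ex-commons:printedCarrier"],
    "ex-commons:printedCarrier": [],
}


def broader_transitive(concept, broader):
    """Collect every ancestor reachable through chains of skos:broader links."""
    ancestors = []
    stack = list(broader.get(concept, ()))
    while stack:
        current = stack.pop()
        if current not in ancestors:  # skip concepts already visited
            ancestors.append(current)
            stack.extend(broader.get(current, ()))
    return ancestors


print(broader_transitive("ex-local:hardback", BROADER))
```

The code is short, which matches the point made above: the hard part is deciding what each concept should map to, not walking the resulting graph.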
Figure 4 shows a suggested map for a single property from the marc21rdf.info vocabulary and equivalent properties in the oclc:library, ISBD, and RDA (free) vocabularies, showing the domain and range of each. The MARC 21 vocabulary is intended to provide a completely lossless semantic mapping from MARC 21 to RDF. The URIs for each individual property have a consistent construction of [tag][indicator 1][indicator 2][subfield] and are designed to be programmatically constructed in order to support efficient machine-transcription. The vocabulary is specifically designed to support mapping to related bibliographic vocabularies such as ISBD, FRBRer, and RDA as well as ongoing progressive enhancement.
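The [tag][indicator 1][indicator 2][subfield] construction lends itself to a small helper function. The sketch below is illustrative only: the base URI is a placeholder, and the use of an underscore for a blank indicator and the restriction to three-digit tags are assumptions of this example rather than documented features of the vocabulary.

```python
def marc21_property_uri(tag, ind1, ind2, subfield,
                        base="http://example.org/marc21/"):
    """Build a property URI as [tag][indicator 1][indicator 2][subfield].

    The base URI and the underscore for blank indicators are assumptions
    made for this sketch, not the published URI pattern.
    """
    if len(tag) != 3 or not tag.isdigit():
        raise ValueError(f"expected a three-digit MARC tag, got {tag!r}")

    def normalize(indicator):
        # Blank indicators are commonly written as a space or '#'.
        if indicator in ("", " ", "#"):
            return "_"
        if len(indicator) != 1 or not indicator.isalnum():
            raise ValueError(f"invalid indicator: {indicator!r}")
        return indicator

    if len(subfield) != 1 or not subfield.isalnum():
        raise ValueError(f"invalid subfield code: {subfield!r}")

    return f"{base}{tag}{normalize(ind1)}{normalize(ind2)}{subfield}"


# Field 260, both indicators blank, subfield a (place of publication).
print(marc21_property_uri("260", " ", " ", "a"))
```

Because the URI is derived entirely from the field's own coordinates, a converter can emit properties for any data field without a lookup table.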
Note that “natural” mappings to FRBRer and RDA in this map have been removed because of the incorrect inference that the resource is a “Manifestation”. The application of multiple inference rules from a complex graph can result in semantic incoherence.
Figure 5 shows a pseudo-RDF representation of the additional metadata entailed (inferred) by the use of a single “Place of Publication” property describing an OCLC bibliographic resource and the multiple inference of its “type”, using the map in Figure 4. Note the refinement and increased accuracy of the description of “Place” provided by the oclc:library mapping to the original MARC 21 property. An added Google Maps URI for the actual location provides an additional enhancement.
Many expressions of MARC 21 in RDF have made the natural decision to optimize and harmonize the mapping from the necessarily complex MARC 21 syntax, with its need to express values as literal strings, to a more resource-oriented RDF, focusing on simpler descriptions of related resources as first-class entities in their own right. This is the approach taken by the British Library, LIBRIS, and other projects described earlier. While there is significant value in this optimization, there is much to be gained by also providing the original values mapped to their direct RDF equivalent. Figure 5 illustrates the value of a detailed expression of the complete MARC 21 semantics in the marc21rdf.info vocabulary: bidirectional semantic equivalencies and subclasses can be expressed based on simple low-level mappings between semantically equivalent properties. As this example shows, by mapping at the lowest lexical level between vocabularies designed and maintained by different communities of practice, an enhancement to one can easily become an enhancement to all. Figure 5 also shows the potential for unnecessary and perhaps inaccurate entailments caused by the assignment of a too-restrictive domain. The RDA (free) vocabulary is a domain-free version of the more restrictive RDA vocabularies that was created to be used to minimize these inaccuracies when necessary.
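The effect of a too-restrictive domain can be reproduced with two RDF Schema entailment rules: subPropertyOf propagation, and the rdfs:domain rule that assigns a type to the subject of any statement using a property. The sketch below uses hypothetical property and class names (it does not reproduce the map in Figure 4) to contrast a mapping to a domain-constrained property with a mapping to a domain-free one.

```python
RDF_TYPE = "rdf:type"

# Hypothetical names throughout; only the constrained property has a domain.
DOMAINS = {"ex-rda:placeOfPublication": "ex-rda:Manifestation"}
CONSTRAINED_MAP = {"ex-marc:placeOfPublication": ["ex-rda:placeOfPublication"]}
UNCONSTRAINED_MAP = {"ex-marc:placeOfPublication": ["ex-rdafree:placeOfPublication"]}


def entail(triples, sub_property_of, domains):
    """Apply the subPropertyOf and domain rules until nothing new is inferred."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        for subject, prop, obj in list(inferred):
            new = {(subject, broader, obj) for broader in sub_property_of.get(prop, ())}
            if prop in domains:
                # Domain rule: using the property types the subject.
                new.add((subject, RDF_TYPE, domains[prop]))
            if not new <= inferred:
                inferred |= new
                changed = True
    return inferred


record = {("ex:rec1", "ex-marc:placeOfPublication", "Edinburgh")}
manifestation = ("ex:rec1", RDF_TYPE, "ex-rda:Manifestation")

print(manifestation in entail(record, CONSTRAINED_MAP, DOMAINS))    # True
print(manifestation in entail(record, UNCONSTRAINED_MAP, DOMAINS))  # False
```

In the first case the record is inferred to be a Manifestation purely as a side effect of the mapping; the domain-free target carries the same value without that claim.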
The Role of DCMI
During the keynote for the Dublin Core 2010 meeting in Pittsburgh, Michael Bergman prompted a change in the conversation for many members of the registry community. Though registries were deemed still important, the focus shifted to their part in the general infrastructure for the management of vocabularies. Bergman’s main point was to highlight an opportunity for the DCMI: given the fact that vocabulary proliferation was showing no signs of abating, he saw an emerging need for vocabulary alignment, co-referencing, and interoperability. This focus on “alignment” can be seen as somewhat analogous to the established practice of developing crosswalks between record-based (usually XML) metadata structures. Vocabulary alignment, in contrast, identifies equivalencies and other kinds of relationships between individual metadata elements to help enable the application of those properties outside the context of their source vocabularies.
However, as the notion of an open linked data environment expands, the situation we’re facing is much more complex than it looks initially. As Dunsire, et al. note:
The meaning of “mapping” changes radically on moving from a database and record based approach to an open, multi-domain, global, shared environment based on linked data technologies—where anybody can say anything about any topic, validity constraints are not acknowledged, a nearly infinite number of properties can be defined to describe an infinite number of entities, and authority is multi-dimensional and often ephemeral. The classic approach to such apparent chaos is to attempt increased control, increased filtering, increased restrictions, and limited access. This approach hinders appreciation of the broad diversity of perspective that comes with a world of open data.
Following up on the DC-2010 conversations sparked by Bergman, DCMI held a special pre-conference session at DC-2011 in The Hague to identify the vocabulary management and alignment issues bedeviling the implementer communities associated with DCMI and see where DCMI could support efforts to come to grips with these issues. The result was the chartering of the DCMI Vocabulary Management Community charged with identifying issues of best practice and intelligent implementation that could lead to better interoperability and harmonization across institutions, projects, and language communities.
The issues surfaced in the discussion at that session revolved around the practical problems of finding, evaluating, and using vocabularies. A strong thread of concern about vocabulary quality and preservation underpinned the entire session—and has continued. The session conversations were intensely practical, and the questions that arose in them continue to reverberate within the Community as the group sets priorities and begins a more virtual stage of activity. The three focus areas at this point are planning for best practice guidelines around vocabulary evaluation, selection, and reuse; examining more closely the issues around vocabulary sustainability and preservation (including discussion of possible roles for DCMI); and the development of a set of best practices for principled extension of vocabularies.
A common interest in multi-lingual vocabularies also surfaced at the meeting, and conversations about available standards and tools for developing and managing vocabularies in many languages provided evidence of a strong interest in these issues. Though not surprising in an international group, this focus area will continue to be on the radar of the Vocabulary Management Community.
Significant contributions to those conversations in The Hague were made by Bernard Vatant of the Linked Open Vocabularies (LOV) Project. Bernard and his team have been collecting information on extant property vocabularies and exploring the relationships between them, such as whether one is based on another, or extends, generalizes, or has declared equivalences with other vocabularies. This overview of the landscape, and the excellent visualization tools provided on the site, provide significant value for implementers building related services and views, as well as to the community at large identifying vocabularies at risk. The LOV project has used its research to provide recommendations for describing vocabularies so that they can be connected at the top level and viewed in relation to the larger vocabulary environment.
Bernard also brought forward an initial proposal for mappings between DC properties and the schema.org vocabulary that had been announced a few months beforehand. An impromptu breakout session reviewed the first draft of those mappings and proposed a DCMI task group to flesh them out and gather feedback. That group is currently managing a prototype set of mappings using a GitHub-based project repository.
Discussion and Conclusions
Though the efforts described here represent well over a decade’s worth of evolving thinking and practice, there’s still a great deal to do before the vocabulary infrastructure supporting the ever-emerging Semantic Web matures sufficiently to definitively prove its worth. In the absence of top-down agreements and development planning (such absence being a “feature” of the Semantic Web in general), much of this trajectory will, of necessity, look somewhat chaotic. But given the sheer number of new and continuing efforts to expose linked data—particularly bibliographic data—the inspiration to redouble the push for supporting infrastructure that can effectively manage this chaos can’t be denied.
For example, during an update session on the Library of Congress’s Bibliographic Framework Transition Initiative, Eric Miller of Zepheira noted that there are now a number of projects that publish linked bibliographic data. He also noted that each of these is developing its own approach to modeling and vocabulary selection in its data, a common practice in other early attempts to apply linked data. Recognizing that an important design feature of RDF is that metadata vocabularies are easy to define, are (optimally) self-describing to enhance interoperability, and can be used recombinantly (drawing from a variety of vocabularies in a single resource description), a relatively clear upgrade path to improvement of that data can be seen as part of the benefit of the infrastructure in the process of development.
The wide ranging conversations at the DCMI special session in The Hague remind us that interoperability and the efficiencies of common approaches require guiding principles and best practices around decisions for reuse, extension of existing vocabularies, as well as development of new vocabularies. Without cooperative efforts to develop those supportive pieces, good decisions are difficult to make, much less implement.
The role and functionality of metadata registries in the linked data infrastructure remain in flux. The requirements for macroscopic and microscopic approaches jostle for development priority, although support for vocabulary mapping functions allows a “have your cake and eat it too” balance to be maintained by ensuring that the output from both approaches is interoperable. Maps available from open registries extend the LOD environment by bringing what would otherwise be exclusively “local” vocabularies and mappings into the open domain.
It may well be that the growing interest in mapping and alignment, rather than the earlier misplaced concern around vocabulary proliferation, will fuel an important new push towards principled vocabulary practices. It’s almost impossible to imagine useful semantic mapping without well-defined, sustainable vocabularies; with those in place, we have the potential to move forward without impediment, leaving no parts of the community behind.
Gordon Dunsire (firstname.lastname@example.org) is a freelance consultant.
Corey Harper (email@example.com) is Metadata Services Librarian at New York University.
Diane Hillmann (firstname.lastname@example.org) is a Partner with Metadata Management Associates.
Jon Phipps (email@example.com) is a Partner with Metadata Management Associates.