OCLC's Linked Data Initiative: Using Schema.org to Make Library Data Relevant on the Web

In June 2012, OCLC announced the next stage of its linked data strategy when it revealed that Schema.org markup had been added to WorldCat.org pages under the Open Data Commons Attribution License (ODC-BY). This markup presents the metadata and holdings for millions of bibliographic items held by tens of thousands of libraries to the large commercial search engines for use in their search indexes and applications.

The Schema.org initiative, made up of Google, Bing, Yahoo!, and Yandex, provides a core ontology that search engines and other web crawlers can use to consume this library data directly. Schema.org represents a cooperative agreement between these major search engines to share a core vocabulary for markup. It helps the search engines normalize the markup of webpages in a way that reduces ambiguity about what the pages describe and makes the integration of the data into search engines more efficient.

OCLC observed this development in the search engine industry and realized that it could be an important tool to more effectively represent the collective collections of libraries on the Open Web. At the same time, OCLC’s internal experiments with linked data were maturing, and the opportunity to combine this new method for exposing data on the Web with the value of library linked data seemed ideal. OCLC enhanced the core bibliographic data exposed on WorldCat.org with Schema.org markup and, following good linked data practice, included Uniform Resource Identifiers (URIs) for as many linkable data elements as possible.

In the complex message exchange between a web user’s search and the content to be delivered (library collections in this case), Schema.org markup provides an ideal tool to mediate that complexity and more efficiently connect end users to the content they desire.
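As a rough illustration of what Schema.org markup with URIs can look like, the following Python sketch renders a small HTML fragment using RDFa Lite attributes (vocab, typeof, property, resource). It is not the markup that WorldCat.org actually emits: the function name book_rdfa, the choice of properties, and the example.org URIs are all assumptions made for this example.

```python
from html import escape


def book_rdfa(title, author_name, author_uri, record_uri):
    """Return an HTML fragment that describes one book with Schema.org RDFa.

    Hypothetical helper for illustration; the property selection is an
    assumption, not the published WorldCat.org markup.
    """
    record = escape(record_uri, quote=True)
    author = escape(author_uri, quote=True)
    return (
        f'<div vocab="http://schema.org/" typeof="Book" resource="{record}">\n'
        f'  <span property="name">{escape(title)}</span>\n'
        # The author is identified by a URI, so a crawler can link it to an
        # authority record instead of matching on the name string.
        f'  <span property="author" typeof="Person" resource="{author}">\n'
        f'    <span property="name">{escape(author_name)}</span>\n'
        f'  </span>\n'
        f'</div>'
    )


# Placeholder identifiers only; real records would carry real URIs.
EXAMPLE_FRAGMENT = book_rdfa(
    "An Example Title",
    "Example, Author",
    "http://example.org/authority/person/1",
    "http://example.org/record/1",
)
```

The point of the sketch is the resource attribute: the human-readable name stays on the page, while the URI gives a machine an unambiguous identifier to follow.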

How Did We Get Here?

OCLC’s interest in providing structured data suitable for wide consumption goes back to the Dublin Core initiative in 1995, when OCLC hosted a meeting of international experts at its headquarters in Dublin, Ohio, to develop a core vocabulary for the description of resources. In 1997, OCLC joined the W3C, and staff in OCLC Research became active participants in the subsequent discussions on how best to represent library data on the Web. In the late 2000s, OCLC began to explore the benefits of exposing library linked data through a series of experimental releases. In 2009, OCLC released the top three levels of Dewey as linked data through Dewey.info. Also in 2009, the Virtual International Authority File (VIAF) was released as linked data. That release was improved in 2010 with a Friend of a Friend (FOAF) model release of VIAF.

The release of VIAF as linked data represented a powerful opportunity to provide durable and authoritative data about authors and titles on the Web in a way that encourages linking to library resources. In 2011, OCLC released the Faceted Application of Subject Terminology (FAST) data as linked data to provide a controlled subject vocabulary to the linked data environment. More recently, in 2012, the evolution of Dewey.info moved forward significantly with the release of all levels and captions of the Dewey controlled subject vocabulary.

Point-of-need access for web users drove OCLC’s introduction of WorldCat.org in 2006. By surfacing the collective collection of the world’s libraries and working with partners like Google, Bing, and Yahoo!, OCLC has placed rich library content in the regular workflows of millions of web users. Given the variety of linked data work at OCLC and the goals of WorldCat.org, the Schema.org effort offered a great opportunity for a webscale exercise to bring it all together.

To further improve the representation of library data on the Web, OCLC is working with the Schema.org community to develop and add a set of vocabulary extensions to WorldCat data. Schema.org and library-specific extensions will provide a valuable two-way bridge between the library community and the consumer web.

The Technical Process (The Nuts and Bolts)

Meaningful Schema.org-derived linked data was added to WorldCat.org in three phases.

  1. The OCLC linked data team focused on data modeling necessary to connect existing experimental linked data projects (e.g., VIAF, FAST, LC Authorities, and Dewey) to the Schema.org base vocabulary and created an initial library extension to the vocabulary.
  2. The team experimented with various data models and approaches to apply descriptive, linked data decoration to the bibliographic content on WorldCat.org. The data-intensive nature of this iterative process required technology that could handle the variety and volume of data along with the repeated cycle of setting models, running them against the data, reviewing the results, and adapting the models. These requirements for rapid iteration were addressed by the use of the Apache Hadoop software framework, which shortened the data-loading time for hundreds of millions of records from weeks to minutes.
  3. The final stage required the updating and displaying of linked data-decorated WorldCat.org records on the production site for use by web users, partners, and harvesters. The WorldCat.org site is optimized for high-traffic, high-performance use by partners and end users, a critical factor given that these kinds of significant updates result in an increase in harvesting activities and use. The approach used in generating and adding the linked data allows for regular updates to the decoration without significant timing or technical challenges. The markup is likely to evolve over the coming months, so this release should be considered experimental and subject to change.
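The cycle described in the second phase (set a model, run it against the data, review the results, adapt the model) can be sketched in a few lines of Python. This is only a single-machine illustration under stated assumptions: it does not use Hadoop and is not OCLC's code. The flat record layout, the field names, and the helpers apply_model and coverage are hypothetical.

```python
def apply_model(record, model):
    """Map one flat bibliographic record to Schema.org-style properties.

    `model` is a hypothetical structure: a target type plus a mapping
    from source field names to Schema.org property names.
    """
    decorated = {"@type": model["type"]}
    for source_field, schema_property in model["fields"].items():
        value = record.get(source_field)
        if value:  # skip fields that are missing or empty in this record
            decorated[schema_property] = value
    return decorated


def coverage(records, model):
    """Return, per mapped property, the share of records that populated it.

    This stands in for the "review the results" step: a low share for a
    property suggests the model or the source data needs another look.
    """
    counts = {prop: 0 for prop in model["fields"].values()}
    total = 0
    for record in records:
        total += 1
        decorated = apply_model(record, model)
        for prop in counts:
            if prop in decorated:
                counts[prop] += 1
    return {prop: (n / total if total else 0.0) for prop, n in counts.items()}


# Illustrative model and records; the field names and URIs are placeholders.
SAMPLE_MODEL = {"type": "Book", "fields": {"title": "name", "creator_uri": "author"}}
SAMPLE_RECORDS = [
    {"title": "First Example", "creator_uri": "http://example.org/authority/person/1"},
    {"title": "Second Example"},
]
SAMPLE_COVERAGE = coverage(SAMPLE_RECORDS, SAMPLE_MODEL)
```

Because apply_model works on one record at a time and keeps no state between records, the same function could in principle serve as the per-record step of a distributed batch job, which is the property that makes fast reruns over a very large dataset practical.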

Making Library Data Relevant on the Web

The Schema.org activity and associated vocabularies offer a clear middle ground between rich, highly diverse domains on the Web that otherwise share no common context. Web intermediaries like Google, Bing, and Yahoo! focus on interpreting web users’ needs and connecting them to the most appropriate web resources. Using linked data and the Schema.org vocabularies as a starting point, rich domains like libraries, retailers, publishers, governments, and scientists can surface in this webscale interpretation with more context and clearer intent.

Webscale means three things in this exercise for OCLC:

  1. A Large Volume of Data
    Rather than experiment with a subset, apply the markup decoration to more than 250 million bibliographic items in WorldCat.org. The initial decoration included Virtual International Authority File (VIAF), Faceted Application of Subject Terminology (FAST), Library of Congress Authorities, and Dewey.
  2. Quick, Large-scale Iterations
    The design of the technical and data infrastructures allows for quick iterations and updates to the data. No one “right” way to do this markup exists; therefore, additional vocabularies will be identified, better expressions clarified, and more meaningful connections made over time. OCLC’s data infrastructure and the architecture of WorldCat.org both support this kind of iteration and exposure for the large dataset and associated decorations.
  3. Ongoing, Open Discussions & Community-Based Learning
    Participation in the Schema.org work reflects contemporary web expectations of trying many things to get to a better place overall. Interested parties can learn more about engaging with OCLC and look at the proposed library vocabulary extension at the Linked Data at OCLC webpage. See relevant links below.

The Future

Like all experiments, this project is the basis for further iterations of work that, guided by results, will enhance linked data capabilities in WorldCat data. This work falls into several categories.

Vocabulary:

As previously discussed, the exposure of WorldCat data was approached from the viewpoint of the consumer not familiar with libraries. For a search engine company or a general web consumer, the vocabulary most likely to be generally accepted is Schema.org (its markup is already found on some seven percent of pages crawled by Google and Bing). However, as recognized by the development of a library ontology to supplement Schema.org markup for WorldCat data, any vocabulary will need to be extended to supply the details it lacks.

The Library Ontology:

The library ontology is designed as a conversation starter for a recommended extension to Schema.org, not as a complete ontology. This conversation should be encouraged and pursued with other organizations and individuals in the library and Semantic Web domains. If a consensus can be formed around this proposal, there is a good chance that the search engine-backed group behind Schema.org will accept it. If accepted, it will benefit everyone on the Web by providing structured library data. Libraries will benefit by being able to more broadly share information about their resources.

Access to Data:

RDFa embedded in HTML is only one way of providing access to WorldCat as linked data. Another option under investigation is content negotiation, which would deliver the same data as RDF in formats such as JSON, RDF/XML, and Turtle. Scraping the RDF from the content of a WorldCat webpage, although powerful, is not the ideal access method for all circumstances. In the coming months, OCLC will explore the best ways to provide access to this data, including talking to potential data consumers and identifying services that can be provided around it.
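To make the idea of content negotiation concrete, here is a minimal Python sketch of the server-side decision: read the client's HTTP Accept header and pick the serialization it prefers. It is not a description of any WorldCat service. The list of supported media types, the function names, and the choice to fall back to HTML when nothing matches are assumptions for illustration; the sketch also ignores partial wildcards such as text/*.

```python
# Media types this hypothetical service can produce.
SUPPORTED = ("text/html", "application/rdf+xml", "text/turtle", "application/json")


def parse_accept(header):
    """Split an Accept header into (media_type, q) pairs, in header order."""
    prefs = []
    for part in header.split(","):
        pieces = [p.strip() for p in part.split(";")]
        media_type = pieces[0].lower()
        if not media_type:
            continue
        q = 1.0  # a media type without an explicit q-value has weight 1.0
        for param in pieces[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed weight as "not acceptable"
        prefs.append((media_type, q))
    return prefs


def choose_format(header, supported=SUPPORTED, default="text/html"):
    """Return the supported media type with the highest client preference.

    Falls back to `default` when the header is empty or names nothing we
    support; a stricter service could answer 406 Not Acceptable instead.
    """
    best, best_q = default, 0.0
    for media_type, q in parse_accept(header or ""):
        candidate = default if media_type == "*/*" else media_type
        if candidate in supported and q > best_q:
            best, best_q = candidate, q
    return best
```

With this sketch, a browser sending a typical HTML-first header would receive the page with embedded RDFa, while a client that asks for text/turtle would receive the same description as Turtle from the same URI.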

Productization:

By definition this experiment is designed to improve the core part of the production service that looks after WorldCat. As the ways of describing and providing access to WorldCat linked data evolve, OCLC will work to enhance its infrastructure so that they are implemented as part of normal processes. With the size and uses of WorldCat, and the number of processes in place to add and maintain data within it, this is not an insignificant task, but the potential benefits make it one worth undertaking. As things iterate from this experiment, there will be as much work behind the scenes as will be visible on the surface.

A Linked View of the World:

There have been implicit linkages held in WorldCat data for years, as demonstrated by links to VIAF, FAST, Dewey, Library of Congress and other authoritative resources that have been surfaced by this experiment. Making these links explicit, identifiable, and accessible will open up potential for new services and new ways of thinking about the process of creating, managing, and sharing data. Work is underway to identify other links that could be exposed to more authoritative sources. Suggestions for more links and ways to map to them are encouraged.

As other contributions in this ISQ issue indicate, linked data is entering the vocabulary, and practice, of many in the metadata community. This experiment represents a strong commitment from OCLC toward the debate around the best ways forward and the potential benefits of linked data. Join the conversation, as we iterate forward from this initial experimental step. If you have questions or comments about what we have done or might do next, please contact us at data@oclc.org.

Ted Fons (fonst@oclc.org) is Executive Director, Data Services & WorldCat Quality at OCLC.

Jeff Penka (penkaj@oclc.org) is Director/Global Product Manager, QuestionPoint Services at OCLC.

Richard Wallis (richard.wallis@oclc.org) is Technology Evangelist in OCLC’s Birmingham, UK office.

Footnotes

Apache™ Hadoop™
hadoop.apache.org/

dewey.info
http://dewey.info/

Experimental “library” extension vocabulary for use with schema.org
purl.org/library/

Faceted Application of Subject Terminology (FAST)
www.oclc.org/research/activities/fast/

Library of Congress Authorities
authorities.loc.gov/

Linked Data at OCLC
www.oclc.org/data.html

Open Data Commons License (ODC-BY)
opendatacommons.org/licenses/by/

schema.org
http://schema.org/

Virtual International Authority File (VIAF)
www.oclc.org/viaf/

W3C Semantic Web Linked Data Specifications
www.w3.org/standards/semanticweb/data

WorldCat®
www.worldcat.org