Europeana is the European Union’s flagship digital cultural heritage initiative. The Europeana portal, launched in November 2008, showcases the possibility of cross-cultural domain interoperability on a pan-European level. To date, metadata and thumbnails for over 23 million objects have been aggregated from over 1500 providers in the library, archive, museum, and audiovisual domains. The portal offers simple and advanced search as well as browsing by various parameters, and users can link from the representations of the objects held in the portal to the source objects held at the provider institutions.
From the outset Europeana was conceived as more than just a huge data repository fronted by a portal application. It was hoped that cultural heritage communities were ready to think outside the traditional information silos and adopt a linked data paradigm that would enable the development of shared semantic context. Linked Data is a data publishing technique that uses common web technologies to connect related data and make them accessible on the Web. Linked Open Data implies that reuse restrictions have been removed from the metadata. Moving to such a model may mean that in future the portal is seen as the reference application of Europeana but that its main function is that of a rich data service that allows third parties to take the data freely and reuse it to create new knowledge and applications.
This is an ambitious goal, requiring a change of perspective on the part of the guardians of cultural heritage resources: “This mentality shift is a big leap, since it requires cultural heritage institutions to think, not primarily within the boundaries of their particular collections, but in terms of what these collections might add to a bigger, complex and distributed information continuum coupled with various contextual resources.”[Concordia] Every aspect of this ambition offers major challenges: technical, legal, policy level, linguistic, financial, etc. All this has meant that a fully linked open data position could only be achieved in an iterative fashion and by keeping providers involved at all times.
As a technical starting point, Europeana carried out a linked open data pilot project. The remainder of this paper gives an outline of the processes of the pilot project and areas for future work.
From Local Data Standards to the Europeana Semantic Element Set
The shift required to get from individual data silos with curated data to the distributed information continuum enabled by linked open data was not going to be accomplished in one giant leap for mankind. Providers use many different metadata formats, ranging from well-known, sophisticated, internationally maintained standards to home-grown formats that have evolved in one institution over time. Coming from different cultural heritage sectors, these standards also embody different views of the resources they describe: for example, event-based versus object-centric descriptions. The first step was therefore to demonstrate the possibility of interoperability between the silos and, in parallel, to start examining the many other aspects already mentioned. Not the least of these, but not covered in this paper, was to develop trust between partners and build a community with a common understanding of the aims of the enterprise, including the key concept of “open data.”
To achieve some measure of data interoperability, a common dataset was defined to which all participating providers could map a reasonably useful set of metadata. This initial metadata schema is the Europeana Semantic Elements (ESE). It is essentially a Dublin Core application profile: an element set based on a subset of the Dublin Core elements with several Europeana-specific fields added to support specific portal functionality. The ESE documentation can be found in the “Technical requirements” section of the Europeana website. This schema is the current metadata format used in the Europeana production system.
As a basic solution to Europeana’s interoperability problems, ESE suffers from many issues. First, it is a “flat” model that aggregates in one and the same record metadata fields that can apply to different entities. This breaks the “one-to-one” principle and causes great confusion; for example, some providers use rights- or date-related fields to give information about the “real-world” resources they hold, while others use the same fields for data about the digital resource that represents these items. Significantly for a linked data approach, most of the data provided contains simple string values for the metadata fields. Linked data depends on resources being identified with (HTTP) Uniform Resource Identifiers (URIs) in order to create the links. Simple string values prevent properly linking an item ingested by Europeana to other objects (e.g., a series of portraits), or to contextual entities represented by complex resources, e.g., a creator with many name variations, or a broader concept that is part of an online thesaurus, all of which could help improve access to Europeana items.
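The contrast between string values and URIs can be made concrete with a minimal Python sketch. The field names follow Dublin Core, but both records, their values, and the example.org URIs are invented for illustration and are not taken from Europeana data.

```python
from urllib.parse import urlparse


def is_http_uri(value: str) -> bool:
    """Return True if the value looks like an HTTP(S) URI that could be linked to."""
    parsed = urlparse(value)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def linkable_fields(record: dict) -> list:
    """List the fields whose values are URIs rather than plain strings."""
    return sorted(field for field, value in record.items() if is_http_uri(value))


# Hypothetical flat record: every value is a plain string.
flat_record = {
    "dc:title": "Portrait of a Lady",
    "dc:creator": "A. Painter",
    "dc:subject": "portraits",
    # Date of the painting or of the digital image? The flat model cannot say.
    "dc:date": "2009",
}

# The same description with contextual entities identified by made-up URIs.
linked_record = {
    "dc:title": "Portrait of a Lady",
    "dc:creator": "http://example.org/agent/a-painter",
    "dc:subject": "http://example.org/concept/portraits",
    "dc:date": "2009",
}

print(linkable_fields(flat_record))
print(linkable_fields(linked_record))
```

Only the second record exposes values that another dataset could link to; the flat record offers nothing to follow.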
From ESE to the Europeana Data Model (EDM)
The RDF-based Europeana Data Model (EDM) was developed by the Europeana community as an alternative to the ESE schema, with the aim of solving the shortcomings mentioned. The development process took full account of Europeana’s firm belief in the benefits of Semantic Web and Linked Data technology for the culture sector, which have been articulated in the reports of the W3C Library Linked Data Incubator Group. EDM is a more flexible and precise model than ESE: it offers the opportunity to attach every statement to the specific resource it applies to, and it also reflects some basic form of data provenance.
The main requirements identified for the development of EDM included:
- Distinguish between a “provided item” (painting, book) and digital representations.
- Distinguish between an item and the metadata record describing it.
- Allow ingesting multiple records for the same item, containing potentially contradictory statements about it.
- Provide support for contextual resources, including concepts from controlled vocabularies.
By providing the mechanism to distinguish all these aspects of a resource, EDM allows the representation of different perspectives on a given cultural object. It also enables the representation of complex—especially hierarchically structured—objects, as in the archive or library domains. Finally, it allows the representation of contextual information, in the form of entities (places, agents, time periods) explicitly represented in the data and connected to a cultural object.
Rather than systematically introducing new elements, EDM reuses and links to elements of existing reference vocabularies, such as the Open Archives Initiative Object Reuse and Exchange model (OAI-ORE), Dublin Core, and the W3C SKOS model for Knowledge Organization Systems. These various features are fully described in the EDM Primer on the Europeana Professional website.
The Linked Data Pilot – data.europeana.eu
As mentioned earlier, many issues stood in the way of the immediate adoption of Linked Open Data (LOD) in the Europeana production system:
- Lack of metadata expressed in EDM
- Missing links to other sources
- The absence of data provider agreements explicitly permitting the release of the metadata into the public domain.
As a proof of concept, a Europeana Linked Data Pilot was built at data.europeana.eu. It is technically decoupled from the Europeana production system and allows data providers who want to make their data available as Linked Open Data to opt for their metadata to be openly published on the Web.
The overall approach is shown in Figure 1 and an outline description follows. A fuller technical description can be found in the paper produced by Haslhofer and Isaac in 2011.
Extract the subset of ESE XML metadata submitted by the providers who had expressed the wish to become part of the pilot.
Convert the ESE data to EDM using the mapping that had been defined. This mapping also covered the creation of the EDM entities (items, aggregations, proxies), the assignment of dereferenceable HTTP URIs to these entities, and the attachment of the relevant metadata fields to each new entity. The mapping between ESE and EDM is implemented in an XML stylesheet. The result is an RDF/XML representation of each data provider’s metadata.
Two strategies are followed for linking data.europeana.eu resources with other web resources:
- Semantic enrichment data, which Europeana creates after it has ingested metadata from its data providers, is fetched. This data consists of links to four types of reference resources: GeoNames for places, GEMET for general topics, the Semium time ontology for time periods, and DBpedia for persons, currently generating over four million links. Since the enrichments are links, they perfectly fit EDM and the Linked Data approach.
- A simple ad hoc linking strategy is applied, whereby existing resource identifiers that are part of the metadata are used to create links to other Linked Open Data services that hold information about objects also served by data.europeana.eu. For the time being, this only concerns the Swedish cultural heritage aggregator (SOCH).
Data dumps are generated from the resulting RDF/XML files together with the supplied or generated links. These are made available as dump files and also ingested into an RDF store. Incoming HTTP requests are either answered by the RDF store (if they have an RDF-specific Internet media type in the HTTP Accept header field) or redirected to the Europeana portal (for standard HTML requests).
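The request handling in the last step can be sketched as a small routing function. This is a simplified illustration rather than the pilot's actual implementation: the list of RDF media types, the function name, and the return labels are assumptions, and full content negotiation would also weigh the quality values (q=) in the Accept header.

```python
# Media types treated as RDF-specific in this sketch (an assumed, non-exhaustive list).
RDF_MEDIA_TYPES = {
    "application/rdf+xml",
    "text/turtle",
    "application/n-triples",
}


def route_request(accept_header: str) -> str:
    """Decide where a request goes, based on its HTTP Accept header.

    Returns "rdf-store" if any listed media type is RDF-specific,
    otherwise "portal-redirect" (the standard HTML case).
    """
    for part in accept_header.split(","):
        # Drop parameters such as ";q=0.9" and normalize case and whitespace.
        media_type = part.split(";")[0].strip().lower()
        if media_type in RDF_MEDIA_TYPES:
            return "rdf-store"
    return "portal-redirect"


print(route_request("application/rdf+xml"))
print(route_request("text/html,application/xhtml+xml;q=0.9"))
```

A browser sending a typical HTML Accept header would be redirected to the portal, while a linked data client asking for RDF/XML would be served from the RDF store.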
EDM Modeling Patterns
Figure 2 shows the basic structures of EDM networked resources after the flat ESE data is transformed into EDM for the linked data pilot.
The following sections explain each resource further and indicate the properties that should be attached to their instances.
Item (Provided Cultural Heritage Object)
Item resources (typed as Provided Cultural Heritage Object (CHO)) represent objects (painting, book, etc.) for which institutions provide representations to be accessed through Europeana. Provided CHO URIs are the main entry points in data.europeana.eu. A Provided CHO is the hub of the network of relevant resources and, when applicable, will link out to other linked data resources about the same object via owl:sameAs statements. In the pilot, no descriptive metadata (creator, subject, etc.) is directly attached to object URIs. It is instead attached to the proxies that represent a view of the object, from a specific institution’s perspective (either a provider or Europeana itself).
Proxies originate from the OAI-ORE model and are used to separate the item itself from the descriptive statements (creator, subject, date, etc., mostly coming from ESE’s Dublin Core fields) for the item, which are contributed by a provider. They enable the separation of different views of the same item that may be the focus of multiple aggregations from different providers. In every case, there will be one proxy for the provider descriptive data for an item and another for the data created by Europeana.
Provider aggregation resources provide data related to a provider’s gathering of digitized representations and descriptive metadata for an item. They are related to digital resources about the item, be they files directly representing it or webpages showing the object in context. They may also provide controlled rights information applying to these resources. Finally, provenance data is given in statements using the specific EDM properties.
Europeana proxies are the second type of proxies served at data.europeana.eu. They provide access to the metadata created by Europeana for a given item, distinct from the original metadata from the provider. Here, one can find statements indicating a normalized date associated with the object. These proxies also have statements that link them to places, concepts, persons, and periods from external datasets, as mentioned earlier.
A Europeana aggregation bundles together the result of all data creation and aggregation efforts for a given item, both the provider’s and Europeana’s own. It aggregates the provider’s aggregation, which in turn will connect to the provider’s proxy. Not shown in the diagram, but linked to the provider aggregation, are the digitized resources europeana.eu serves for the item.
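The pattern described in this section can be summarized in a short Python sketch that turns one flat record into the five connected resources. The URI path layout, the identifier, the field values, and the function name are hypothetical. The property names (ore:proxyFor, ore:proxyIn, ore:aggregates, edm:aggregatedCHO, edm:year) come from the OAI-ORE and EDM vocabularies, but their exact use in the pilot is assumed here; the EDM Primer remains the authoritative reference.

```python
BASE = "http://data.europeana.eu"  # host from the text; the path layout below is assumed


def flat_record_to_edm(local_id: str, provider_fields: dict, europeana_fields: dict) -> list:
    """Return (subject, predicate, object) triples for one provided item."""
    item = f"{BASE}/item/{local_id}"
    provider_proxy = f"{BASE}/proxy/provider/{local_id}"
    europeana_proxy = f"{BASE}/proxy/europeana/{local_id}"
    provider_aggr = f"{BASE}/aggregation/provider/{local_id}"
    europeana_aggr = f"{BASE}/aggregation/europeana/{local_id}"

    triples = [
        # Each proxy stands for the item within one aggregation.
        (provider_proxy, "ore:proxyFor", item),
        (provider_proxy, "ore:proxyIn", provider_aggr),
        (europeana_proxy, "ore:proxyFor", item),
        (europeana_proxy, "ore:proxyIn", europeana_aggr),
        # Both aggregations are about the same provided item.
        (provider_aggr, "edm:aggregatedCHO", item),
        (europeana_aggr, "edm:aggregatedCHO", item),
        # The Europeana aggregation bundles the provider's aggregation.
        (europeana_aggr, "ore:aggregates", provider_aggr),
    ]
    # Descriptive statements go on the proxies, never on the item URI itself.
    triples += [(provider_proxy, p, o) for p, o in provider_fields.items()]
    triples += [(europeana_proxy, p, o) for p, o in europeana_fields.items()]
    return triples


triples = flat_record_to_edm(
    "00000/ABC123",                                       # made-up identifier
    {"dc:creator": "A. Painter", "dc:date": "ca. 1650"},  # provider's description
    {"edm:year": "1650"},                                 # Europeana's normalized date
)
for triple in triples:
    print(triple)
```

Note that the item URI never appears as the subject of a descriptive statement, which mirrors the choice made in the pilot.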
Issues and Future Work
Achieving Fully Open Data
When the linked data pilot was first launched in June 2011, it contained 3.5 million objects taken from the datasets of volunteer institutions. These could not be released under fully open terms due to the evolving understanding of the Data Exchange Agreement. In February 2012, a second version of data.europeana.eu was released that, although still a pilot, contains fully open metadata (CC0 – public domain dedication). It has a smaller but still substantial subset of cultural heritage data, at 2.4 million objects, but this must be seen in context: the qualitative step of having fully open publication is crucial to Europeana and forms the basis of an active advocacy campaign to persuade more of the community to open their data for the benefit of end users. In order to make the message more accessible to the public, an animation explaining the connection between linked data technology and open data policies has been released. A virtuous circle is envisaged in which third parties use the open data to develop innovative applications and services, which in turn stimulates end users’ interest in digitized heritage, and this, in turn, demonstrates to cultural heritage institutions the value of releasing more open data.
Improving Connectivity of the Data
Source data in Europeana is of varying degrees of richness and is all mapped to ESE, which is based on simple text string values. While this achieves interoperability, it often entails losing some of the richness of the more detailed formats. In particular, it means any provider that has used contextual resources (authority files, thesauri, etc.) will have lost those relationships. In the context of the linked data pilot this means that internal connectivity is very low. Links exist between the provided CHO, aggregation, and proxy resources that come with the EDM model, but there are no “semantic” links between the items or the proxies that represent them. Ideally, many provider contextual resources could be fed into Europeana together with the object metadata and provide internal links. These include, among others, concepts from shared domain thesauri or place resources, which are already used in the descriptions of different objects in a collection or even across collections. Publishing the data together with its companion thesaurus and authority file has already been demonstrated in the Amsterdam Museum Linked Open Data prototype. Europeana is currently working on this and there are case studies on the Europeana Professional website that show how it can be done.
For achieving external connectivity, Europeana’s enrichment process is used and this generates semantic links from specific fields in the ESE data. Because it has to deal with very heterogeneous collections, Europeana is bound, for the moment, to use simple data enrichment techniques despite the associated errors. Improving the way the creation of enrichment values is handled would improve this situation, however. In addition, alignments will be extended to other resources such as the Virtual International Authority File (VIAF) and other relevant initiatives from the community.
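The trade-off can be illustrated with a deliberately naive enrichment routine. The text does not specify which techniques Europeana uses, so exact label matching is assumed here purely for illustration, and the vocabulary, labels, and example.org URIs are invented.

```python
# Hypothetical reference vocabulary: normalized label -> candidate resource URIs.
PLACE_VOCABULARY = {
    "paris": [
        "http://example.org/place/paris-france",
        "http://example.org/place/paris-texas",
    ],
    "the hague": ["http://example.org/place/the-hague"],
}


def enrich_place(value: str):
    """Link a free-text place name to a vocabulary URI by exact label match.

    Returns the first candidate URI, or None if the label is unknown.
    Always taking the first candidate is what makes the technique error-prone.
    """
    candidates = PLACE_VOCABULARY.get(value.strip().lower(), [])
    return candidates[0] if candidates else None


print(enrich_place("The Hague"))  # unambiguous label
print(enrich_place("Paris"))      # ambiguous label: the chosen link may be wrong
print(enrich_place("Den Haag"))   # name variant missing from the vocabulary
```

The second call shows the kind of error a simple technique accepts, and the third shows the kind of link it misses. Both would be more tractable if providers supplied identifiers for their own contextual resources instead of strings.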
An important Europeana requirement is to communicate meta-level information about the data it publishes. Europeana’s mission includes becoming a trusted source of information, while encouraging more open data circulation in the culture sector. To this end, provenance and licensing information are crucial—whether about the cultural items being accessed or about the metadata on these items.
Linked Data technology still lacks a fully standardized suite to express such meta-level information, so various existing solutions (based on OAI-ORE resource maps) were combined to supply the required licensing and provenance data for the present. This choice is similar to that made by other institutions, for example, in the New York Times’ linked data service. Relevant ongoing efforts (the W3C Provenance Working Group and DCMI’s Provenance Task Group) are being followed in the hope of adopting a fully consensual approach in the future.
Complexity and Navigability
The requirements for EDM to distinguish different data sources and apply the data precisely to different resources result in the creation of complex networks of aggregations, proxies, and other resources. This has many benefits but it also raises the barrier to data access and consumption. As well as adding extra complexity to the RDF graphs published, the proxy pattern is counter-intuitive for linked data practitioners. It causes confusion in particular when statements about something (for example, a painting) turn out to be attached to a resource that does not, strictly speaking, stand for that painting (i.e., the proxy). The temptation was to simplify the task for linked data consumers by duplicating the statements attached to the proxies onto the “main” resource for the provided item, thereby allowing direct access to the statements, i.e., not mediated through proxies. Although this was not done, it may still happen in response to feedback from data consumers. In the longer term, it is hoped that W3C will standardize named graphs for RDF, thereby allowing EDM to meet the requirement to track provenance without the need for proxies.
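The simplification that was considered, copying proxy statements onto the item resource, can be sketched as follows. The triples and example.org URIs are hypothetical, and the function only illustrates the idea; it does not describe anything deployed in the pilot.

```python
def copy_proxy_statements(triples: list) -> list:
    """Return new triples that restate each proxy's descriptive statements on its item."""
    # Map each proxy to the item it stands for.
    proxy_to_item = {s: o for s, p, o in triples if p == "ore:proxyFor"}
    # Structural links describe the proxy itself and are not copied.
    structural = {"ore:proxyFor", "ore:proxyIn"}
    return [
        (proxy_to_item[s], p, o)
        for s, p, o in triples
        if s in proxy_to_item and p not in structural
    ]


ITEM = "http://example.org/item/1"
triples = [
    ("http://example.org/proxy/provider/1", "ore:proxyFor", ITEM),
    ("http://example.org/proxy/provider/1", "dc:creator", "A. Painter"),
    ("http://example.org/proxy/europeana/1", "ore:proxyFor", ITEM),
    ("http://example.org/proxy/europeana/1", "edm:year", "1650"),
]

for triple in copy_proxy_statements(triples):
    print(triple)
```

The copied statements are directly reachable from the item, but they no longer record which proxy, and therefore which institution, contributed them. That is precisely the provenance information the proxies exist to preserve.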
HTTP URL Design
The transition from Europeana URIs to dereferenceable HTTP URIs for EDM aggregations and proxies was a major challenge in the conversion process. The main Europeana production system and the Europeana Linked Open Data prototype are still two distinct systems, so a bridge was needed between the identification mechanisms in place. Europeana’s local identifiers were therefore used for the dereferenceable URIs in the Europeana LOD prototype. This resulted in persistence difficulties when collections were reharvested. Both the LOD infrastructure and the underlying Europeana identification mechanism will have to find better strategies in the future.
Integration with other Data
A further area of future work will be the compatibility between Europeana data and other initiatives that promote the availability of structured metadata on the Web, such as schema.org. This should increase the visibility of Europeana on the Web.
Data modeling and description practices differ across the cultural heritage sector, varying in levels of granularity, focus of interest, use of standards, and application of vocabularies. It was important that the solution chosen by Europeana should reuse existing standards and be flexible enough in its approach to interoperability to allow their coexistence with custom ones from across the sector. Because Europeana wants to reuse and be reused, a web-based open technology was ideal to make it simple to connect data together and share it. Such semantic web and linked data technologies directly relate to open data strategies.
The Europeana linked data pilot produced a body of open metadata represented in the EDM. This allows the representation of different perspectives and basic provenance information on any given cultural object. It is anticipated that future data.europeana.eu dataset releases will reflect the lessons learned with respect to the model’s complexity, dealing with provenance and increasing Europeana’s internal and external connectivity. Key contributions will include applying more semantic enrichment techniques and aggregating richer EDM metadata from data providers instead of flat ESE records.
Europeana is a strong advocate of the benefits of semantic web and linked data technology for the culture sector and the associated opportunities for opening data for imaginative reuse by third party developers and end users. By developing its data model based on these principles and producing this pilot set of data, Europeana has put in place the foundations for building a shared semantic context for cultural heritage data.
Antoine Isaac (email@example.com) is Scientific Coordinator at Europeana, The Hague, The Netherlands.
Robina Clayphan (firstname.lastname@example.org) is Interoperability Manager at Europeana Foundation.
Bernhard Haslhofer (bernhard.haslhofer@cornell.edu) is Fellow Postdoc Associate (Marie Curie) at Cornell University.