Endangered But Not Too Late: The State of Digital News Preservation

How well are we preserving the public record of events as recorded by local newspapers, radio, and television stations? A new report laments a situation in news organizations that others in the information community will also recognize. In the absence of archivists or professional librarians on staff, news content is inadequately preserved, as organizations move too hastily “in the race to adapt to ever-shifting demands of digital publishing or caught in the grinding gears of hurried migrations from one content management system to another, often without even realizing what’s lost …”

Funded by a grant from The Andrew W. Mellon Foundation, Endangered But Not Too Late: The State of Digital News Preservation sounds a serious alarm about the looming crisis resulting from the pressures on an industry charged with delivering news coverage in a variety of digital formats. Based on interviews and consultations with approximately 40 newsrooms and related organizations, as well as with the Library of Congress and the Center for Research Libraries, the report documents an ongoing loss of digital news content, our “first draft of history.” It also documents efforts being made to mitigate such losses in the face of ever-changing technology and the severe financial constraints affecting the news industry.

Briefly, the report found that current inadequacies emerge from a range of variables that will seem very familiar to those working in digital information environments.

  • Newsrooms save some but not all content; what’s generally retained is a version of the final broadcast—mostly text, images, and video.

  • Preservation of the content is primarily for internal use in these organizations; public access is important but is generally outsourced.

  • The top technology challenge for these organizations has been their need to rapidly build support for funneling content into multiple channels, even as those channel technologies have been developing.

  • Web content management systems (CMSs), generally built internally by the organization and highly customized, have become the single most important publishing platform for news. Some newsrooms may be using those systems for archival storage, while others are using digital asset management and media asset management systems.

  • Regular (even frequent) migration between systems creates additional loss of content.

  • Poor-quality metadata, resulting from system migration, poor data conversion, and inadequate training of staff, causes content to become invisible to search systems.

  • Financial considerations in the industry mean that preservation of content may not be a priority. Staff who used to serve in such roles, whether librarians or archivists, have largely been let go. Instead, reliance on information services, such as NewsBank, is common practice when news organizations require background on stories.

  • The bulk of the news organizations surveyed have no preservation policies at all. 

Speaking directly to decision-makers in these organizations, the research team expressed one of the report’s key concerns:

We found that the degree to which your existing content is accessible and useful depends not only on the technologies used, but also on your policies, if any, about what is saved. Other factors that affect access to content include the workflows used to assemble and store content, the metadata that’s saved with your content—or missing, depending on how it is managed— whether or not you have staff dedicated to preservation work, and how well content translates when you undergo a transition from one technology platform to another, an inevitable fact of life in today’s publishing industry. 

An equally telling quote was this:

It’s important to note here that there are essentially no standards in the industry for the kinds of technology systems used to preserve digital news content. What we found in our interviews is a wide range of systems and approaches, from formal archives to essentially no specific tool or system, only production platforms doing their primary tasks and doubling as the place where content is kept. 

The recommendations included in this extensive, 150-page report cover steps that may be taken immediately, steps that may be handled over a slightly longer timeframe, and finally steps requiring a broad industry-wide approach. These recommendations will be familiar to those working in the information community: complete and accurate metadata that conveys origin and provenance as well as ongoing licensing requirements; investment in digital asset management platforms independent of publishing platforms; and the creation of industry guidelines and best practices for preservation of the public record.

What Preservation Efforts Exist?

The library community is pouring significant effort into preservation, but it’s rather like putting one’s finger in a hole in the dike holding back the sea. There are too few entities to effectively preserve the content in appropriate formats and renderings.

With regard to preservation of the news in PDF file formats, the key quote is:

The Library of Congress collects between 300-350 daily papers from the United States in the PDF format. However, saving PDFs alone is an insufficient answer to preserving born-digital news, since they contain only the print version of stories and leave out broadcast news sources entirely. Anecdotal evidence from this study’s interviews estimates that PDFs of newspaper print editions might contain somewhere between half and three-quarters of the news content represented at a given news organization’s website. While PDFs have the advantages of a standardized format that is well-suited to existing digital preservation models, they lack significant amounts of text and photo content published online, not to mention photo galleries, audio, video, interactive presentations, databases, and more that can be delivered via web pages. 

With regard to archiving of news content published via the web:

Together with research partners such as the GDELT Project, the Internet Archive has built a list of approximately 170,000 URLs from news organization websites in over 200 countries. However, even the Internet Archive doesn’t currently have the resources to find and preserve more than a sampling of those web pages. Even if the IA could preserve every page of news content, copyright laws put restrictions on who can access them and how they can be used or even archived. News organizations are also moving toward placing greater restrictions on public access by using paywalls, meaning anyone who wishes to see their content will need to purchase a subscription. 

Regarding the preservation of television broadcasts, the report commends the ongoing initiatives of organizations such as the Vanderbilt Television News Archive (VTVNA), the Internet Archive, and the American Archive of Public Broadcasting (AAPB) in preserving and storing digital content, some of which may occupy well above 250 terabytes of storage.

Where Do We Go From Here?

This report is lengthy, but only because it so thoroughly documents the rationale for its 20 pages of final recommendations. Those recommendations outline a variety of steps that organizations should begin taking now to prevent the loss of present-day coverage of events, coverage from which others will need to build understanding in the future.

One recommendation for the long term did jump out from the closing pages of the report. News organizations should be working with universities to create or expand programs and training for digital archivists. Employers need candidates who are able to function in broader roles that blend archiving skills with data analysis, research, training, and strategic functions for technology management. Information professionals with those skills are part of a logical solution to these issues. 

The full report is available for download and is recommended reading for anyone with an interest in ensuring that news coverage is accessible now and in the coming years. 

NISO NOTE: NISO member organizations mentioned in the report include NewsBank, ProQuest, Vanderbilt University Libraries, Internet Archive, and the Library of Congress.