
Archive for the ‘preservation’ Category

Introduction to NISO webinar on ebook preservation

Wednesday, May 23rd, 2012

Below are my welcoming remarks to the NISO webinar on Heritage Lost?: Ensuring the Preservation of Ebooks on May 23rd.

“Good afternoon and welcome to the second part of this NISO Two-Part Webinar on Understanding Critical Elements of E-books: Acquiring, Sharing, and Preserving.  This part is entitled Heritage Lost? Ensuring the Preservation of E-books.

Perhaps because electronic journals were adopted much earlier and more rapidly, we are more familiar with the archiving and preservation of e-journal content than of e-book content. However, just as it did in the late 1990s after e-journals became prevalent, the topic of e-book preservation is now rising in the minds of people deeply concerned with the long-term preservation of cultural materials.

That is not to say that no one is considering these issues. Some of the bigger digitization projects involve libraries and, as such, include preservation as part of their mission. I’m thinking in particular of the Internet Archive, Portico, and the HathiTrust in this regard, but there are certainly others. Today we’ll hear from two of these groups about what they are doing to support e-book preservation.

Another big preservation issue that is frequently overlooked is the distribution model that many publishers are moving toward: a license model rather than a sale model. I won’t get into either the legal or business rationale for this shift, but I do want to focus on its implications for preservation, and in particular for publishers. An important analogy that I make to publishers is that of renting a house versus selling a house. When a publisher sells a house (in this case a book), it passes all responsibility for the house and its upkeep onto the new owner. If a person instead rents that same house, the responsibility for fixing the leaking roof, painting the walls, and repairing the broken windows generally falls back to the landlord. Obviously, there is money to be made and the terms of the lease affect who is responsible for what, but in general the owner remains the person primarily responsible for the major upkeep of the house.

In the case of the sale of a book, the publisher is no longer responsible for that item; responsibility for its preservation passes to the new owner, say the library. It is then up to the library to ensure that the book doesn’t fall apart, that the cover stays clean, and that the pages don’t rip. However, as we move to a license environment, the long-term responsibility for upgrading file formats and for continuing to provide access and functionality falls back to the publisher. The publisher is the landlord, renting e-books to the library community. And this responsibility requires a great deal more effort than simply hosting the file. Publishers will eventually need to repaint, to refurbish, to fix the broken plumbing, so to speak, of this digital collection. I expect that this will be no small feat, and something that few publishers are prepared to address.

The Library of Congress has begun thinking about this problem from the perspective of its mandatory deposit requirement, related to copyright registration, for LC’s own collection. While LC is at the moment focused on electronic-only journals, one can envision that electronic-only books are not that far away. LC has not explicitly discussed e-book preservation, and its current work is focused only on e-journals. However, the problems LC is facing are illustrative of the larger issues it will likely confront. There are standards for journal article markup in XML, such as the soon-to-be-released Journal Article Tag Suite (JATS), formerly the NLM DTD. This project, developed by the National Library of Medicine in the US, was specifically focused on developing an archival tagging model for journal article content distribution and preservation. There is no similarly widely adopted model for books. And if the variation in journal markup is significant, the complexity of book content is exponentially greater.
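
To make the tagging model a bit more concrete, here is a minimal sketch (using Python’s standard library) of pulling basic descriptive metadata out of a JATS-tagged article. The file name is hypothetical and the element paths are the typical NLM/JATS ones, assuming the usual front matter is present; real instances vary, which is exactly why an agreed archival profile matters.

    import xml.etree.ElementTree as ET

    # Parse a hypothetical JATS/NLM-tagged article (the file name is illustrative).
    tree = ET.parse("article-jats.xml")
    article = tree.getroot()  # <article>

    # Typical JATS paths for descriptive metadata; individual articles may
    # omit or rearrange these elements.
    meta = article.find("front/article-meta")
    title = meta.findtext("title-group/article-title", default="(no title)")
    doi = meta.findtext("article-id[@pub-id-type='doi']", default="(no DOI)")
    journal = article.findtext("front/journal-meta/journal-title", default="(no journal title)")

    print(title)
    print(doi)
    print(journal)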

No archive can sustain a stream of ingest from hundreds or thousands of publishers without standards. It is simply unmanageable to accept any file in any format from thousands of publishers. This, of course, is where standards come in. Although standards aren’t the forefront of either of our presentations today, they sit there in the not-so-distant background.

There has also been a great deal of focus over the past year on the adoption of the new EPUB 3.0 specification. This is a great advancement, and it will certainly help speed the adoption of e-books and their overall interoperability with existing systems. However, it should be clear that EPUB is not designed as an archival format. Many of the things that would make EPUB 3 archival exist within the structure, but their inclusion by publishers is optional, not mandatory. This is much the same as PDF, where accessibility and archiving functionality is possible but most publishers don’t take advantage of it. We as a community need to develop preservation profiles of EPUB that publishers can target, if not for distribution, then at least for their long-term preservation purposes, both internally and externally.
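
As a rough illustration of what such a profile might check, here is a sketch in Python that opens an EPUB (which is a ZIP archive), locates the OPF package document via META-INF/container.xml, and reports which Dublin Core metadata elements the publisher actually supplied. The list of elements checked is my own assumption for illustration, not a defined preservation profile.

    import zipfile
    import xml.etree.ElementTree as ET

    NS = {
        "c": "urn:oasis:names:tc:opendocument:xmlns:container",
        "opf": "http://www.idpf.org/2007/opf",
        "dc": "http://purl.org/dc/elements/1.1/",
    }

    def inspect_epub(path):
        """Report which common Dublin Core fields an EPUB package declares."""
        with zipfile.ZipFile(path) as z:
            # container.xml points at the OPF package document inside the archive.
            container = ET.fromstring(z.read("META-INF/container.xml"))
            opf_path = container.find("c:rootfiles/c:rootfile", NS).get("full-path")
            package = ET.fromstring(z.read(opf_path))

        metadata = package.find("opf:metadata", NS)
        # Elements a hypothetical preservation profile might insist on.
        for name in ("identifier", "title", "language", "publisher", "date", "source", "rights"):
            element = metadata.find(f"dc:{name}", NS)
            present = element is not None and (element.text or "").strip()
            print(f"dc:{name}: {'present' if present else 'missing'}")

    # inspect_epub("sample.epub")  # file name is illustrative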

This will be a long-term project for us to be engaged in, and something we need to focus concerted attention on, because preservation isn’t the first thing on content creators’ minds. However, we should be able to continue to press the issue and make progress.”

Mandatory Copyright Deposit for Electronic-only Materials

Thursday, April 1st, 2010

In late February, the Copyright Office at the Library of Congress published a new rule that expands the mandatory deposit requirement to include items published only in digital format. The interim regulation, Mandatory Deposit of Published Electronic Works Available Only Online (37 CFR Part 202 [Docket No. RM 2009–3]), was released in the Federal Register. The Library of Congress will focus its first attention on e-only deposit of journals, since this is the area where electronic-only publishing is most advanced. Very likely, this will move into the space of digital books as well, but it will take some time to coalesce.

I wrote a column about this in Against the Grain last September outlining some of the issues that this change will raise. A free copy of that article is available here. The Library of Congress is aware of these challenges, and will become painfully more so when this stream of online content begins to flow its way. To build an understanding of the new regulations, LC is hosting a forum in Washington in May to discuss publishers’ technology for providing these data on a regular basis. Below is the description of the meeting that LC provided.

Electronic Deposit Publishers Forum
May 10-11, 2010
Library of Congress — Washington, DC

The Mandatory deposit provision of the US Copyright Law requires that published works be deposited with the US Copyright Office for use by the Library of Congress in its collection.  Previously, copyright deposits were required only for works published in a physical form, but recently revised regulations now include the deposit of electronic works published only online.  The purpose of this workshop is to establish a submission process for these works and to explore technical and procedural options that will work for the publishing community and the Library of Congress.

Discussion topics will include:

  • Revised mandatory deposit regulations
  • Metadata elements and file formats to be submitted
  • Proposed transfer mechanisms

Space for this meeting is very limited, but if you’re interested in participating in the meeting, you should contact the Copyright Office.

The Memento Project – adding history to the web

Wednesday, November 18th, 2009

Yesterday, I attended the CENDI/FLICC/NFAIS Forum on the Semantic Web: Fact or Myth hosted by the National Archives. It was a great meeting with an overview of ongoing work, tools, and new initiatives. Hopefully, the slides will be available soon, as there was frequently more information than could be expressed in the 20-minute presentations, and many presenters listed what are likely useful references for more information. Once the slides are available, we’ll link through to them.

During the meeting, I had the opportunity to run into Herbert Van de Sompel, who is at the Los Alamos National Laboratory. Herbert has had a tremendous impact on the discovery and delivery of electronic information. He played a critical role in creating the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), the Open Archives Initiative Object Reuse & Exchange specifications (OAI-ORE), the OpenURL Framework for Context-Sensitive Services, the SFX linking server, the bX scholarly recommender service, and the info URI scheme.

Herbert described his newest project, which has just been released, called the Memento Project. Memento proposes a “new idea related to Web Archiving, focusing on the integration of archived resources in regular Web navigation.” In chatting briefly with Herbert, I learned that the system uses a browser plug-in to view the content of a page as of a specified date. It does this by using the underlying content management system’s change logs to recreate what appeared on a site at a given time. The team has also developed some server-side Apache code that handles these requests for content management systems that keep version histories. If the server is unable to recreate the requested page, the system can instead point to a version of the content from around that date in the Internet Archive (or other similar archive sites). Herbert and his team have tested this using a few wiki sites. You can also demo the service from the LANL servers.
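
For a sense of how this works on the wire, here is a small Python sketch of the kind of datetime negotiation Memento proposes: the client asks a “TimeGate” for a resource as it existed at a given moment by sending an Accept-Datetime header and following the redirect to an archived version. The TimeGate URL below is a placeholder, and the header details may differ from the implementation Herbert describes.

    import urllib.request

    # Placeholder TimeGate URL; a real deployment would publish its own endpoint.
    TIMEGATE = "http://example.org/timegate/http://example.com/page"

    request = urllib.request.Request(
        TIMEGATE,
        headers={"Accept-Datetime": "Wed, 18 Nov 2009 12:00:00 GMT"},
    )

    # urllib follows the TimeGate's redirect to the archived ("memento") version.
    with urllib.request.urlopen(request) as response:
        print("Resolved to:", response.geturl())
        print("Memento-Datetime:", response.headers.get("Memento-Datetime"))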

Here is a link to a presentation that Herbert and Michael Nelson (co-collaborator on this project, at Old Dominion University) gave at the Library of Congress on this project. There was also a news story about this project. A detailed paper that describes the Memento solution is available on the arXiv site, and there is an article on Memento in the New Scientist. Finally, tomorrow (November 19, 2009 at 8:00 AM EST), there will be a presentation on this at OCLC as part of their Distinguished Seminar Series, which will be available online for free (RSVP required).

This is a very interesting project that addresses one of the key problems with archiving web pages: their content frequently changes. I am looking forward to the team’s future work and hoping that the project gets broader adoption.

Trust but verify: Are you sure this document is real?

Tuesday, November 3rd, 2009

Continuing on the theme of the “leaked” document from a systems supplier in the community that was posted last week: one thing that few people initially asked about this document is, “Is it real?” In this case, not 24 hours after the document was “released,” the author confirmed that he had written it and that it had been circulating for some time. However, it is amazing the stir that can be started by posting a PDF document anonymously on the Wikileaks website, regardless of its provenance.

Last week was the 40th anniversary of the “birth” of the internet, when two computers were first connected using a primitive router and the first message was transmitted between them: “Lo”. The researchers were trying to send the command “Login”, but the systems crashed before the full message was sent. Later that evening, they were able to get the full message through, and with that the internet, in a very nascent form, was born. During a radio interview that week, Dr. Leonard Kleinrock, Professor of Computer Science at UCLA and one of the scientists working on those systems that night, spoke about the event. During the questions, Dr. Kleinrock was asked about the adoption of IP version 6. His response was quite fascinating:

Dr. KLEINROCK: Yes. In fact, in those early days, the culture of the Internet was one of trust, openness, shared ideas. You know, I knew everybody on the Internet in those days and I trusted them all. And everybody behaved well, so we had a very easy, open access. We did not introduce any limitations nor did we introduce what we should have, which was the ability to do strong user authentication and strong file authentication. So I know that if you are communicating with me, it’s you, Ira Flatow, and not someone else. And if you send me a file, I receive the file you intended me to receive.

We should’ve installed that in the architecture in the early days. And the first thing we should’ve done with it is turn it off, because we needed this open, trusted, available, shared environment, which was the culture, the ethics of the early Internet. And then when we approach the late ‘80s and the early ‘90s and spam, and viruses, and pornography and eventually the identity theft and the fraud, and the botnets and the denial of service we see today, as that began to emerge, we should then slowly have turned on that authentication process, which is part of what your other caller referred to is this IPV6 is an attempt to bring on and patch on some of this authentication capability. But it’s very hard now that it’s not built deep into the architecture of the Internet.

The issue of provenance has been a critical gap in the structure of the internet from the very beginning. At the outset, when the number of computers and people connected to the network was small, requiring authentication and validation would have been a significant barrier to a working system. If you know and trust everyone in your neighborhood, locking your doors is an unnecessary hassle. In a large city, where you don’t know all of your neighbors, locking your doors is a critical routine that becomes second nature. In our digital environment, the community has gotten so large that locking the doors, that is, using authentication and passwords to ensure you are who you claim to be, is essential to a functioning community.

Unfortunately, as Dr. Kleinrock notes, we are in a situation where we need to patch some of the authentication and provenance holes in our digital lives. This brings me back to the document that was distributed last week via Wikileaks.

There is an important need, particularly in the legal and scientific communities, for provenance to be assured. With digital documents, which are easily manipulated or created and distributed anonymously, confirming the author and source of a document can be difficult. Fortunately, in this case, the authorship could be and was confirmed easily and quickly enough. However, in many situations this is not the case, particularly for forged or manipulated documents. Even when denials are issued, there is no way to prove the negative to a doubtful audience.

The tools for creating extremely professional-looking documents are ubiquitous. Indeed, the same software that most publishing companies use to create formally published documents is available to almost anyone with a computer. It would not be difficult to create one’s own “professional” documents and distribute them as real. The internet is full of hoaxes of this sort, and they run the gamut from absurd, to humorous, to quite damaging.

There have been discussions about the need for better online provenance information for nearly two decades now. While some work on provenance metadata, including PREMIS, METS, and DCMI, is gaining broader adoption, significant work on standards remains regarding the authenticity of documents. The US Government, through the Government Printing Office, has made progress with the GPO Seal of Authenticity and digital signature/public key technology in Acrobat versions 7.0 and 8.0. In January 2009, GPO digitally signed and certified PDF files of all versions of Congressional bills introduced during the 111th and 110th Congresses. Unfortunately, these types of authentication technologies have not been broadly adopted outside the government. The importance of provenance metadata was also re-affirmed in a recent Arizona Supreme Court case.
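
The underlying idea of those GPO signatures, checking a document against a publisher’s public key, can be illustrated in a few lines. This sketch uses the third-party Python “cryptography” package and a detached RSA signature over a file; it is a generic illustration of public-key verification, not the GPO’s Acrobat-based workflow, and the file names are placeholders.

    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives import hashes, serialization
    from cryptography.hazmat.primitives.asymmetric import padding

    # Placeholders: the document, a detached signature over it,
    # and the signer's public key in PEM format.
    with open("document.pdf", "rb") as f:
        document = f.read()
    with open("document.pdf.sig", "rb") as f:
        signature = f.read()
    with open("publisher_public_key.pem", "rb") as f:
        public_key = serialization.load_pem_public_key(f.read())

    try:
        # Raises InvalidSignature if the document or signature was altered.
        public_key.verify(signature, document, padding.PKCS1v15(), hashes.SHA256())
        print("Signature is valid: the document matches what the publisher signed.")
    except InvalidSignature:
        print("Signature check FAILED: the document cannot be trusted as-is.")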

Although it might not help in every case, knowing the source of a document is crucial in assessing its validity. Until standards are broadly adopted and relied upon, a word of warning to the wise about content on the Internet: “Trust but verify.”

Magazine publishing going digital only — PC Magazine to cease print

Wednesday, November 19th, 2008

Another magazine announced today that it will cease publication of its print edition. In an interview with the website PaidContent.org, Ziff Davis CEO Jason Young announced that PC Magazine will cease distribution of its print edition in January.

PC Magazine is just one of several mass-market publications that are moving to online only distribution. Earlier this week, Reuters reported that a judge has approved the reorganization of Ziff Davis, which is currently under Chapter 11 bankruptcy protection. There was some speculation about the future of Ziff Davis’ assets.

From the story:

The last issue will be dated January 2009; the closure will claim the jobs of about seven employees, all from the print production side. None of the editorial employees, who are now writing for the online sites anyway, will be affected.

Only a few weeks ago, the Christian Science Monitor announced that it would be ending print distribution. The costs of producing and distributing paper have always been a significant expense for publishers, and in a period of decreasing advertising revenues, lower circulation, and higher production costs, we can expect more publications to head in this direction.

Within the scholarly world in particular, I expect that the economics will drive print distribution to print-on-demand for those who want to pay extra, but overall print journals will quickly become a thing of the past. I know a lot of people have projected this for a long time. ARL produced an interesting report written by Rick Johnson last fall on this topic, and it appears we’re nearing the tipping point Rick described in that report.

This transition makes the ongoing work on preservation, authenticity, reuse, and rights all the more critical, particularly as it relates to the differences between print and online distribution.

EU Research Data Preservation Project Seeks Survey Input from Publishers

Tuesday, November 11th, 2008

PARSE.Insight, a European Union project initiated in March 2008 “to highlight the longevity and vulnerability of digital research data,” is conducting an online survey about access and storage of research data.

PARSE.Insight is “concerned with the preservation of digital information in science, from primary data through analysis to the final publications resulting from the research. The problem is how to safeguard this valuable digital material over time, to ensure that it is accessible, usable and understandable in future.”

They are interested in getting publishers’ views included in their survey, in addition to researchers, since publishers play a critical role in the digital preservation of publications and related research data.

The survey is available here:
https://www.surveymonkey.com/s.aspx?sm=VfIpOoxogOv73uWOyaOhoQ_3d_3d

Responses are aggregated for analysis and anonymized. If you wish to be informed about the results of the survey, you can enter your e-mail address at the end of the survey.

Ultimately, PARSE.Insight plans “to develop a roadmap and recommendations for developing the e-infrastructure in order to maintain the long-term accessibility and usability of scientific digital information in Europe.”

Posted by Cynthia Hodgson

Amazing digital conversion presentation at Code4Lib

Wednesday, February 27th, 2008

I am sitting at the Code4Lib meeting in Portland, and I’ve just seen an amazing presentation by Andrew Bullen, a librarian and programmer at the Illinois State Library. Starting from scanned images of the Pullman archive’s sheet music collection, he used music-translation software to output MIDI files and from them produced piano renditions of the music. Then, using the acoustic profile of a local mansion/hotel owned by the Pullman family, he created an MP3 file of the results. Not knowing how to read music or how to play piano, he has created a fantastic audio rendering of the sheet music. Here is a link to the video. It is incredible. Well done, Andrew!
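
For anyone curious about the general shape of such a pipeline, here is a small sketch using the music21 toolkit. It assumes the scanned pages have already been run through optical music recognition into MusicXML (the file name is hypothetical) and simply renders that score to MIDI; the acoustic-modeling step is not shown, and this is not necessarily the toolchain Andrew used.

    from music21 import converter

    # A MusicXML file produced by optical music recognition from the scanned
    # sheet music (hypothetical file name; OMR itself is a separate step).
    score = converter.parse("pullman_sheet_music.musicxml")

    # Render the recognized score to MIDI, which can then be synthesized to
    # audio and convolved with a room's acoustic profile in a separate tool.
    score.write("midi", fp="pullman_sheet_music.mid")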