
Archive for the ‘Library of Congress’ Category

Introduction to NISO webinar on ebook preservation

Wednesday, May 23rd, 2012

Below are my welcoming remarks to the NISO webinar on Heritage Lost?: Ensuring the Preservation of Ebooks on May 23rd.

“Good afternoon and welcome to the second part of this NISO Two-Part Webinar on Understanding Critical Elements of E-books: Acquiring, Sharing, and Preserving.  This part is entitled Heritage Lost? Ensuring the Preservation of E-books.

Perhaps because electronic journals were adopted much earlier and more rapidly, we are more familiar with the archiving and preservation of e-journal content than of e-book content. However, just as happened in the late 1990s after e-journals became prevalent, the preservation of e-books is now rising in the minds of people deeply concerned with the long-term preservation of cultural materials.

That is not to say that no one is considering these issues.  Some of the bigger digitization projects involve libraries and as such include preservation as part of their mission.  I’m thinking in particular about the Internet Archive, Portico and the HathiTrust in this regard, but there are certainly others.  Today we’ll hear from two of these groups and what they are doing to support

Another big preservation issue that is frequently overlooked is the model of distribution that many publishers are moving toward, which is a license model rather than a sale model.  I won’t get into either the legal or business rationale for this shift, but I do want to focus on its implications for preservation, and for publishers in particular.  An analogy that I often make to publishers is that of renting a house versus selling a house.  When a publisher sells a house (in this case a book), it passes all the responsibility for the house and its upkeep on to the new owner.  If a person rents that same house, the responsibility for fixing the leaking roof, painting the walls and repairing the broken windows generally falls back to the landlord.  Obviously, there is money to be made, and the terms of the lease affect who is responsible for what, but in general the owner is still the primary person responsible for the major upkeep of the house.

In the case of the sale of a book, the publisher is no longer responsible for that item; its preservation passes to the new owner, say the library.  It is then up to the library to ensure that the book doesn’t fall apart, that the cover stays clean, and that the pages don’t rip.  However, as we move to a license environment, the long-term responsibility for upgrading file formats and for continuing to provide access and functionality falls back to the publisher.  The publisher is the landlord, renting e-books to the library community.  And this responsibility requires a great deal more effort than simply hosting the file.  Publishers will eventually need to repaint, to refurbish, and to fix the broken plumbing, so to speak, of this digital collection.  I expect that this will be no small feat, and something that few publishers are prepared to address.

The Library of Congress has begun thinking about this problem from the perspective of its demand deposit requirement related to copyright registration for LC’s own collection.  While LC is at the moment focused on electronic-only journals, one can envision a scenario where electronic-only books are not that far away.  LC has not explicitly discussed e-book preservation and its current work is focused only on e-journals.  However, the problems that LC is facing are illustrative of the larger issues that it will likely face.  There are standards for journal article formatting using XML, such as the soon-to-be-released Journal Article Tag Suite (JATS), formerly the NLM DTD.  This project, developed by the National Library of Medicine in the US, was specifically focused on developing an archival tagging model for journal article content distribution and preservation.  There is no similar model for books that is widely adopted.  If the variation in journal markup is significant, the variation in book content is likely to be far greater still.

No archive can sustain a stream of ingest from hundreds or thousands of publishers without standards.  It is simply unmanageable to accept any file in any format from thousands of publishers.  And this, of course, is where standards come in.  Although this isn’t at the forefront of either of our presentations today, it does sit there in the not-so-distant background.

There has also been a great deal of focus over the past year on the adoption of the new EPUB 3.0 specification. This is a great advancement and it will certainly help speed adoption of e-books and their overall interoperability with existing systems.  However, it should be clear that EPUB is not designed as an archival format.  Many of the things that would make EPUB 3 archival exist within the structure, but their inclusion by publishers is optional, not mandatory.  In the same way, accessibility and archiving functionality is possible within PDF files, but it is functionality that most publishers don’t take advantage of or implement.  We, as a community, need to develop profiles of EPUB for preservation that publishers can target, if not for their distribution, at least for their long-term preservation purposes, both internally and externally.

It will be a long-term project that we will be engaged in.  And it is something that we need to focus concerted attention on, because preservation isn’t the first thing on content creators’ minds.  However, we should be able to continue to press the issue and make progress on it.”

Mandatory Copyright Deposit for Electronic-only Materials

Thursday, April 1st, 2010

In late February, the Copyright Office at the Library of Congress published a new rule that expands the mandatory deposit requirement to include items published only in digital format.  The interim regulation, Mandatory Deposit of Published Electronic Works Available Only Online (37 CFR Part 202 [Docket No. RM 2009–3]), was released in the Federal Register.  The Library of Congress will focus its first attention on e-only deposit of journals, since this is the area where electronic-only publishing is most advanced.  Very likely, this will move into the space of digital books as well, but it will likely take some time to coalesce.

I wrote a column about this in Against the Grain last September outlining some of the issues that this change will raise.  A free copy of that article is available here.  The Library of Congress is aware of these issues, and will become painfully more so when this stream of online content begins to flow its way.  To support an understanding of these new regulations, LC is hosting a forum in Washington in May to discuss publishers’ technology for providing these data on a regular basis.  Below is the description of the meeting that LC provided.

Electronic Deposit Publishers Forum
May 10-11, 2010
Library of Congress — Washington, DC

The Mandatory deposit provision of the US Copyright Law requires that published works be deposited with the US Copyright Office for use by the Library of Congress in its collection.  Previously, copyright deposits were required only for works published in a physical form, but recently revised regulations now include the deposit of electronic works published only online.  The purpose of this workshop is to establish a submission process for these works and to explore technical and procedural options that will work for the publishing community and the Library of Congress.

Discussion topics will include:

  • Revised mandatory deposit regulations
  • Metadata elements and file formats to be submitted
  • Proposed transfer mechanisms

Space for this meeting is very limited, but if you’re interested in participating in the meeting, you should contact the Copyright Office.

    The Memento Project – adding history to the web

    Wednesday, November 18th, 2009

    Yesterday, I attended the CENDI/FLICC/NFAIS Forum on the Semantic Web: Fact or Myth hosted by the National Archives.  It was a great meeting with an overview of ongoing work, tools and new initiatives.  Hopefully, the slides will be available soon, as there was frequently more information than could be expressed in 20-minute presentations and many listed what are likely useful references for more information.  Once they are available, we’ll link through to them.

    During the meeting, I had the opportunity to run into Herbert Van de Sompel, who is at the Los Alamos National Laboratory.  Herbert has had a tremendous impact on the discovery and delivery of electronic information. He played a critical role in creating the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), the Open Archives Initiative Object Reuse & Exchange specifications (OAI-ORE), the OpenURL Framework for Context-Sensitive Services, the SFX linking server, the bX scholarly recommender service, and info URI.

    Herbert described his newest project, which has just been released, called the Memento Project. The Memento project proposes a “new idea related to Web Archiving, focusing on the integration of archived resources in regular Web navigation.”  As Herbert explained when we chatted briefly, the system uses a browser plug-in to view the content of a page as of a specified date.  It does this by using the underlying content management system’s change logs to recreate what appeared on a site at a given time.  The team has also developed some server-side Apache code that handles these requests for content management systems that have version control.  If the server is unable to recreate the requested page, the system can also point to a version of the content from around that date that exists in the Internet Archive (or other similar archive sites). Herbert and his team have tested this using a few wiki sites.  You can also demo the service from the LANL servers.
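
The lookup described above can be sketched roughly as follows. This is only an illustration of the idea as I understood it from the conversation, not the Memento team's code: the function names (`version_in_effect`, `nearest_snapshot`, `resolve`) and the sample dates are hypothetical, and a site's change log is reduced here to a sorted list of revision timestamps.

```python
from bisect import bisect_right
from datetime import datetime
from typing import Optional, Sequence, Tuple


def version_in_effect(revisions: Sequence[datetime], requested: datetime) -> Optional[datetime]:
    """Return the latest revision made at or before `requested`.

    `revisions` must be sorted ascending. Returns None when the change log
    does not reach back far enough to recreate the page.
    """
    i = bisect_right(revisions, requested)
    return revisions[i - 1] if i > 0 else None


def nearest_snapshot(snapshots: Sequence[datetime], requested: datetime) -> Optional[datetime]:
    """Return the archived snapshot closest in time to `requested`, if any."""
    if not snapshots:
        return None
    return min(snapshots, key=lambda s: abs(s - requested))


def resolve(requested: datetime,
            revisions: Sequence[datetime],
            snapshots: Sequence[datetime]) -> Tuple[str, Optional[datetime]]:
    """Prefer the site's own history; otherwise fall back to an external archive."""
    hit = version_in_effect(revisions, requested)
    if hit is not None:
        return ("origin", hit)
    hit = nearest_snapshot(snapshots, requested)
    return ("archive", hit) if hit is not None else ("none", None)


if __name__ == "__main__":
    # Hypothetical data: two wiki revisions and two external archive snapshots.
    revisions = [datetime(2009, 1, 1), datetime(2009, 6, 1)]
    snapshots = [datetime(2008, 3, 1), datetime(2008, 11, 15)]
    print(resolve(datetime(2009, 3, 15), revisions, snapshots))
    print(resolve(datetime(2008, 10, 1), revisions, snapshots))
```

The first lookup is answered from the site's own revision history; the second predates that history, so the sketch falls back to the archived snapshot closest to the requested date.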

    Here is a link to a presentation on this project that Herbert and Michael Nelson of Old Dominion University (his collaborator on the project) gave at the Library of Congress.  There was also a story about this project.  A detailed paper that describes the Memento solution is also available on the arXiv site.  There is also an article on Memento in the New Scientist.  Finally, tomorrow (November 19, 2009 at 8:00 AM EST), there will be a presentation on this at OCLC as part of their Distinguished Seminar Series, which will be available online for free (RSVP required).

    This is a very interesting project that addresses one of the key problems with archiving web page content, which frequently changes.  I am looking forward to the team’s future work and hoping that the project gets some broader adoption.

    Flickr project at Library of Congress

    Thursday, October 30th, 2008

    Further to the CENDI meeting held yesterday:

    Deanna Marcum was the opening speaker of the meeting and her presentation primarily focused on the report on the Future of Bibliographic Control and her response to the report.  One of the recommendations of that report was that libraries should invest in making available their special collections.  One thing that LC has in abundance is special collections.

    Deanna discussed the pilot project on Flickr to post digitized images on the service and encourage public tagging of the images.  The pilot includes scans of “1,600 color images from the Farm Security Administration/Office of War Information and 1,500+ images from the George Grantham Bain News Service.”  As of today the project has 4,665 items on Flickr.  The group has had great success in getting thousands of people to tag and enrich the images with descriptions.  In bouncing through a number of images, most of them looked like they’d received more than 2,000 views each.  Across 4,665 items, that translates to more than 9 million views, although I could be overshooting the total because of a very small sample size, and I know from my own account that reloaded pages lead to a lot of double-counting.  Regardless, this is a terrific amount of visibility for an image collection that many wouldn’t have been able to see before it was digitized.

    In glancing through the tags that have been added to the images, I expect that there is much that would concern a professional cataloger.  Many of the tags conform to the odd space-less text string convention on Flickr.  Also, from the perspective of making images easier to find, I’d say the results are mixed.  LC will be producing a report of their results “in the next few weeks” (per Deanna).

    Finally, I’m not sure that providing public-domain library content freely to commercial organizations is in the best interests of the contributing library.  This follows on some further consideration of my post yesterday on Google’s settlement with the publishing and author communities over the Google Book project.

    After the meeting, I took the opportunity of being at the LC to see their exhibition on Creating the United States.  Yesterday was the last day of the exhibition, so unfortunately, if you hadn’t seen it already, it will be “a number of years” before LC brings the Jefferson draft of the Declaration of Independence back out of the vaults.  Along with the exhibition on the American founding, they also have on display the Jefferson library collection and the Waldseemüller maps.  These maps, from 1507 and 1516, are among the most important in the history of cartography and were the first to name the landmass across the Atlantic from Europe “America”.  I believe the maps will continue to be on display for some time.  I encourage anyone in the area to stop in and take a look.