
Archive for the ‘cataloging’ Category

ISTC and Ur-Texts

Thursday, April 1st, 2010

Tuesday, I attended a meeting on the International Standard Text Code (ISTC), organized by the Book Industry Study Group (BISG) in Manhattan. The meeting was held in conjunction with the release of a white paper on the ISTC by Michael Holdsworth entitled ISTC: A Work in Progress. It is a terrific paper, well worth reading for anyone interested in this topic, and I commend it to you if you haven’t already seen it. The paper provides a detailed introduction to the ISTC and the role this new identifier will play in our community.

During the meeting, as I was tweeting about the standard, I got into a brief Twitter discussion with John Mark Ockerbloom at the University of Pennsylvania Library. Unfortunately, as wonderful as Twitter is for instantaneous conversation, it is not at all easy to communicate nuance there. For that, a longer form is necessary; hence this blog post.

As a jumping-off point, let us start with the fact that the ISTC has a fairly good definition of what it is identifying: the text of a work as a distinct abstract item that may be the same or different across different products or manifestations. Distinguishing between those changes can be critical, as is tying together the various manifestations for collection development, rights, and product management reasons.

One of the key principles of the ISTC is that:

“If two entities share identical ISTC metadata, they shall be treated as the same textual work and shall have the same ISTC.”

Where to draw this distinction is quite an interesting point. As John pointed out in his question to me, “How are works with no definitive original text handled? (e.g. Hamlet) Is there an #ISTC for some hypothetical ur-Hamlet?” The issue here is that there are multiple “original versions” of the text of Hamlet. Quoting from Wikipedia: “Three different early versions of [Hamlet] have survived: these are known as the First Quarto (Q1), the Second Quarto (Q2) and the First Folio (F1). Each has lines, and even scenes, that are missing from the others.”

In this case, each of the three versions would be assigned its own ISTC, since the text of each version is different. They could be noted as related to the other ISTCs (as well as to the cascade of other related editions) in the descriptive metadata fields. Hamlet is a perfect example of where the ISTC could be of critical value: anyone with an interest in the variances among the three versions would want to know which text is the basis of the copy of Hamlet they are purchasing, given the significant differences between them.
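
To make that identity principle concrete, here is a minimal sketch in Python of how a registry might apply it. The field names, normalization, and “ISTC” strings are invented for illustration; they are not the actual elements or numbering scheme defined in ISO 21047.

```python
# Hypothetical sketch of the ISTC identity rule: identical registration
# metadata => same textual work => same ISTC. Field names and identifiers
# are illustrative, not the actual elements defined in ISO 21047.

import hashlib

def canonical_key(metadata: dict) -> str:
    """Reduce a registration record to a normalized, order-independent key."""
    normalized = {k: str(v).strip().lower() for k, v in sorted(metadata.items())}
    return hashlib.sha256(repr(normalized).encode("utf-8")).hexdigest()

class ToyRegistry:
    """A toy registry: one ISTC per distinct set of registration metadata."""
    def __init__(self):
        self._by_key = {}
        self._counter = 0

    def register(self, metadata: dict) -> str:
        key = canonical_key(metadata)
        if key not in self._by_key:           # unseen metadata -> new ISTC
            self._counter += 1
            self._by_key[key] = f"ISTC-TOY-{self._counter:04d}"
        return self._by_key[key]              # identical metadata -> same ISTC

registry = ToyRegistry()
q1 = registry.register({"title": "Hamlet", "version": "First Quarto (1603)"})
q1_again = registry.register({"title": "Hamlet", "version": "First Quarto (1603)"})
q2 = registry.register({"title": "Hamlet", "version": "Second Quarto (1604)"})
assert q1 == q1_again and q1 != q2
```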

Perhaps the most stringent solution, in keeping with the letter of the standard, would be to treat the First Quarto as the original, since it was the first known to be published: it was the first to appear in the Stationers’ Register, in 1602, although it was likely not published until the summer or fall of 1603. The Second Quarto and First Folio were published later, in 1604 and 1623 respectively. Although the First Quarto is often considered “inferior” to the later versions, assigning it the “Source” ISTC would be no different than if it were published today and subsequently re-published as a revision (which would be assigned a related ISTC). Controversy about the source text of Hamlet probably began not long after the day it was published and has certainly grown as the field of Shakespeare scholarship has grown, but for the purposes of identification and linking, does the “Ur-text” matter?

Certainly, a user would want to know whether the copy in hand is based on the canonical version, be that the Second Quarto or the First Folio. The critical point is that we identify things differently when there are important reasons to make the distinctions. In the case of Hamlet, there is a need to make the distinction. Which copy is considered “original” and which is a derivative isn’t nearly as important as making the distinction itself.

It is valuable to note the description in the ISTC User’s Manual in the section on original works and derivations. Quoting from the Manual:

7.1    What is an “original” work?

For the purposes of registration on the ISTC database, a work may be regarded as being “original” if it cannot be adequately described using one or more of the controlled values allowed for the “Derivation Type” element (specified elsewhere in this document).

A work is considered to be “original” for registration purposes unless it replicates a significant proportion of a previously existing work or it is a direct translation of the previously existing one (where all the words may be different but the concepts and their sequence are the same). It should be noted that this is a different approach from that used by FRBR, which regards translations as simply different “expressions” of the same work.

The “Source ISTC” metadata field is an optional one and is “Used to identify the original work(s) from which this one is derived (where appropriate). It is recommended that these are provided whenever possible.” In the case of the three Hamlet “original versions,” this field would likely be left blank, since there is no way to distinguish between the “Original” and the “Derivation.” Each of the three versions could be considered “Original,” though things get messy if no single version is designated as such. There is a “Derivation Type” metadata field with restricted values, although “Unspecified” is one option. Since there isn’t necessarily much value in the “original” distinction here, there is little point in arguing about which version is original. In practice, the “original” will likely be whichever version receives an ISTC assignment first.
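
For illustration only, the three Hamlet registrations might then look something like the sketch below. The ISTC values, field names, and structure are invented; the actual registration format and element names are defined by the standard and the International ISTC Agency.

```python
# Illustrative only: the three early Hamlet texts registered as three
# separate works, each cross-referenced to the others, with the optional
# "Source ISTC" left blank and the derivation type unspecified.

hamlet_q1 = {
    "istc": "0A9-2010-00000001-X",        # invented identifier
    "title": "Hamlet (First Quarto, 1603)",
    "source_istc": None,                   # optional field left blank
    "derivation_type": "Unspecified",
    "related_istcs": ["0A9-2010-00000002-X", "0A9-2010-00000003-X"],
}
hamlet_q2 = {
    "istc": "0A9-2010-00000002-X",
    "title": "Hamlet (Second Quarto, 1604)",
    "source_istc": None,
    "derivation_type": "Unspecified",
    "related_istcs": ["0A9-2010-00000001-X", "0A9-2010-00000003-X"],
}
hamlet_f1 = {
    "istc": "0A9-2010-00000003-X",
    "title": "Hamlet (First Folio, 1623)",
    "source_istc": None,
    "derivation_type": "Unspecified",
    "related_istcs": ["0A9-2010-00000001-X", "0A9-2010-00000002-X"],
}
```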

The same problem will likely arise with a variety of other texts, especially those from distant historical periods. A focus on core principles (distinguishing what is important and disambiguating where it counts), while avoiding the philosophical arguments surrounding “original” versus “derivative” (just as the ISTC community is trying to avoid arguments over “ownership” of the record), will help serve the entire community.

NISO provides a lot more information about the ISTC. Members and subscribers can read the article that Andy Weissberg, VP of Identifier Services & Corporate Marketing at Bowker, wrote in Information Standards Quarterly last summer, The International Standard Text Code (ISTC): An Overview and Status Report. For non-subscribers, Andy Weissberg also presented during the 2009 NISO-BISG Changing Standards Landscape forum prior to ALA’s Annual Conference in Chicago. You can view his presentation slides or watch the video from that meeting.

The International ISTC Agency Ltd is a not-for-profit company, limited by guarantee and registered in England and Wales. Its sole purpose is to implement and promote the ISO 21047 (ISTC) standard and it is operated by representatives of its founding members, namely RR Bowker, CISAC, IFRRO, and Nielsen Book Services.

The first edition of “ISO 21047 Information and Documentation – International Standard Text Code (ISTC)” was published by ISO in March 2009. It is available for purchase in separate English and French versions either as an electronic download or printed document from ISO.

Life partners with Google to post photo archive online

Wednesday, December 3rd, 2008

Life magazine, which ceased regular publication in April 2007, has partnered with Google to digitize and post the magazine’s vast photo archive. Most of the collection has never been seen publicly and amounts to a huge swath of America’s visual history since the 1860s. The release of the collection was announced on the Google Blog. The first part of the collection is now online, with the remaining 80% to be digitized over the next “few months.” Of course, this does not mean that all images in Life will be online, only those that were produced by the staff photographers (i.e., where Life holds the copyright), not the famous freelancers.

I can find no mention anywhere of money changing hands, either from Google for the rights or as a revenue stream to support the ongoing work, although one can purchase prints of the images. From a post on this at paidcontent.org:

As for Time Inc.’s hopes, Life president Andy Blau explains: “We did this deal for really one reason, to drive traffic to Life.com. We wanted to make these images available to the greater public … everything else from that is really secondary.”

While exploring the collection, I also noticed Google’s Image Labeler, a game for tagging images. The goal of the game is to earn points by matching your tags with those of another random player when you both see the same image. The game was launched in September 2006. While I only spent about 5 minutes using it, what is truly scary is the number of points racked up by the “all-time leaders.” As of today, “Yew Half Maille” had collected 31,463,230 points. Considering that I collected about 4,000 points in my 5 minutes, how much time are people spending doing this?
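
As a back-of-envelope calculation, and assuming (a big assumption) that the leader earns points at roughly the same rate I did:

```python
# Rough estimate only, assuming the all-time leader scores at roughly my
# rate of about 4,000 points per 5 minutes (~800 points per minute).
leader_points = 31_463_230
points_per_minute = 4_000 / 5
minutes = leader_points / points_per_minute
print(f"{minutes:,.0f} minutes, or about {minutes / 60:,.0f} hours "
      f"({minutes / 60 / 24:.0f} days) of nonstop tagging")
# -> roughly 39,000 minutes, i.e. about 655 hours or 27 days
```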

Changing the ideas of a catalog: Do we really need one?

Wednesday, November 19th, 2008

Here’s one last post on thoughts regarding the Charleston Conference.

Friday afternoon during the Charleston meeting, Karen Calhoun, Vice President, WorldCat and Metadata Services at OCLC, and Janet Hawk, Director, Market Analysis and Sales Programs at OCLC, gave a joint presentation entitled Defining Quality As If End Users Matter: The End of the World As We Know It (link to the presentations page; the actual presentation is not up yet). While this program focused on the needs, expectations, and desired functionality of users of WorldCat, an underlying theme came through to me that could have deep implications for the community.

“Comprehensive, complete and accurate.” I expect that every librarian, and catalogers in particular, would strive to achieve these goals with regard to the information about their collection. The management of the library would likely add cost-effective and efficient to this list as well. These goals have driven a tremendous amount of effort at almost every institution when building its catalog. Information is duplicated, entered into systems (be they card catalogs, ILSs, or ERM systems), maintained, and eventually migrated to new systems. However, is this the best approach?

When you log into a web page such as Yahoo or the Washington Post, or a service like Netvibes or Pageflakes, what you are presented with is not information culled from a single source, or even two or three. On my Netvibes landing page, I have information pulled from no fewer than 65 feeds, some mashed up, some straight RSS feeds. Possibly (probably), the information in those feeds is itself derived from dozens of other systems. Increasingly, what the end user sees might feel like an integrated and cohesive experience; on the back end, however, the page is drawing from multiple sources, multiple formats, multiple streams of data. Those data streams can be aggregated, merged, and mashed up to provide any number of user experiences. And yet, building a catalog has been an effort to build a single all-encompassing system, with data integrated and combined in one place. It is little wonder that developing, populating, and maintaining these systems requires tremendous amounts of time and effort.
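
As a concrete illustration of that aggregation pattern, here is a minimal sketch. The feeds, entries, and field names are invented, but the point is that differently shaped streams are normalized and merged into a single view only at display time.

```python
# Minimal sketch of the aggregation pattern described above: independent
# data streams, each in its own shape, merged into one view on demand.
# The feed contents and field names here are invented for illustration.
from datetime import datetime

feed_a = [{"title": "New ISTC white paper released", "published": "2010-03-29"}]
feed_b = [{"headline": "LC Flickr pilot update", "date": "2010-03-30"}]

def normalize(entry: dict, source: str) -> dict:
    """Map differently shaped entries onto one common record structure."""
    return {
        "source": source,
        "title": entry.get("title") or entry.get("headline"),
        "published": datetime.fromisoformat(entry.get("published") or entry.get("date")),
    }

merged = [normalize(e, "feed_a") for e in feed_a] + \
         [normalize(e, "feed_b") for e in feed_b]

# Present one cohesive, date-ordered view built from several back-end sources.
for item in sorted(merged, key=lambda r: r["published"], reverse=True):
    print(item["published"].date(), item["source"], "-", item["title"])
```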

Karen’s and Janet’s presentation last week provided some interesting data about the enhancements that different types of users would like to see in WorldCat and WorldCat Local. The key takeaway was that there are different users of the system, with different expectations, needs, and problems. Patrons have one set of problems and desired enhancements, while librarians have another. Neither is right or wrong; they represent different sides of the same coin: what a user wants depends entirely on what they need and expect from a service. This is as true for banking and auto repair as it is for ILS systems and metasearch services.

[Image: Putting together the pieces.]

Karen’s presentation followed interestingly from another session that I attended on Friday, in which Andreas Biedenbach, eProduct Manager Data Systems & Quality at Springer Science + Business Media, spoke about the challenges of supplying data from a publisher’s perspective. Andreas manages a team that distributes metadata and content to a wide variety of users of Springer data. This includes libraries, but also a diverse range of other organizations such as aggregators, A&I services, preservation services, link resolver suppliers, and even Springer’s own marketing and web site departments. Each of these users of the data that Andreas’ team supplies has its own requirements, formats, and business terms governing the use of the data. The streams range from complicated XML feeds to simple comma-separated text files, each in its own format, some standardized, some not. It is little wonder there are gaps in the data, non-conformance, or format issues. The problem is not a lack of appropriate or well-developed standards so much as a lack of conformance, consistent use, and rationalization. We as a community cannot continue to fulfill customer-specific requests for data that is distributed into the community.

Perhaps the two problems have a related solution. Rather than the community moving data from place to place and populating their own systems with data streams from a variety of authoritative sources, could a solution exist where data streams are merged together in a seamless user interface? There was a session at ALA Annual hosted by OCLC on the topic of mashing up library services. Delving deeper, rather than entering or populating library services with gigabytes and terabytes of metadata about holdings, might it be possible to have entire catalogs that are mashed-up combinations of information drawn from a range of other sources? The only critical information a library might need to hold is an identifier (ISBN, ISSN, DOI, ISTC, etc.) for each item it holds, drawing additional metadata from other sources on demand. Publishers could supply a single authoritative data stream to the community, which could be combined with other data to provide a custom view of the information based on the user’s needs and engagement. Content is regularly manipulated and represented in a variety of ways by many sites; why can’t we do the same with library holdings and other data?
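
Here is a minimal sketch of what such an identifier-first catalog might look like; the source names and lookup functions are hypothetical stand-ins for real publisher feeds and shared catalog services.

```python
# Sketch of the idea above: the library keeps only identifiers for its
# holdings and pulls descriptive metadata from external sources on demand.
# The source names and lookup functions are hypothetical placeholders.

HOLDINGS = ["9780143104889", "0028-0836"]   # e.g. an ISBN and an ISSN

def lookup_publisher_feed(identifier):
    """Placeholder for a query against a publisher-supplied metadata stream."""
    return {"title": f"Title for {identifier}"}          # stubbed response

def lookup_union_catalog(identifier):
    """Placeholder for a query against a shared/union catalog service."""
    return {"subjects": ["example subject"]}             # stubbed response

def build_display_record(identifier):
    """Merge metadata from several authoritative sources into one view."""
    record = {"identifier": identifier}
    for source in (lookup_publisher_feed, lookup_union_catalog):
        record.update(source(identifier))
    # Only institution-specific facts (location, cost, local notes) would
    # need to live locally alongside the identifier.
    return record

for item_id in HOLDINGS:
    print(build_display_record(item_id))
```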

Of course, there are limitations to how far this could go: what about unique special-collections holdings, physical location information, cost, and other institution-specific data? However, if the workload of librarians could be reduced in significant measure by mashing up data rather than replicating it in hundreds or thousands of libraries, perhaps it would free up time to focus on other services that add greater value for patrons. Similarly, simplifying the information flow out of publishers would reduce errors and incorrect data, as well as reduce costs.

Flickr project at Library of Congress

Thursday, October 30th, 2008

Further to the CENDI meeting held yesterday:

Deanna Marcum was the opening speaker of the meeting, and her presentation focused primarily on the report on the Future of Bibliographic Control and her response to it. One of the recommendations of that report was that libraries should invest in making their special collections available. One thing LC has in abundance is special collections.

Deanna discussed the pilot project on Flickr to post digitized images on the service and encourage public tagging of the images. The pilot includes scans of “1,600 color images from the Farm Security Administration/Office of War Information and 1,500+ images from the George Grantham Bain News Service.” As of today the project has 4,665 items on Flickr. The group has had great success in getting thousands of people to tag and enrich the images with descriptions. Clicking through a number of the images, most looked like they’d received more than 2,000 views each. That translates to more than 9 million views in total (although I could be overshooting the total given my very small sample size), and I know from my own account that page reloads lead to a lot of double-counting. Regardless, this is a terrific amount of visibility for an image collection that many people would not have been able to see before it was digitized.

In glancing through the tags that have been added to the images, I expect there is much that would concern a professional cataloger. Many of the tags follow Flickr’s odd convention of space-less text strings. Also, from the perspective of making images easier to find, I’d say the results are mixed. LC will be producing a report on its results “in the next few weeks” (per Deanna).

Finally, I’m not sure that providing public-domain library content freely to commercial organizations is in the best interests of the contributing library. This follows on some further consideration of my post yesterday on Google’s settlement with the publisher and author communities over the Google Book project.

After the meeting, I took the opportunity of being at LC to see the exhibition Creating the United States. Yesterday was the last day of the exhibition, so unfortunately, if you hadn’t seen it already, it will be “a number of years” before LC brings the Jefferson draft of the Declaration of Independence back out of the vaults. Along with the exhibition on the American founding, they also have on display the Jefferson library collection and the Waldseemüller maps. The maps are among the most important in the history of cartography, being the first to name the landmass across the Atlantic from Europe “America,” in 1507 and 1516. I believe the maps will continue to be on display for some time. I encourage anyone in the area to stop in and take a look.