
Archive for the ‘metadata’ Category

NISO response to the National Science Board on Data Policies

Wednesday, January 18th, 2012

Earlier this month, the National Science Board (NSB) announced it was seeking comments from the public on Digital Research Data Sharing and Management, the report of its Committee on Strategy and Budget’s Task Force on Data Policies.  That report was distributed last December.

NISO has prepared a response on behalf of the standards development community, which was submitted today.  Here are some excerpts of that response:

The National Science Board’s Task Force on Data Policies comes at a watershed moment in the development of an infrastructure for data-intensive science based on sharing and interoperability. The NISO community applauds this effort and the focused attention on the key issues related to a robust and interoperable data environment.

….

NISO has particular interest in Key Challenge #4: The reproducibility of scientific findings requires that digital research data be searchable and accessible through documented protocols or methods. Beyond its historical involvement in these issues, NISO is actively engaged in forward-looking projects related to data sharing and data citation. NISO, in partnership with the National Federation of Advanced Information Services (NFAIS), is nearing completion of a best practice for how publishers should manage supplemental materials that are associated with the journal articles they publish. With a funding award from the Alfred P. Sloan Foundation and in partnership with the Open Archives Initiative, NISO began work on ResourceSync, a web protocol to ensure large-scale data repositories can be replicated and maintained in real-time. We’ve also had conversations with the DataCite group about formal standardization of their IsCitedBy specification. [Todd Carpenter serves] as a member of the ICSTI/CODATA task force working on best practices for data citation, and NISO looks forward to promoting and formalizing any recommendations and best practices that derive from that work.

….

We strongly urge that any further development of data-related best practices and standards take place in neutral forums that engage all relevant stakeholder communities, such as the one that NISO provides for consensus development. As noted in Appendix F of the report, Summary Notes on Expert Panel Discussion on Data Policies, standards for descriptive and structural metadata and persistent identifiers for all people and entities in the data exchange process are critical components of an interoperable data environment. We cannot agree more with this statement from the report of the meeting: “Funding agencies should work with stakeholders and research communities to support the establishment of standards that enable sharing and interoperability internationally.”

There is great potential for NSF to expand its leadership role in fostering well-managed use of data. This would include not only support of the repository community, but also promulgation of community standards. In partnership with NISO and using the consensus development process, NSF could support the creation of new standards and best practices. More importantly, NSF could, through its funding role, advocate for, or even require, researchers’ use of these broad community standards and best practices in the dissemination of their research. We note that there are more than a dozen references to standards in the Digital Research Data Sharing and Management report, so we are sure that this point is not falling on unreceptive ears.

The engagement of all relevant stakeholders in the establishment of data sharing and management practices as described in Recommendation #1 is critical in today’s environment, at both the national and international levels. While the promotion of individual communities of practice is laudable, it does present problems when it comes to systems interoperability. A robust system of data exchange must, by default, be grounded in a core set of interoperable data. More often than not, computational systems will need to act with a minimum of human intervention to be truly successful. This approach does not require a single schema or metadata system for all data, which would of course be impossible and unworkable. However, a focus on and inclusion of core data elements and common base-level data standards is critical. For example, geo-location, bibliographic information, identifiers and discoverability data could all easily be standardized to foster interoperability. Domain-specific information can be layered over this base of common and consistent data in a way that maintains domain specificity without sacrificing interoperability.
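To make the layering idea concrete, here is a minimal, hypothetical sketch of such a record in Python; the field names and the split between core and domain-specific elements are invented for illustration and do not reflect any existing NISO or NSF schema.

```python
# Hypothetical example: a small core of common, easily standardized elements
# (identifier, bibliographic data, geo-location, discovery keywords) that any
# system can parse, with domain-specific detail layered alongside it.
core_record = {
    "identifier": "doi:10.xxxx/example",          # persistent identifier (placeholder)
    "title": "Example streamflow dataset",
    "creators": ["J. Researcher"],
    "geolocation": {"lat": 38.8977, "lon": -77.0365},
    "keywords": ["hydrology", "streamflow"],
}

domain_extension = {
    # Discipline-specific fields: opaque to generic tools, but harmless to
    # them because the core elements above remain intact.
    "profile": "hypothetical-hydrology-profile-v1",
    "gauge_height_units": "meters",
    "sampling_interval_seconds": 900,
}

# The full record keeps the interoperable core at the top level and tucks
# the domain-specific layer under a single key.
full_record = {**core_record, "domain": domain_extension}
```

A generic harvester could index every record on the core fields alone, while a discipline-specific portal could additionally exploit the domain layer.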

One of the key problems that the NSB and the NSF should work to avoid is the proliferation of standards for the exchange of information. This is often the butt of standards jokes, but in reality it does create significant problems. It is commonplace for communities of interest to review the landscape of existing standards and determine that the existing standards do not meet their exact needs. Such a community then proceeds to duplicate seventy to eighty percent of existing work to create a specification that is custom-tailored to its specific needs but not necessarily compatible with existing standards. In this way, standards proliferate and complicate interoperability. The NSB is uniquely positioned to help avoid this unnecessary and complicating tendency. Through its funding role, the NSB should promote the application, use and, if necessary, extension of existing standards. It should aggressively work to avoid the creation of new standards when relevant standards already exist.

The sharing of data on a massive scale is a relatively new activity, and we should be cautious in declaring fixed standards at this stage. It is conceivable that standards may not exist to address some of the issues in data sharing, or that it may be too early in the lifecycle for standards to be promulgated in the community. In that case, lower-level consensus forms, such as consensus-developed best practices or white papers, could advance the state of the art without inhibiting the advancement of new services, activities or trends. The NSB should promote these forms of activity as well, when standards development is not yet an appropriate path.

We hope that this response is well received by the NSB in the formulation of its data policies. There is terrific potential in creating an interoperable data environment, but that system will need to be based on standards and rely on best practices within the community to be fully functional. The scientific community, in partnership with the library, publisher and systems provider communities, can collectively help to create this important infrastructure. Its potential can only be helped by consensus agreement on base-level technologies. If development continues along a domain-centered path, the goal of interoperability, and delivery on its potential, will only be delayed and quite possibly harmed.

The full text PDF of the entire response is available here.  Comments from the public related to this document are welcome.

Trust but verify: Are you sure this document is real?

Tuesday, November 3rd, 2009

Continuing on the theme of the “leaked” document from a systems supplier in the community that was posted last week: one thing that few initially asked about this document is, “Is it real?”  In this case, not 24 hours after the document was “released”, the author confirmed that he had written it and that it had been circulating for some time. However, it is amazing what a stir can be started by posting a PDF document anonymously on the Wikileaks website, regardless of its provenance.

Last week was the 40th anniversary of the “birth” of the internet, when two computers were first connected using a primitive router and the first message was transmitted between them: “Lo”.  They were trying to send the command “Login”, but the systems crashed before the full message was sent. Later that evening, they were able to get the full message through, and with that the internet, in a very nascent form, was born.  During a radio interview that week, Dr. Leonard Kleinrock, Professor of Computer Science at UCLA, who was one of the scientists working on those systems that night, spoke about the event.  During one of the questions, Dr. Kleinrock was asked about the adoption of IP version 6. His response was quite fascinating:

Dr. KLEINROCK: Yes. In fact, in those early days, the culture of the Internet was one of trust, openness, shared ideas. You know, I knew everybody on the Internet in those days and I trusted them all. And everybody behaved well, so we had a very easy, open access. We did not introduce any limitations nor did we introduce what we should have, which was the ability to do strong user authentication and strong file authentication. So I know that if you are communicating with me, it’s you, Ira Flatow, and not someone else. And if you send me a file, I receive the file you intended me to receive.

We should’ve installed that in the architecture in the early days. And the first thing we should’ve done with it is turn it off, because we needed this open, trusted, available, shared environment, which was the culture, the ethics of the early Internet. And then when we approach the late ‘80s and the early ‘90s and spam, and viruses, and pornography and eventually the identity theft and the fraud, and the botnets and the denial of service we see today, as that began to emerge, we should then slowly have turned on that authentication process, which is part of what your other caller referred to is this IPV6 is an attempt to bring on and patch on some of this authentication capability. But it’s very hard now that it’s not built deep into the architecture of the Internet.

The issue of provenance has been a critical gap in the structure of the internet from the very beginning.  At the outset, when the number of computers and people connected to the network was small, authentication and validation looked like unnecessary barriers to a working system.  If you know and trust everyone in your neighborhood, locking your doors is an unnecessary hassle.  In a large city, where you don’t know all of your neighbors, locking your doors is a critical routine that becomes second nature.  In our digital environment, the community has gotten so large that locking doors, authenticating, and using passwords to ensure you are who you claim to be are essential to a functioning community.

Unfortunately, as Dr. Kleinrock notes, we are in a situation where we need to patch some of the authentication and provenance holes in our digital lives.  This brings me back to the document that was distributed last week via Wikileaks.

There is an important need, particularly in the legal and scientific communities, for provenance to be assured.  With digital documents, which are easily manipulated or created and distributed anonymously, confirming the author and source of a document can be difficult.  Fortunately, in this case, the authorship could be, and was, confirmed easily and quickly.  However, in many situations this is not the case, particularly for forged or manipulated documents.  Even when denials are issued, there is no way to prove the negative to a doubtful audience.

The tools for creating extremely professional-looking documents are ubiquitous.  Indeed, the same software that most publishing companies use to create formally published documents is available to almost anyone with a computer.  It would not be difficult to create one’s own “professional” documents and distribute them as real.  The internet is full of hoaxes of this sort, and they run the gamut from absurd, to humorous, to quite damaging.

There have been discussions about the need for better online provenance information for nearly two decades now. While some work on provenance metadata, including PREMIS, METS and DCMI, is gaining broader adoption, significant standards work remains regarding the authenticity of documents.  The US Government, through the Government Printing Office (GPO), has made progress with the GPO Seal of Authenticity and digital signature/public key technology in Acrobat v. 7.0 & 8.0.  In January 2009, GPO digitally signed and certified PDF files of all versions of Congressional bills introduced during the 111th and 110th Congresses. Unfortunately, these types of authentication technologies have not been broadly adopted outside the government.  The importance of provenance metadata was also re-affirmed in a recent Arizona Supreme Court case.
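To illustrate the public-key mechanism underlying this kind of authentication, here is a minimal sketch, not GPO’s actual workflow: it assumes Python with the third-party cryptography package, an RSA signing key, and an invented function name.

```python
# Illustrative only: verify a detached RSA signature over a document's bytes
# against a publisher's known public key.
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import padding

def document_is_authentic(document: bytes, signature: bytes, signer_public_key_pem: bytes) -> bool:
    """Return True only if the signature was produced by the holder of the
    private key matching signer_public_key_pem, over exactly these bytes."""
    public_key = serialization.load_pem_public_key(signer_public_key_pem)
    try:
        public_key.verify(
            signature,
            document,
            padding.PKCS1v15(),   # common RSA signature padding
            hashes.SHA256(),      # hash of the document that was signed
        )
        return True
    except InvalidSignature:
        # Any alteration to the document, or a signature from a different key,
        # lands here -- provenance cannot be confirmed.
        return False
```

Embedding the signature and certificate inside the PDF itself, as GPO does, lets a reader application such as Acrobat run this kind of check automatically.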

Although it might not help in every case, knowing the source of a document is crucial in assessing its validity.  Until standards are broadly adopted and relied upon, a word of warning to the wise about content on the Internet: “Trust but verify.”

Life partners with Google to post photo archive online

Wednesday, December 3rd, 2008

Life magazine, which ceased as an ongoing publication in April of 2007, has partnered with Google to digitize and post the magazine’s vast photo archive.  Most of the collection has never been seen publicly and amounts to a huge swath of America’s visual history since the 1860s.   The release of the collection was announced on the Google Blog.  The first part of the collection is now online, with the remaining 80% being digitized over the next “few months”.  Of course, this does not mean that all images in Life will be online, only those that were produced by the staff photographers (i.e., where Life holds the copyright), not the famous freelancers.

I can find no mention anywhere of money changing hands, either from Google for the rights or as a revenue stream to support the ongoing work, although one can purchase prints of the images.  From a post on this at paidcontent.org:

  Time Inc.’s hopes, Life president Andy Blau explains: “We did this deal for really one reason, to drive traffic to Life.com. We wanted to make these images available to the greater public … everything else from that is really secondary.”  

While exploring the collection, I also noticed Google’s Image Labeler, a game for tagging images.  The goal of the game is to earn points by matching your tags with those of another random player when you are both shown the same image.  The game was launched in September of 2006. While I spent only about 5 minutes using it, what is truly scary is the number of points racked up by the “all time leaders”. As of today, “Yew Half Maille” had collected 31,463,230 points. Considering that I collected about 4,000 points in my 5 minutes, how much time are people spending doing this?
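A back-of-envelope calculation, assuming the top player earns points at roughly the same rate I did (about 4,000 points per 5 minutes), gives a sense of the time involved:

```python
# Rough estimate only: assumes the top player's scoring rate matches mine.
top_score = 31_463_230
points_per_minute = 4_000 / 5            # ~800 points per minute

minutes = top_score / points_per_minute
hours = minutes / 60
days = hours / 24
print(f"{minutes:,.0f} minutes ≈ {hours:,.0f} hours ≈ {days:,.1f} days of nonstop tagging")
# roughly 39,300 minutes, or about 655 hours -- nearly a month of continuous play
```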

From the Charleston conference: On Trust

Friday, November 7th, 2008

I’m at the Charleston Library Conference this week.  As always, it’s a great meeting with terrific presentations and hallway conversations.  NISO is well represented on the program, with discussions of SUSHI, I2, ONIX-PL and JAV among others.

The unofficial theme of this week’s meeting seems to be trust.  Wednesday night over dinner, I had a philosophical discussion with Mark Kurtz at BioOne and Pete Binfield at PLOS about the core value-added services that publishers provide.  One point made during the conversation was that certification and validation are among the greatest services that publishers add to the publication process.  In a world where the tools and platforms to self-publish are ubiquitous and easily applied, so that “publishing” no longer needs to involve a publisher, what value do publishers bring to this process? Validation and certification are critical, but so is readers’ reliance on that process to more easily gauge what is worth reading.

Geoff Bilder spoke yesterday morning about trust heuristics and how readers gauge what is worth reading.  One of his points during the presentation was that, with the increasing breadth and depth of published information, researchers need quick and easily understood signals regarding quality. This echoes the theme of my post on James J. O’Donnell’s presentation at the ARL members meeting.

Geoff suggested that logos be developed to provide information about the quality of a particular article and the types and stages of review or vetting that the article had gone through. The logo could also contain machine-readable metadata, which would provide information about the type and rigor of the review applied in the publication process.  Geoff has been exploring this as a potential new activity at CrossRef. My sense is that there is a great deal of value in this approach and that it is worthy of support in the community.
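To give a flavor of what such machine-readable metadata might carry, here is a purely hypothetical sketch; the field names are invented for illustration and are not drawn from CrossRef or any existing specification.

```python
# Hypothetical review metadata that a quality logo could carry or point to.
# Field names and values are invented for illustration only.
review_metadata = {
    "article_doi": "10.xxxx/example",          # placeholder identifier
    "peer_review": {
        "type": "double-blind",                # e.g. "open", "single-blind"
        "rounds": 2,
        "reviewers": 3,
    },
    "vetting_stage": "accepted-after-revision",
    "asserted_by": "Example Publisher",
    "asserted_on": "2008-11-07",
}
```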

More from the conference tomorrow.

Flickr project at Library of Congress

Thursday, October 30th, 2008

Further to the CENDI meeting held yesterday:

Deanna Marcum was the opening speaker of the meeting, and her presentation focused primarily on the report on the Future of Bibliographic Control and her response to it.  One of the recommendations of that report was that libraries should invest in making their special collections available.  One thing that LC has in abundance is special collections.

Deanna discussed the pilot project on Flickr to post digitized images on the service and encourage public tagging of the images.  The pilot includes scans of “1,600 color images from the Farm Security Administration/Office of War Information and 1,500+ images from the George Grantham Bain News Service.”  As of today the project has 4,665 items on Flickr.  The group has had great success in getting thousands of people to tag and enrich the images with descriptions.  In bouncing through a number of images, most of them looked like they’d received more than 2,000 views each.  That translates to more than 9 million views (4,665 images × 2,000 views apiece), although I could be overshooting the total because of my very small sample size, and I know from my own account that there’s a lot of double-counting from reloading of pages.  Regardless, this is a terrific amount of visibility for an image collection that many wouldn’t have been able to see before it was digitized.

In glancing through the tags that have been added to the images, I expect that there is much that would concern a professional cataloger.  Many of the tags conform to Flickr’s odd space-less text string convention.  Also, from the perspective of making images easier to find, I’d say the results are mixed.  LC will be producing a report of their results “in the next few weeks” (per Deanna).

Finally, I’m not sure that providing public-domain library content freely to commercial organizations is in the best interests of the contributing library.  This follows on some further consideration of my post yesterday on Google’s settlement with the publishing and authors communities over the Google Book project.

After the meeting, I took the opportunity of being at the LC to see their exhibition on Creating the United States.  Yesterday was the last day of the exhibition, so unfortunately, if you hadn’t seen it already, it will be “a number of years” before LC brings the Jefferson draft of the Declaration of Independence back out of the vaults.  Along with the exhibition on the American founding, they also have on display the Jefferson library collection and the Waldseemüller maps.  Those maps are among the most important in the history of cartography: they were the first, in 1507 and 1516, to name the landmass across the Atlantic from Europe “America.”  I believe the maps will continue to be on display for some time.  I encourage anyone in the area to stop in and take a look.

CENDI Meeting on Metadata and the future of the iPod

Wednesday, October 29th, 2008

I was at the CENDI meeting today to speak about metadata and new developments related to it. There were several great presentations during the morning, some of them worthy of additional attention. My particular presentation is here.

The presenter prior to me was Dr. Carl Randall, Project Officer from the Defense Technical Information Center (DTIC). Carl’s presentation was excellent. He spoke to the future of search and a research report that he wrote, Current Searching Methodology And Retrieval Issues: An Assessment. Carl ended his presentation with a note about an article he’d just read entitled Why the iPod is Doomed, written by Kevin Maney for portfolio.com.

The article focuses on why the iPod is doomed. The author posits that the technology of the iPod is outdated and will soon be replaced by online “cloud” computing services. To paraphrase the article: the more entrenched a business is, the less likely it will be able to change when new competitors arise to challenge its existing model.

Another great quote from the article: “In tech years, they [i.e., the iPod and iTunes] are older than Henry Kissinger.”

I don’t quibble with the main tenet of the article: that services will move to the web and that we will think it quaint to have had to purchase content, download individual songs, and then carry those songs around on hard drives that store the files locally. The iPod hardware and the iTunes model of by-the-drink downloads are both likely to have limited futures. I do think that Apple, through the iPhone, is probably better placed than anyone else to transition its iTunes service to a subscription or cloud-based model. The article dismisses this as unlikely because Apple hasn’t talked about it, which ignores the fact that Apple never talks about its plans until it is ready to announce a product or service.

As we move to an era of “cloud” computing, where both applications and content are hosted on the network rather than on individual devices, it is likely that people will want to purchase subscription access to all content on demand, as opposed to only the limited content that they specifically purchase.

A subscription model also provides new opportunities to expose users to new content. From my perspective, despite having over 10,000 songs in my iTunes library, I’ve been reluctant to purchase new content that I wasn’t already familiar with. I have used LastFM and other services (anyone remember FM radio?) to become acquainted with new music. Part of the reason is that the barrier for me is time rather than cost, but I expect that perceived cost is the issue for many potential users. I say “perceived” because much research and practical experience shows that consumers will pay more for ongoing subscription services than they will in one-time up-front costs.

Moving content to the “cloud” provides many opportunities for content providers to exercise a measure of control that had been lost. By hosting files rather than distributing them (streaming as distinct from downloading, for example), content providers have a greater ability to control distribution. Access becomes an issue of authentication and rights management, as opposed to DRM wrapping and other more onerous and intrusive approaches. Many of us have become quite comfortable with renting movies through Blockbuster, Netflix or cable OnDemand services.

There are downsides for customers in moving to the cloud. There are very different rights associated with “renting” a thing (content, cars, houses, etc.) versus owning it. How willing users will be to give up those rights for the convenience of the cloud is an open question. Likely, the convenience will override the long-term interest in the rights. Frequently, it isn’t until the owners of a service take something away in some fashion that people realize they don’t have any control over the cloud. If you’ve stored all of your photos on Flickr and the company deletes your account for whatever reason, you’ll wish that you had more control over the terms of service. From my perspective, I’d rather retain ownership and control of the content I’ve purchased in those areas where I’m invested in preserving access or rights to reuse. I don’t know that the majority of users share my view on this, likely because they don’t spend much time thinking about the potential impacts.

This is something libraries, in particular, should be focused on, having outsourced preservation of digital content to publishers and organizations like Portico.

However, I do know that looking at these distribution models is a huge opportunity for suppliers of content in all forms. The risk of failing to act or react is that a new upstart provider will displace the old. I grew up in Rochester, NY, where Kodak was king of photography around the world for decades. Now Kodak is but a shadow of its former self, looking for a new business model in an era of digital imaging rather than the film and processing that were its specialty.