We held the last of the Mellon-funded Thought Leader Meeting series on Wednesday. The topic of this meeting was research data, and it explored many of the issues surrounding the use, reuse, preservation, and citation of data in scholarship. Like the three previous meetings, it was a great success. The meeting brought together a number of representatives from the research, publisher, library and system developer communities. A list of the representatives is below.
Research data is becoming increasingly critical in almost every area of scholarship. From census data to high-energy physics, and medical records to the humanities, the range of data types, and of the uses to which researchers put this data, has expanded dramatically in the past decade. Managing this data, including finding, accessing and curating it, is a growing problem. A report produced by IDC earlier this year concluded that the amount of digital data created exceeded the total available storage capacity in the world. Determining which aspects are most valuable and adding value through curation will be a tremendous project in the coming decades.
In order to be useful (in a scientific sense), data needs to be verifiable, identifiable, referenceable, and preservable, much in the way that published materials are. Obviously, this poses many questions: When referring to a data set that is constantly being updated or appended, what would you be citing? What if the results are modeled from a subset? In that case, the data set as a whole is less relevant to the citation than which portion of the larger set was used, along with the model and criteria that were used in the analysis. Additionally, the models and software applied to a specific data set would be critical to determining the validity of any results or conclusions drawn from the data. In the peer-review process of science, each of these aspects would need to be considered. Some publishers are already considering these issues and review criteria. In the future, these issues will only grow for publishers, societies and scientists as they consider the output of science.
Another issue is the variety of life cycles for different types of data. In fields such as chemistry, the usefulness of a data set has a much shorter half-life than it might in the humanities or social sciences. This could affect the value proposition of whether to curate a data set. Some work done by the JISC had focused on mandating deposit of materials for the purpose of preservation. Unfortunately, the project didn’t succeed and was withdrawn in 2007. One potential reason that an investment of more than $3 million turned out to be a disappointment was its focus on archiving and preserving the deposited data rather than on the reuse and application of that data. For preservation to be deemed worth the investment, a simultaneous focus on reuse is critical to ensuring that the investment sees some form of return beyond a large repository of never-accessed data.
While some of the day’s discussion concerned encouraging the use and sharing of research data and methodologies, technical standards will not help with what is inherently a political question. Many of the rewards and recognition in the scholarly process come back to the formalities of publication, which have developed over centuries. As with many standards-related questions, the problems are not normally related to technologies per se, but often hinge on the political or social conventions that support certain activities. That said, the development of citation structures, descriptive metadata conventions, discovery methodologies, and curation strategies will reinforce the growing trend of using these forms of data in scholarly communications. By expanding their use and ensuring that the content is preserved and citable, NISO could help encourage expanded use of data in the communication process.
The report of this meeting will be publicly available in a few weeks on the NISO website along with the other reports. NISO’s leadership committee structure will be reviewing the recommendations and deciding which initiatives to push forward with in the coming months.
Research Data Thought Leader Participants:
Clifford Lynch, Coalition for Networked Information
Ellen Kraffmiller, Dataverse Network
Paul Uhlir, National Academy of Sciences
Lars Bromley, AAAS
Robert Tansley, Google
Jean-Claude Bradley, Drexel University
Camelia Csora, 2collab, Elsevier
MacKenzie Smith, MIT Libraries – DSpace
Stuart Weibel, OCLC