Show Me the Data: Managing Data Sets for Scholarly Content
August 11, 2010

Below are the questions submitted during the NISO Data Sets webinar. Not all of the questions could be addressed during the live webinar, so those that could not be answered at the time are also included below; answers from the presenters will be added as they become available.

Speakers:

  • New Models for Publications and Datasets: Dryad
    Dr. Jane Greenberg, Professor, School of Information and Library Science, University of North Carolina at Chapel Hill
  • Persistent Citation & Identification for Datasets: DataCite and EZID
    Joan Starr, Strategic & Project Planning Manager, California Digital Library
    John Kunze, Associate Director, UC Curation Center, California Digital Library
  • From Documents to Data: Challenges in Linking, Aggregating and Citing
    Joel Hammond, Senior Director, Product Management & Development, Healthcare & Science, Thomson Reuters

Feel free to contact us if you have any additional questions about library, publishing, and technical services standards, standards development, or if you have suggestions for new standards, recommended practices, or areas where NISO should be engaged.

NISO Webinar Questions and Answers

  1. In Dryad, can there be more than one DOI for data package(s) for each article? Or is it envisioned that these will be one-to-one? Where is the support document for Dryad on the DOI process located?

    Answer (Jane Greenberg): 
    Currently, one article corresponds to one data package, and each data package is assigned one DOI. Individual data files are assigned additional DOIs that include the data package DOI as a stem. Updated versions of data packages and data files are also assigned unique DOIs with common stems. For details, see our wiki (https://www.nescent.org/wg_dryad/DOI_Usage).
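
    To make the identifier pattern concrete, here is a purely illustrative sketch in Python; the suffix forms it uses are assumptions for illustration only, and the authoritative conventions are the ones documented on the wiki above.

        # Illustrative only: compose Dryad-style identifiers from a data
        # package DOI stem. The suffix patterns below are assumptions; the
        # authoritative conventions are on the Dryad DOI usage wiki.

        def data_file_doi(package_doi, file_number):
            """Append a file suffix to the data package DOI (assumed pattern)."""
            return "%s/%d" % (package_doi, file_number)

        def versioned_doi(doi, version):
            """Append a version suffix to an existing DOI (assumed pattern)."""
            return "%s.%d" % (doi, version)

        package = "doi:10.5061/dryad.234"     # example data package DOI
        print(data_file_doi(package, 1))      # doi:10.5061/dryad.234/1
        print(versioned_doi(package, 2))      # doi:10.5061/dryad.234.2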

    We are still discussing how to accommodate cases where an article reuses one or more data files that are already in the repository, and other more complex relationships between data packages, data files, and articles.

  2. Any statistics on the use of the deposited data in Dryad? Any evidence that data sets in Dryad are indeed being reused? Have you been tracking whether other people are using the data in Dryad? Is there any feedback to the authors/submitters on how frequently their data is downloaded in Dryad?

    Answer (Jane Greenberg): 
    This is difficult to answer, since the repository is quite young. Most content has been there for less than a year, and awareness of Dryad is still low. Some datasets have nonetheless already been accessed and downloaded repeatedly (e.g., DOI:10.5061/dryad.234 was viewed 322 times and downloaded 102 times in July alone). Usage statistics such as these will soon be publicly viewable for all data packages. We regularly monitor usage information to identify those data packages that merit additional curatorial investment.

  3. Are you looking at using something other than DSpace down the road; why was DSpace chosen for Dryad?

    Answer (Jane Greenberg):
    We chose DSpace for several reasons:

    1. It was easy to set up an initial instance of the repository that had most of the functionality we needed. This was critical in the early stages of Dryad development, enabling us to get buy-in from the scientific community while we developed more advanced features.
       
    2. DSpace has strong support for submissions by end users. Although the standard DSpace submission system does not have the ease-of-use we desire, we have been able to expand on the basic framework, rather than completely building a submission system from scratch.
       
    3. The DSpace user community has needs that strongly align with the needs of the Dryad repository. This alignment benefits both groups as new features are developed. In the short time that Dryad has been active, we have already benefited from several features developed by other DSpace community members. On the other hand, some features developed specifically for Dryad have been added to the core DSpace software, which benefits the community while reducing the maintenance burden for the Dryad team in the future.

    Our current plan is to remain with DSpace. We do, however, strongly support the DuraSpace community's effort to create “DSpace with Fedora Inside,” as we believe the Fedora architecture is much better suited to long-term preservation than the current internal architecture of DSpace.

  4. Will there be relationships developed between Dryad and other information organizations, such as Web of Science?

    Answer (Jane Greenberg): Many have suggested that one of the strongest incentives for data archiving is that researchers could receive professional credit through data citations. However, much standardization is needed before data citations can be easily tracked. Dryad will work with organizations such as Thomson Reuters and CrossRef toward this goal. As part of this effort, we are actively engaged with DataCite activities through the California Digital Library and the British Library.

  5. Hi Jane, Could you elaborate on the degree of automation for e-mail exchanges, DOIs, etc.? I'm interested in further details about how to do the entity extraction/metadata creation step from data sources. Could you explain that further?

    Answer (Jane Greenberg): The process begins when a publisher sends an acceptance e-mail to the author and to Dryad. The e-mail includes bibliographic data for the article, an abstract, and keywords. Dryad automatically populates the “acquisition record” by parsing structured metadata from this e-mail. When the author follows the autogenerated URL to that record and uploads their data files, these metadata can be amended or augmented by the author and curator. The e-mails are generated automatically by the journal’s manuscript processing system as part of its editorial workflow.
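
    As a rough illustration of this kind of parsing (not Dryad's actual code; the message and field labels below are hypothetical), a structured acceptance e-mail body could be turned into a metadata record along these lines:

        # Hypothetical sketch: pull "Label: value" lines from the body of a
        # journal acceptance e-mail into a dictionary that could seed an
        # acquisition record. Field labels are invented for illustration.

        ACCEPTANCE_EMAIL_BODY = """
        Journal: Example Journal of Ecology
        Article Title: Data from: An example study
        Authors: Doe, J.; Roe, R.
        Keywords: example, illustration
        Abstract: A one-sentence abstract.
        """

        def parse_acceptance_email(body):
            """Extract labelled metadata lines from the message body."""
            record = {}
            for line in body.splitlines():
                line = line.strip()
                if ":" in line:
                    label, _, value = line.partition(":")
                    record[label.strip().lower()] = value.strip()
            return record

        print(parse_acceptance_email(ACCEPTANCE_EMAIL_BODY)["article title"])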

    The workflow for DOIs is more labor-intensive. Currently, DataCite DOIs are autogenerated (using EZID) upon deposit, then manually checked by a curator to make sure they resolve correctly before being sent to the journal and author. The curator monitors new issues of each journal and manually adds the DOI for each article as it is published. We are working to automate this workflow.

    Also see: Greenberg, J. (2009). "Theoretical Considerations of Lifecycle Modeling: An Analysis of the Dryad Repository," Cataloging & Classification Quarterly, 47 (3/4): 380-402. DOI: 10.1080/01639370902737547. Offprint under Dryad publications at: https://www.nescent.org/wg_dryad/Publications.

  6. I'm puzzled why there are data packages in Dryad in the first place. Why not simply assign DOIs to data set metadata?

    Answer (Jane Greenberg): 
    One reason for this is to enable the accumulation of data citations for the package as a whole, rather than have such citations be distributed among its parts. This is less disruptive of the citation conventions already in place for articles (such as limits to the number of references cited) and thereby helps lower the obstacles to achieving trackable data citations.

    Also, the naming convention for the data package “Data from: [article title]” reinforces the relationship to the article, helping ensure that researchers remain sensitive to the fact that the context provided by the original article may be critical for correct data reuse.

  7. On the slide showing the joint publication and Dryad workflow, what happens to the data sets when the article is rejected for publication? Do the data sets go no further? For the two journals that require data submission as part of the peer review process, what happens to the data if the article is not accepted for publication? Does Dryad still get to keep the data for sharing with other researchers?

    Answer (Jane Greenberg): Most partner journals expect their authors to deposit data only upon article acceptance. For the two journals that have requested that Dryad also host data during peer review, the unpublished data will only be accessible to individuals who have been given a unique code, and the data will be discarded after a limited holding period (likely no longer than a month).

  8. Is access to Dryad data limited by journal subscription?

    Answer (Jane Greenberg): Once published, all metadata and data are in the public domain through application of the Creative Commons Zero waiver. There is no paywall or login impeding access to the public domain content (either metadata or datafiles), though some content may have a time-limited embargo before it is exposed.

    A related question is what limits Dryad places on who can deposit. Authors are free to submit data associated with a peer-reviewed article from any journal, with two conditions. First, the author must be registered. Second, while deposit is currently free, in the future it may remain free only when the journal is a subscriber to the repository (see the next question for more information).

  9. For how long is Dryad funded?

    Answer (Jane Greenberg): The primary development grant from the National Science Foundation expires in August 2012, so one critical aspect of the work being funded by that grant is planning for financial self-sustainability. A business model is currently being negotiated among the partner journals in the Dryad Consortium, in which Dryad is to be supported over the long term by subscriptions from journals, societies, and publishers. We expect that Dryad will begin to recover its operational costs starting in January 2012, and already have commitments from a number of partners. For more information, see Neil Beagrie, Lorraine Eakin-Richards, and Todd Vision's Business Models and Cost Estimation: Dryad Repository Case Study, which will be presented at iPRES in Vienna this September, and Todd J. Vision (2010), "Open Data and the Social Contract of Scientific Publishing," BioScience 60(5): 330-330, doi:10.1525/bio.2010.60.5.2. Both are also accessible under Dryad publications at: https://www.nescent.org/wg_dryad/Publications.

  10. Hi Jane - I wondered what percentage of the total number of authors have donated their datasets. Is it significantly better than the average rate at which academic authors deposit in digital repositories?

    Answer (Jane Greenberg): Deposit to Dryad is currently voluntary for all journals. The fraction of articles published by a journal with data in Dryad varies by subdiscipline. This is due, in part, to the fact that not all articles produce data appropriate for Dryad. For example, an article may be purely theoretical and produce no data, or produce data most appropriate for a specialized archive (e.g., GenBank).

    A representative journal, Molecular Ecology, has been inviting authors to voluntarily deposit in Dryad for over 12 months. Authors have submitted data for roughly 25% of the articles; we don’t know what proportion of the remaining 75% have data that is appropriate for Dryad.

    We do expect a drastic increase in the rates of deposit starting in January 2011, at which point journals that have adopted the Joint Data Archiving Policy will require data archiving as a condition of publication.

    Although we are not familiar with specific comparable numbers for institutional data repositories, we know that deposit rates vary for IRs, depending on topical scope and infrastructure support.

  11. What do you know about how the users of Dryad are searching for data sets? You have good descriptive metadata, but you don't have browse by subject/keyword. Why don't you allow other browse facets?

    Answer (Jane Greenberg):
    We are exposing Dryad content through both a website GUI and a growing number of search APIs and metadata exchange services. We don’t know yet which of these will be the primary method for discovering content, so we are balancing development effort between both paths.
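
    For example, DSpace-based repositories typically expose an OAI-PMH interface for metadata exchange. A minimal harvesting sketch, using a placeholder base URL rather than a confirmed Dryad endpoint, might look like this:

        # Minimal OAI-PMH harvesting sketch against a DSpace-style repository.
        # The base URL is a placeholder; check the repository's documentation
        # for its actual OAI-PMH endpoint.

        import urllib.parse
        import urllib.request
        import xml.etree.ElementTree as ET

        BASE_URL = "https://repository.example.org/oai/request"  # placeholder
        DC = "{http://purl.org/dc/elements/1.1/}"  # Dublin Core namespace

        params = urllib.parse.urlencode(
            {"verb": "ListRecords", "metadataPrefix": "oai_dc"})
        with urllib.request.urlopen(BASE_URL + "?" + params) as response:
            tree = ET.parse(response)

        # Print the Dublin Core titles in the first page of results.
        for title in tree.iter(DC + "title"):
            print(title.text)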

    Subject keywords are not yet drawn from or mapped to controlled vocabularies. We plan to support that in the future, at which point we will support browsing by subject keywords. Until that time, users can still use subject keywords in free-text searches.

  12. During the DataCite presentation, micro services were mentioned. What's a micro service?

    Answer (Joan Starr, John Kunze): Micro-services are an approach to digital curation based on devolving curation functions into a set of independent, but interoperable, services that embody curation values and strategies. Since each of the services is small and self-contained, they are collectively easier to develop, deploy, maintain, and enhance. Equally important, they are more easily replaced when they have outlived their usefulness. Although the individual services are narrowly scoped, the complex functionality needed for effective curation emerges from the strategic combination of individual services.

    Micro-services provide a curation environment that is comprehensive in scope, yet flexible with regard to local policies and practices and the inevitability of disruptive technological change. Micro-services can be deployed in the environments in which they make the most sense, both technically and administratively. While UC3 will continue to use micro-services as the basis for its centrally-managed curation activities (for example, the Digital Preservation Repository: http://www.cdlib.org/services/uc3/dpr.html), micro-services can also be operated in local campus environments, either individually or in strategic combinations.

    The initial set of micro-services can be grouped into four categories that provide incrementally increasing levels of preservation assurance and curation value. For more information and documentation, see the UC3 Curation wiki (https://confluence.ucop.edu/display/Curation/Home).
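
    As a toy sketch of the design principle only (these are invented services, not UC3's actual micro-services), two small, self-contained functions can be combined into a larger curation step:

        # Toy illustration of the micro-services idea: small, self-contained
        # functions that can be combined, replaced, or deployed independently.
        # These are invented examples, not UC3's actual micro-services.

        import hashlib
        import json
        from pathlib import Path

        def fixity(path):
            """Tiny 'fixity' service: return the SHA-256 digest of a file."""
            return hashlib.sha256(Path(path).read_bytes()).hexdigest()

        def inventory(paths):
            """Tiny 'inventory' service: record file names and digests as JSON."""
            return json.dumps({Path(p).name: fixity(p) for p in paths}, indent=2)

        if __name__ == "__main__":
            # Combining the two services yields a simple preservation record.
            print(inventory(Path(".").glob("*.txt")))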

  13. What is meant by "lower cost" in the DataCite presentation?

    Answer (Joan Starr, John Kunze): 
    In our presentation, we talked about CDL's new service called EZID. EZID will offer both DOIs and ARKs. Both are persistent identifiers, but they have a few differences. ARKs can be deleted, if necessary, and they are free now and will remain free. DOIs can't be deleted, and while they are currently no-cost on an introductory basis for our early adopters, we will be implementing a cost-recovery (i.e., sustainability) charge at some point in 2011. This is because we are charged for DOIs at the point of registration, and then again on an annual basis.
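
    As a minimal sketch of what minting an identifier through EZID can look like, assuming its documented HTTP/ANVL interface, the example below uses placeholder credentials, shoulder, and target URL; the EZID API documentation should be treated as authoritative.

        # Minimal sketch of minting an identifier via the EZID HTTP API,
        # assuming the documented REST/ANVL interface. The shoulder,
        # credentials, and target URL below are placeholders.

        import base64
        import urllib.request

        EZID = "https://ezid.cdlib.org"
        SHOULDER = "ark:/99999/fk4"              # placeholder test shoulder
        USER, PASSWORD = "username", "password"  # placeholder credentials

        # ANVL metadata: one "element: value" pair per line.
        metadata = "_target: https://example.org/my-dataset\n"

        credentials = base64.b64encode(
            ("%s:%s" % (USER, PASSWORD)).encode("utf-8")).decode("ascii")
        request = urllib.request.Request(
            EZID + "/shoulder/" + SHOULDER,
            data=metadata.encode("utf-8"),
            headers={"Content-Type": "text/plain; charset=UTF-8",
                     "Authorization": "Basic " + credentials})

        with urllib.request.urlopen(request) as response:
            # On success EZID returns a line such as "success: ark:/99999/fk4..."
            print(response.read().decode("utf-8"))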

  14. Can DOIs be duplicated? For example, DataVerse assigns a DOI to each dataset it receives.

  15. How are you all thinking about restricted-use data that contains personally identifiable information? This is a major concern from our standpoint of harm reduction in data dissemination.

    Answer (Jane Greenberg): 
    Sensitive data (human subjects, endangered species coordinates, etc.) are excluded from Dryad as per journal archiving policies.

  16. Would datasets also include information on collection methodology?

    Answer (Jane Greenberg): 
    This information is typically found in the article. Depositors may add further methodological details if they choose, but doing so is optional.

  17. I'm interested in further details about how to do the entity extraction / metadata creation step from data sources. Could you explain that further?

    Answer (Jane Greenberg): 
    If this question is directed toward the Dryad project, our approach is to parse and harvest descriptive metadata from the author acceptance e-mails, which are copied to Dryad. The metadata recorded in the author acceptance e-mail is used to generate the base-level data package metadata record, and the author can modify this metadata when depositing data files.

    An e-mail template was presented during the Dryad presentation on Aug. 11 [see the slides for more information]. The author acceptance e-mail includes basic bibliographic metadata for an article citation, as well as keywords, abstract, and depositor contact information.

    Selected metadata properties and their content are pushed through the Dryad workflow and automatically propagated for each data file. Several aspects of the workflow are described in Greenberg, J. (2009), "Theoretical Considerations of Lifecycle Modeling: An Analysis of the Dryad Repository," Cataloging & Classification Quarterly, 47 (3/4): 380-402. DOI: 10.1080/01639370902737547.
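
    A minimal sketch of what this propagation could look like in practice is shown below; the field names and helper function are hypothetical and for illustration only.

        # Hypothetical sketch: copy selected package-level metadata into a
        # record for each data file. Field names are invented for illustration.

        PROPAGATED_FIELDS = ("journal", "article title", "authors", "keywords")

        def file_records(package_record, filenames):
            """Build a per-file record seeded from the package metadata."""
            shared = {key: value for key, value in package_record.items()
                      if key in PROPAGATED_FIELDS}
            return [dict(shared, filename=name) for name in filenames]

        package = {"journal": "Example Journal of Ecology",
                   "article title": "Data from: An example study",
                   "authors": "Doe, J.; Roe, R.",
                   "keywords": "example, illustration",
                   "abstract": "A one-sentence abstract."}

        for record in file_records(package, ["matrix.nex", "alignment.csv"]):
            print(record["filename"], "-", record["article title"])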

  18. I'd like to know if there's been any evidence that linking the data to the article will cause the journal to increase price? Has there been any research linking journal costs to data linking?

    Answer (Jane Greenberg): 
    It is clear that demand on journals for hosting supplementary data is increasing, in some cases dramatically. In one extreme case, a journal that we interviewed went from 32 articles with supplementary data in 2000 to 251 in 2009, nearly an eightfold increase. There are several examples of journals that now charge authors to host supplementary materials ($300 for the Journal of Clinical Investigation, $100 per file in the case of the FASEB Journal).

  19. No one has addressed standards for the data/datasets themselves. Are there formatting standards for Dryad and how do variations in data formats influence use and reuse?

    Answer (Jane Greenberg): 
    Formatting standards would typically be set by the journal. In some fields, like phylogenetics, there are widely used standards. But Dryad is designed to accommodate data even where such standards do not exist or are not widely followed. Experience with other repositories has shown that scientific content and format standards evolve only after communities of practice grow up around archived content.

  20. What is the name of your metadata curation tool?

    Answer (Jane Greenberg): 
    Dryad is a DSpace repository. We are working to improve the curation interface and functionality to better support our needs.

  21. Question for Joel Hammond: you mentioned ICPSR as exemplary, but do you have published requirements for "good enough" data repositories that your published articles would link to?