Comment #00172 - Support for an independent Collection XML - z39-96-dsftu-final.pdf

Comment 172
New (Unresolved)
NISO Z39.96-201x, JATS: Journal Article Tag Suite (Draft Standard for Trial Use) (Revision 0)
Comment Submitted by
Nikos Markantonatos
2011-09-30 03:31:49
The current article model describes a set of metadata which not only encodes information about the article itself, but also attempts to capture the context of where, how and by whom the article was published. It also assumes that the article was published at most once, since it only allows for at most one instance of <journal-meta> inside <front> and for at most one instance of <volume> inside <article-meta>.

Furthermore, a closer inspection of the <article-meta> model reveals a number of metadata elements which relate not to the article itself but to the volume and/or issue in which the article was published. Examples of such elements are <volume>, <volume-id>, <volume-series>, <issue>, <issue-id>, <issue-sponsor>, <issue-title>, <issue-part>, <isbn>, etc. Finally, the entire <journal-meta> model defined under the article <front> element describes metadata about the publication that hosts the article (<journal-id>, <journal-title-group>, <issn>, etc.) and the publisher that publishes it (<publisher-name>, <publisher-loc>).

An immediate implication of this logic is that articles belonging to the same issue repeat a lot of their metadata information, since that belongs to the journal, volume or issue and not to the article. Repetition is prone to errors and metadata inconsistencies are hard to resolve.
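
To make the redundancy problem concrete, here is a minimal sketch that compares the repeated journal, volume and issue fields across two article files from the same issue. The element names come from the draft; the sample XML (including the ISSN values) and the function names are hypothetical and exist only for illustration.

```python
import xml.etree.ElementTree as ET

# Two hypothetical, heavily abbreviated article files from the same issue.
# The ISSN in ARTICLE_B contains a deliberate transposition error.
ARTICLE_A = """<article><front>
  <journal-meta><journal-id>exj</journal-id><issn>1234-5678</issn></journal-meta>
  <article-meta><volume>12</volume><issue>3</issue></article-meta>
</front></article>"""

ARTICLE_B = """<article><front>
  <journal-meta><journal-id>exj</journal-id><issn>1234-5687</issn></journal-meta>
  <article-meta><volume>12</volume><issue>3</issue></article-meta>
</front></article>"""

# Fields that describe the journal, volume or issue rather than the article.
SHARED_PATHS = [
    "front/journal-meta/journal-id",
    "front/journal-meta/issn",
    "front/article-meta/volume",
    "front/article-meta/issue",
]


def shared_metadata(article_xml):
    """Return {path: text} for the repeated journal/volume/issue fields."""
    root = ET.fromstring(article_xml)
    return {path: root.findtext(path) for path in SHARED_PATHS}


def find_conflicts(articles):
    """Return the paths whose values differ between the given articles."""
    records = [shared_metadata(a) for a in articles]
    return [p for p in SHARED_PATHS if len({r[p] for r in records}) > 1]


conflicts = find_conflicts([ARTICLE_A, ARTICLE_B])
print(conflicts)
```

With the metadata held once per collection, a check like this would not be needed, because there would be no second copy to disagree with.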

On the other hand, when an article is published in two places simultaneously, for example in a printed issue and at the same time in an electronic-only collection, the article XML fails to capture metadata about one of its publishing instances, owing to the above limitation in the model. Any attempt to enhance the current model to allow for multiple publication instances in the <article> model would require a major restructuring of <article-meta> to move all volume- and issue-level information out of its scope.

Another limitation of the current association of publisher, journal, volume and issue metadata with the article is that information associated with these higher-level publication entities cannot be encoded in the article XML. Examples of such information include a) a PDF for the entire issue, b) an index for the issue, c) a PDF for the Table of Contents, d) a pointer to the "instructions to the authors" section, e) the issue DOI and self URI, f) a link to all issue ads, g) links to issue-level supplementary material, h) the issue cover image and its associated caption, i) links to related issues or publications.

Finally, the information needed to encode the Table of Contents and its associated elements lies dispersed across the article XMLs of an issue, rather than being isolated in a single place which lists all articles participating in the Table of Contents along with headings, page numbers, issue titles, etc.
Submitter Proposed Solution
The proposed solution, which attempts to address all the problems mentioned above, is to create an independent XML file, with its own root, which maintains all information about a collection of articles. For the purposes of this proposal, I will term this new file the Collection XML. The Collection XML would then be the source of information for:

1) journal and publisher metadata currently under <journal-meta>,
2) volume and issue metadata currently under <article-meta>,
3) the list of articles included in the collection,
4) additional information pertaining to the collection,
5) table of contents information,
6) alternative tables of contents for the collection.

The proposal calls for an optional Collection XML for each collection of articles. Such collections can range from early articles published ahead-of-print for one or more journals, to articles included in a printed issue, an electronic-only collection of arbitrary articles on a topic, the articles that form a conference proceedings, or any other collection for which we wish to maintain explicit metadata.

Each Collection XML records items #1-#6 above. Items #1 and #2 help remove the undesirable metadata redundancy from article XMLs and define an authoritative source for these metadata. Note that although it is possible to remove redundant metadata from article XMLs, it is not necessary to do so, if one wishes to preserve self-contained article XMLs.
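
As a rough illustration of items #1 through #3, the sketch below parses a small Collection XML and reads the shared metadata and the article list from it. This is not the model from the paper referenced below: <journal-meta>, <journal-id>, <publisher>, <publisher-name>, <volume> and <issue> are existing JATS names, while <collection>, <collection-meta>, <article-list>, <article-ref> and all sample values are names I have assumed purely for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical Collection XML. Only the journal-meta children, <volume> and
# <issue> are real JATS element names; the rest is invented for this sketch.
COLLECTION = """<collection>
  <journal-meta>
    <journal-id>exj</journal-id>
    <publisher><publisher-name>Example Press</publisher-name></publisher>
  </journal-meta>
  <collection-meta>
    <volume>12</volume>
    <issue>3</issue>
  </collection-meta>
  <article-list>
    <article-ref id="a1" href="a1.xml"/>
    <article-ref id="a2" href="a2.xml"/>
  </article-list>
</collection>"""


def collection_metadata(root):
    """Items #1 and #2: one authoritative copy of the shared metadata."""
    return {
        "journal-id": root.findtext("journal-meta/journal-id"),
        "publisher-name": root.findtext("journal-meta/publisher/publisher-name"),
        "volume": root.findtext("collection-meta/volume"),
        "issue": root.findtext("collection-meta/issue"),
    }


def article_hrefs(root):
    """Item #3: the member articles, read directly from the collection."""
    return [ref.get("href") for ref in root.iterfind("article-list/article-ref")]


root = ET.fromstring(COLLECTION)
metadata = collection_metadata(root)
hrefs = article_hrefs(root)
print(metadata)
print(hrefs)
```

Article XMLs may still carry their own copies of these fields if self-contained files are preferred; in that case the Collection XML would serve as the reference against which those copies are checked.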

Item #3 positively identifies the list of articles included in a collection without resorting to a costly exhaustive search. Item #4 associates the collection with additional information related to it, such as items (a) through (i) mentioned above. Item #5 helps maintain all additional information that is necessary for rendering a Table of Contents, such as the sequence of article entries, page ranges, nested headings, annotations, graphics, etc.
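
For item #5, a minimal sketch of how a Table of Contents with nested headings and page ranges might be rendered from a single place. The <toc>, <toc-heading> and <toc-entry> elements, their attributes, and the sample titles are all assumptions made for this example, not part of JATS or of the model in the referenced paper.

```python
import xml.etree.ElementTree as ET

# Hypothetical table of contents fragment of a Collection XML.
TOC = """<toc>
  <toc-heading title="Editorial">
    <toc-entry rid="a1" fpage="1" lpage="2">Welcome to the issue</toc-entry>
  </toc-heading>
  <toc-heading title="Research Articles">
    <toc-entry rid="a2" fpage="3" lpage="14">A first study</toc-entry>
    <toc-heading title="Short Reports">
      <toc-entry rid="a3" fpage="15" lpage="18">A brief note</toc-entry>
    </toc-heading>
  </toc-heading>
</toc>"""


def render(node, depth=0):
    """Walk headings and entries in document order, indenting by nesting depth."""
    lines = []
    indent = "  " * depth
    for child in node:
        if child.tag == "toc-heading":
            lines.append(f"{indent}{child.get('title')}")
            lines.extend(render(child, depth + 1))
        elif child.tag == "toc-entry":
            pages = f"{child.get('fpage')}-{child.get('lpage')}"
            title = child.text.strip()
            lines.append(f"{indent}{title} ({child.get('rid')}), pp. {pages}")
    return lines


toc_lines = render(ET.fromstring(TOC))
print("\n".join(toc_lines))
```

Because entry order and heading nesting are stored explicitly, the rendering does not need to open any article XML or infer the sequence from page numbers.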

Item #6 proves useful for collections where there is need for additional Tables of Contents driven by requirements to render the collection contents in multiple facets or in multiple languages.

For a more complete description of the Collection XML suggested here and a concrete model for its implementation, please refer to the JATS-Con 2011 paper "Article vs Issue XML: Capturing the Table of Contents under the NLM DTD" at

http://www.ncbi.nlm.nih.gov/books/NBK57236/

Note that what is referred to in that paper as Issue XML is the exact same concept as the Collection XML suggested in this proposal.