Report on the Workshop on Electronic Thesauri
November 4-5, 1999




NISO/APA/ASI/ALCTS



Report by Jessica Milstead

Background and Planning

NISO (the National Information Standards Organization), APA (the American Psychological Association), ASI (the American Society of Indexers), and ALCTS (the Association for Library Collections and Technical Services) sponsored an invitational workshop on November 4-5, 1999, in Washington, DC, to investigate the desirability and feasibility of developing a standard for electronic thesauri. The review of the ANSI/NISO Z39.19-1993 (R1998) standard (Guidelines for the Construction, Format, and Management of Monolingual Thesauri) had recommended such an investigation.

 

The workshop was planned by a Planning Committee consisting of Joseph Busch, Datafusion; Peter Ciuffetti, KnowledgeCite Library; Margie Hlava, Access Innovations (and SLA); Gail Hodge, IIa (and ASIS); Nancy Knight, NISO; Kate Mertes, RIAG (and ASI); Jessica Milstead, JELEM (organizer); Stuart Nelson, NLM; Gertrude Ostrove, LC; Diane Vizine-Goetz, OCLC; and Joyce Ward, Northern Light.

 

The Planning Committee met in teleconference on Tuesday, August 10, 1999, and agreed on scope, topics to be covered, and format of the workshop.


Workshop Scope

The definition of "thesaurus" for purposes of this meeting was broader than that of the present standard for thesauri, ANSI/NISO Z39.19-1993 (R1998). The meeting considered vocabularies that meet two basic criteria:

* use to facilitate analysis of texts and their subsequent retrieval (or retrieval of the information which they contain); and

* inclusion of a rich set of semantic relationships among their constituent terms.

The scope included (among others): standard thesauri, subject heading lists, semantic networks, and taxonomies (Internet directories). It excluded: simple term lists, with or without equivalence relationships; lists of terms whose only relationship is that of co-occurrence in documents; and lists of terms whose primary purpose is to provide definitions (e.g., dictionaries and glossaries).


Key Issues Considered

The committee identified four key issues:

* The need for (and feasibility of developing) a standard that speaks to criteria and/or methods for generating thesauri by machine-aided or automatic means.

* The need for (and feasibility of developing) a standard set of tools which show semantic relationships among terms, as aids to text and information analysis and retrieval.

* The need for (and feasibility of developing) a standard structure that supports a variety of electronic thesaurus displays.

* The need for (and feasibility of developing) a standard that supports interoperability protocols, structures, and/or semantics applicable to thesauri.

 

Each issue was to be addressed by an invited presenter, who would provide a brief introduction to the importance of the issue, followed by lengthy discussion by the group.

 

In addition, four secondary issues were identified, to be discussed in breakout sessions and reported back to the full meeting:

* Prescriptions for term structure

* Schemata for relationships (not limited to those of the present standard)

* Structures to support both vocabulary control and vocabulary management

* Guidance on incorporating leaf nodes

The Meeting

Approximately 65 participants, representing primary and secondary publishers, information services, an online retailer, and libraries, attended the meeting. James Anderson of Rutgers University served as keynoter, providing an overview of the standard-setting process and of the need for standards for electronic thesauri. Margie Hlava served as moderator, keeping the meeting on track, and Jessica Milstead served as recorder.

 

The four issues were presented by Joyce Ward, Dagobert Soergel, Eric Johnson, and John Kunze, respectively. As the lively discussion progressed, the group decided that it was better to dedicate the breakouts to the four major issues that were under discussion, rather than to the secondary issues that were originally planned. Guidance on most aspects of the secondary issues was emerging in any case.

 

After the four issue presentations and discussion, the breakout sessions were held over lunch on November 5. When the group reconvened in the early afternoon, a recorder presented the findings of each session, supplemented by input from other participants in the session. These findings were followed by general discussion and development of recommendations to NISO.

 

Summary of the Discussions

The discussions following each issue presentation and during the wrap-up were wide-ranging. To avoid repetitiveness, this summary first presents some general themes which recurred throughout the presentations and discussions. It then lists points that were made by speakers and participants for each topic and presentation. Recommendations follow at the end of the report.

 

While the call for the workshop used the word "thesauri," this term was found to be too constricting; its use tended to limit thinking to the Z39.19 standard and similar thesauri. It is important to keep in mind that the emphasis of the workshop was on controlled/managed vocabularies in general.

General Themes from the Presentations and Discussions

* Interoperability, shareability, and reusability are critical. One aspect of shareability is the need for shareable input files, which thesaurus management software does not provide today.

* A standard interchange format is required, one that will support displays at a variety of levels.

* Scope and definition of "thesaurus"? This was resolved in favor of inclusiveness, including any kind of vocabulary, regardless of its structure and organization. However, broad classification schemes should be distinguished from thesauri with tens of thousands of terms.

* A hierarchy of attributes is needed, permitting a query to be degraded back to a simpler level (e.g., author or variant) if the precise one (e.g., corporate author or abbreviation) is not supported by a particular database.

* Making the standard simple and thesaurus building easier is also critical.

* The concern is with concepts, but these are expressed by means of terms.

* Vocabulary mapping: demand is increasing, but this is not a trivial exercise. Mapping terms is not equivalent to mapping concepts, and mapping may result in development of a lowest common denominator, losing some of the richness of the individual vocabularies.

* The goal is to enable users (searchers, indexers, lexicographers) to cross the boundaries from one thesaurus to another.

* The standard needs to address terminology and definitions (e.g., what is a term, a concept, a word).

* The standard committee should include not just thesaurus builders, but also retrieval system vendors, software designers (vendors), and a knowledgeable user. It should also include the music and music video industries, retail directories, and representatives of the various kinds of vocabularies (thesauri, classifications, ontologies, taxonomies, subject headings).
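
The hierarchy of attributes mentioned above, in which a query is degraded to a simpler attribute when a database does not support the precise one, might be sketched roughly as follows. This is only an illustration: the parent table, the function name, and the "personal author" and "acronym" entries are assumptions added for the example; only the author/corporate author and variant/abbreviation pairs come from the discussion.

```python
# Hypothetical attribute hierarchy: each specific attribute points to its
# more generic parent. Only "corporate author" -> "author" and
# "abbreviation" -> "variant" come from the workshop discussion.
ATTRIBUTE_PARENT = {
    "corporate author": "author",
    "personal author": "author",   # assumed for illustration
    "abbreviation": "variant",
    "acronym": "variant",          # assumed for illustration
}


def degrade_attribute(attribute, supported):
    """Return the most specific supported attribute at or above `attribute`.

    Walks up the hierarchy until it finds an attribute the target database
    supports; returns None if nothing on the path is supported.
    """
    current = attribute
    seen = set()  # guards against an accidental cycle in the table
    while current is not None and current not in seen:
        if current in supported:
            return current
        seen.add(current)
        current = ATTRIBUTE_PARENT.get(current)
    return None


rich_db = {"author", "corporate author", "variant"}
simple_db = {"author", "variant"}

print(degrade_attribute("corporate author", rich_db))    # precise attribute kept
print(degrade_attribute("corporate author", simple_db))  # degraded to "author"
print(degrade_attribute("abbreviation", simple_db))      # degraded to "variant"
```

A database that supports neither the precise attribute nor any of its ancestors yields None, which a search system could treat as "attribute not searchable here."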


Jim Anderson's Keynote

* Nouns have been preferred to date for traditional thesauri. Given the needs of natural language, we should move beyond this limitation.

* There is a need to be more open about what constitutes a "bound" or "compound" term, since what constitutes a single concept may vary depending on the domain.

* Different kinds of organizations need different kinds of relationships; a large number of kinds of relationships are used in natural language processing (NLP).

* Research has shown a need to make many variants and synonyms available to assure retrieval of concepts.

* Negative research that appears to show that use of a thesaurus has harmed retrieval usually reflects misuse of the thesaurus, not problems with the thesaurus as such.

* The Dublin Core is a potential model for shareability, from the point of view of its separation of semantics from syntax. Agreeing on the semantics may be more practical for different communities.

* The goal is not simply to express the present standard in electronic form. Tools for natural language processing may be richer than what can be expressed in a standard thesaurus.

* The need for use of multiple vocabularies is increasing.

* Incompatible relationships cause problems in mapping vocabularies.


Issue 1 (Joyce Ward, Northern Light)

The need for (and feasibility of developing) a standard that speaks to criteria and/or methods for generating thesauri by machine-aided or automatic means.

* Is it feasible to develop generally useful methods for determining meaningful co-occurrences of terms, or are they domain-specific?

* A standard representation of a term should be specified, in order to permit normalization.

* Standard thesauri for companies, products, etc., would be desirable.

* Stemming methods are another problem; some software makes assumptions that cause problems for morphological rules.

* Who are the users both of the potential standard and of any vocabularies that are created?

* Desirable relationships may be different for indexers and end users.


Issue 2 (Dagobert Soergel, University of Maryland)

The need for (and feasibility of developing) a standard set of tools which show semantic relationships among terms, as aids to text and information analysis and retrieval.

* A standard list of relationships is needed, probably hierarchical, permitting specific types to be mapped to generic relationship types. An ALA subcommittee reviewed the literature and produced a report on relationship types that presently exist (see Resources Noted section below). The AAT has also done some work on types of associative relationships. UMLS labels some relationships with attributes.

* The culture of vocabulary development assumes a finite set of relationships, but an unbounded set of terms. However, this does not necessarily have to be the case.

* Specialized or integrated tools for semantic relationships are required.

* Retrieval goals are various: expansion or focusing of a topic, finding items that are related to a topic in different ways, or documents that treat the topic from a particular perspective or at a different level.

* Can semantic networks and relationships be the same across disciplines, or should they vary? Current relationships appear to be dominantly those of engineering. Humanities might be different.

* Relationships differ from culture to culture.

* Even when terms are mapped successfully, different relationship structures need to be available.

* The model of multilingual thesauri is appropriate for thesaurus mapping.

* Semantic relationships must accommodate both concepts and terms.

* A registry of thesaurus metadata, including but not limited to relationship types, was suggested. The existence of the NKOS (Networked Knowledge Organization Systems) group and its list was noted. (See Resources Noted section below.)

* Three levels of metadata: format for describing relationship types, set of relationship definitions that follows the format, and a knowledge base of concepts, terms, and relationships. Recognize, however, that present-day metadata formats have not gone beyond "Keyword" in providing for subject information.
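
The three levels of metadata and the mapping of specific relationship types to generic ones might be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a proposed schema: the class, the type names ("part-of", "kind-of", "causes"), the definitions, and the sample triples are all invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


# Level 1: a format for describing relationship types (hypothetical fields).
@dataclass(frozen=True)
class RelationshipType:
    name: str
    parent: Optional[str]  # more generic type, or None for a top-level type
    definition: str


# Level 2: a set of relationship definitions that follows the format.
# All names and definitions here are illustrative assumptions.
DEFINITIONS = {
    rt.name: rt
    for rt in [
        RelationshipType("hierarchical", None, "One concept is broader than the other."),
        RelationshipType("associative", None, "Concepts are related but not hierarchically."),
        RelationshipType("kind-of", "hierarchical", "Genus/species relationship."),
        RelationshipType("part-of", "hierarchical", "Whole/part relationship."),
        RelationshipType("causes", "associative", "One concept brings about the other."),
    ]
}

# Level 3: a knowledge base of concepts, terms, and relationships.
KNOWLEDGE_BASE = [
    ("engine", "part-of", "automobile"),
    ("sedan", "kind-of", "automobile"),
    ("smoking", "causes", "lung cancer"),
]


def generic_type(name):
    """Follow parent links up to the top-level (generic) relationship type."""
    rt = DEFINITIONS[name]
    while rt.parent is not None:
        rt = DEFINITIONS[rt.parent]
    return rt.name


def generalize(triples):
    """Rewrite each triple using only generic relationship types."""
    return [(subject, generic_type(rel), obj) for subject, rel, obj in triples]


for triple in generalize(KNOWLEDGE_BASE):
    print(triple)
```

A system that understands only the generic types could still use the generalized triples, while a richer system keeps the specific ones.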


Issue 3 (Eric Johnson, University of Illinois)

The need for (and feasibility of developing) a standard structure that supports a variety of electronic thesaurus displays.

* Displays should use an open, widely supported format that is useful for different browsing tools. XML was suggested, but it was pointed out that a more abstract level might also be needed, e.g. UML. RDF was also favored. Conclusion: the standard should give general principles, because the specific implementation will probably change.

* Displays should augment indexing and retrieval tools.

* They should be available to agent-based and distributed-object systems.

* A "soft" standard should cause developers to give full consideration to computer-displayed thesauri; some paper-type displays are not needed in the electronic format.

* A "hard" standard covers data needed for the displays, so that thesauri can be used by any thesaurus browser.
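
The separation between the data needed for displays and the displays themselves might be illustrated roughly as follows, using XML since it was one of the formats suggested. The element and attribute names, the function names, and the sample term are assumptions made for the example; they are not drawn from any existing thesaurus format.

```python
import xml.etree.ElementTree as ET


def term_to_xml(label, relationships):
    """Build a hypothetical XML record for one term.

    `relationships` is a list of (relationship type, target term) pairs.
    """
    record = ET.Element("term")
    ET.SubElement(record, "label").text = label
    for rel_type, target in relationships:
        rel = ET.SubElement(record, "relationship", {"type": rel_type})
        rel.text = target
    return record


def alphabetical_display(record):
    """Render one possible display from the record: the term followed by
    its relationships, one per indented line."""
    lines = [record.findtext("label")]
    for rel in record.findall("relationship"):
        lines.append(f"  {rel.get('type')} {rel.text}")
    return "\n".join(lines)


record = term_to_xml(
    "cats",
    [("BT", "mammals"), ("NT", "Siamese cats"), ("RT", "veterinary medicine")],
)

# The serialized record is what would be exchanged between systems.
print(ET.tostring(record, encoding="unicode"))

# Any browser that understands the record can produce its own display.
print(alphabetical_display(record))
```

The same record could equally drive a hierarchical or graphical display; only the rendering function would change.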

 

 

Issue 4 (John Kunze, University of California, San Francisco)

The need for (and feasibility of developing) a standard that supports interoperability protocols, structures, and/or semantics applicable to thesauri.

* There are multiple kinds of protocols for different purposes, including searching, exchange (mirroring), updating/editing, and automatic subject generation.

* The critical aspect of Z39.50 is the common medium-level semantics, which can be very difficult at the application level.

* The Zthes profile can be used in an XML environment, not just with Z39.50.

* The problem is to make a common interface to a set of thesauri without "dumbing down" to the lowest common denominator.

* Cross references between thesauri, or metathesauri, need to be considered.

* Web browsers are not thesaurus-aware.

* Existing metadata formats do little with thesauri; if the number of values is large, they simply point to other authorities.

* The standards process in the Internet culture is a useful model. Anyone can publish an IETF draft that is then archived for six months. If supported, the draft can be progressed to a Request for Comments, which really is a standard of sorts.

* "Invisible" use of the thesaurus by a search system needs to be considered, where the search engine takes the user directly to results without showing the route used to get there. This invisibility confuses users.

Reports from the Breakout Sessions


Four breakout groups then met to discuss Term structure; Relationships; Vocabulary control; and Display. Recurrent themes from the breakout sessions are covered under the Summary of the Discussions section above; comments specific to each breakout session follow below.


TERM STRUCTURE (Stuart Nelson)

* Warrant, including maintenance practices, and domain should be stated clearly for a vocabulary.

* Thesaurus metadata should be addressed, with refinement of the definitions in Z39.19.

* An update model for changes is required, e.g., to show when a change is an error correction, slight change in meaning, split of one concept into two, etc.

* The notation of a classification is simply another term representing a given concept.
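
An update model of the kind described above might record each change together with its type, so that downstream users can decide how to react. The following is a minimal sketch; the class names, fields, sample terms, and the policy that only a concept split calls for review of existing indexing are assumptions for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class ChangeKind(Enum):
    """Kinds of change named in the discussion."""
    ERROR_CORRECTION = "error correction"
    MEANING_SHIFT = "slight change in meaning"
    CONCEPT_SPLIT = "split of one concept into two"


@dataclass(frozen=True)
class ChangeRecord:
    old_term: str
    new_terms: Tuple[str, ...]
    kind: ChangeKind


def needs_index_review(change):
    """Assumed policy: only a concept split forces a review of records
    indexed with the old term, since each must be assigned to one of
    the new concepts."""
    return change.kind is ChangeKind.CONCEPT_SPLIT


changes = [
    ChangeRecord("accomodation", ("accommodation",), ChangeKind.ERROR_CORRECTION),
    ChangeRecord("computers", ("desktop computers", "mainframe computers"),
                 ChangeKind.CONCEPT_SPLIT),
]

for change in changes:
    action = "review indexing" if needs_index_review(change) else "no review needed"
    print(f"{change.old_term} -> {', '.join(change.new_terms)}: "
          f"{change.kind.value} ({action})")
```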


RELATIONSHIPS (Diane Vizine-Goetz)

* A standard metadata schema is required.

* One or more schemata for relationships are needed. Different schemata may be needed for hierarchical and non-hierarchical vocabularies.

* There should be a core set of relationships, hierarchically organized. A minimal set should not be developed because it might vary by type of scheme. Each scheme should declare which relationship types it is using.

* Look beyond semantic relationships to other communities and types of relationships, e.g., bibliographic or syntactic. The recommendation is to look to other communities, not necessarily to broaden the standard.

* The schema could be a revision of Z39.19, an appendix, or a new standard. No recommendation made for this aspect.

* An open registry of relationship types should be built around the core set.


VOCABULARY CONTROL (Gail Hodge)

* Maintain consciousness of limitations of both human and financial resources.

* Include vocabulary management, not just control, and identify the management mechanism.

* Provide source information to aid in disambiguation of the senses in which a term is used.

* Permit different relationship lists because domain groups may need their own.

* Label and standardize data elements.

* Accommodate visual/graphical thesauri.

* Elements of time and space should be accommodated, e.g., via an audit trail for terms, relationships, and concepts.

* Extend the current standard rather than replacing it.


DISPLAY (Eric Johnson, University of Illinois)

* A variety of flexible displays should be accommodated, to meet the needs of many uses and users.

* Collect examples of displays in current use.

* When the present standard was being developed, extensive information on interfaces and displays was gathered. This information should be reviewed as background for the new work.

* The need for content of displays varies from minimal to literally everything in a record.

* Metrics (e.g., number of relationships) and metadata of the thesaurus should be available for display.


The Recommendations

The following recommendations were developed by consensus of the group at the end of the workshop.

* A new standard for "thesauri" is needed, and it should be a single standard.

* However, it should not be a standard for "electronic" thesauri. Essentially all thesauri are digital today, so "electronic" is superfluous.

* Furthermore, the standard should provide for a broader group of controlled vocabularies than those that fit the standard definition of "thesaurus." This includes, for example, ontologies, classifications, taxonomies, and subject headings, in addition to standard thesauri.

* The primary concern is with shareability (interoperability), rather than with construction or display. Therefore this new standard probably will not supersede Z39.19, but supplement it.

* The standard should focus on concepts, terms and relationships.

* The structure of the vocabulary has two aspects: structure of terms and relationships, and structure of the vocabulary as a whole.

* Displays illustrate and aid in testing adequacy of a standard, but have nothing to do with the standard per se. Examples may be included in an appendix.

 

Resources Noted

Following is a list of resources that were mentioned during the workshop as having value:

* The ALA committee work. Report: http://www.ala.org/alcts/organization/ccs/sac/rpt97rev.html. Taxonomy of relationships: http://www.ala.org/alcts/organization/ccs/sac/msrscu2.pdf

* Collection of thesauri and special classifications at the University of Toronto

* Group in Germany doing similar work with electronic thesauri

* Metadata registries group in IEEE.

* Sources of definitions include the current standard, a revision of Hans Wellisch's glossary which is underway, the ASIS Thesaurus, and ISO groups.

* NKOS (Networked Knowledge Organization Systems): An informal group with a listserv. The group meets several times a year in conjunction with appropriate professional meetings. To join NKOS send a message to Linda Hill at: lhill@alexandria.ucsb.edu. She will respond with information on participating in the group.


Copyright 1999 National Information Standards Organization