A Comparison of Term Clusters for Tokenized Words Collected from Controlled Vocabularies, User Keyword Searches, and Online Documents

By Nowick, Elaine A.; Travnicek, Daryl et al. | Library Philosophy and Practice, November 2010

Nowick, Elaine A.; Travnicek, Daryl; Eskridge, Kent; Stein, Stephen


Introduction

The goal of all libraries, whether housed in buildings or online, is to guide users to documents appropriate to their information needs. The library system ideally provides a bridge between the information need and the appropriate information source. Traditionally, the card catalog or a bibliographic index acted as such a bridge. Users could consult works assigned an appropriate subject from a controlled vocabulary or thesaurus such as the Library of Congress Subject Headings (LCSH). Listed under the subjects were call numbers or citations that directed the user to the physical location of the work. As library catalogs and journal article indexes moved to electronic formats, other points of intellectual access to documents became available. Keyword searching could match a user's term to an identical term anywhere in the online record. Subject headings became less important to users, although they still offered advantages over keywords in some cases. With controlled vocabularies, all works on a given topic have the same subject heading. Because controlled vocabulary terms are assigned by humans, synonymy and homonymy can be dealt with, and humans can deduce the implied but unstated subject of a document. In addition, controlled vocabularies can provide a hierarchical outline to serve as both a physical and mental map of a subject area. When users can see the choices of subjects, as with a card catalog, they can select among the available terms and browse through the catalog or the shelf. However, with online subject searching, users need to know the terms that the controlled vocabulary employs to locate their document. An exact match is needed to retrieve the citations. The opportunity to locate information by browsing is often missing in online library catalogs.
Other shortcomings of controlled vocabularies are that the assignment of subject headings is relatively slow, labor intensive, and inconsistent even among experienced catalogers; that the controlled vocabulary may use terms unfamiliar to users; and that the process of adopting standardized vocabularies is not responsive to rapidly changing fields.
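
The contrast between exact-match subject searching and keyword searching can be illustrated with a minimal sketch. The catalog records, titles, headings, and function names below are invented for illustration and are not taken from the study; real catalog systems are considerably more elaborate.

```python
# Hypothetical catalog records; titles and subject headings are invented.
records = [
    {"title": "Growing maize in dry climates", "subject": "Corn"},
    {"title": "Corn diseases and their control", "subject": "Corn"},
    {"title": "Sweet corn recipes", "subject": "Cooking (Corn)"},
]

def subject_search(term, records):
    """Return titles whose subject heading matches the term exactly."""
    return [r["title"] for r in records if r["subject"].lower() == term.lower()]

def keyword_search(term, records):
    """Return titles where the term appears as a word anywhere in the record."""
    hits = []
    for r in records:
        text = (r["title"] + " " + r["subject"]).lower()
        # Treat parentheses as word separators before splitting on whitespace.
        words = text.replace("(", " ").replace(")", " ").split()
        if term.lower() in words:
            hits.append(r["title"])
    return hits

by_heading = subject_search("Corn", records)    # both works cataloged under "Corn"
no_heading = subject_search("maize", records)   # empty: "maize" is not a heading
by_keyword = keyword_search("maize", records)   # only the record containing "maize"
```

In this toy data the heading "Corn" gathers the maize title together with the other work on the topic, but a user who types "maize" as a subject retrieves nothing, while a keyword search for "maize" finds only the one record that happens to contain that word.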

As documents began to be published online on the internet, some libraries attempted to assign controlled terms to them, but it soon became obvious that the task was overwhelming. Some kind of organization of the information on the internet would be helpful, since simple keyword matching of user search terms to words in a document through search engines often produces far more "hits" than any one individual could scan through. The percentage of relevant documents among those retrieved, the so-called precision of the search, can often be low despite the volume of results produced. Attempts to get authors to embed descriptive metadata in HTML or XML coding in their documents have not been highly successful (Brin and Page, 1998; Nowick, 1997).
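
As a small worked example of precision, the sketch below computes the fraction of retrieved documents that are relevant. The document identifiers and counts are made up for illustration only.

```python
def precision(retrieved, relevant):
    """Fraction of retrieved documents that are also relevant."""
    retrieved = set(retrieved)
    if not retrieved:
        return 0.0  # convention chosen here for an empty result set
    return len(retrieved & set(relevant)) / len(retrieved)

# Hypothetical search: 200 hits, of which only 4 are relevant to the user.
retrieved_ids = range(1, 201)
relevant_ids = {3, 17, 42, 150, 999}  # 999 is relevant but was not retrieved

p = precision(retrieved_ids, relevant_ids)  # 4 / 200
```

Here a search returning 200 hits with only 4 relevant ones has a precision of 0.02, which shows how a large result set can still be of little use to the searcher.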

Digital documents do offer new organizational methods that could fully automate or at least aid human catalogers and indexers. One such tool is text analysis. There are a number of programs available that will list all words in a digital document along with their frequencies. These "tokenized" terms can then be used as document descriptors in lieu of subject headings or can be used to suggest headings to human catalogers. Terms can also be statistically analyzed through cluster analysis or other methods to create an outline of subjects for an online library, allowing users to browse through the collection (Jain et al., 1999). There have been a number of studies focused on applying and refining cluster analysis of term lists generated by text analysis from document collections. Sebastiani (2003) has reviewed the state of automated text categorization up until 2002.
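
A minimal sketch of this kind of text analysis, using only the Python standard library, is shown below. The tokenization rule, the stopword list, and the sample sentence are assumptions made for illustration; they are not the programs or data used in the studies cited.

```python
import re
from collections import Counter

# Illustrative stopword list; real systems use much longer lists.
STOPWORDS = {"the", "a", "of", "and", "to", "in", "is", "for"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def term_frequencies(text, stopwords=STOPWORDS):
    """Count how often each non-stopword token occurs in the text."""
    return Counter(t for t in tokenize(text) if t not in stopwords)

# Invented sample document.
doc = "Irrigation of corn: corn yield and irrigation timing in the Great Plains."
freqs = term_frequencies(doc)

# The most frequent terms could serve as candidate document descriptors.
descriptors = [term for term, count in freqs.items() if count > 1]
```

The resulting term counts could then be passed to a clustering routine or shown to a cataloger as suggested headings; that step is not sketched here.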

Some of these studies have attempted to tie individual word clusters back to terms in a controlled vocabulary (Wu et al., 2006). Nikravesh (2008) suggested labeling clusters identified through cluster analysis of documents with rule-based concept terms from a controlled vocabulary. …
