Academic journal article Information Technology and Libraries

A Simple Scheme for Book Classification Using Wikipedia

Because the rate at which documents are being generated outstrips librarians' ability to catalog them, an accurate, automated scheme of subject classification is desirable. However, simplistic word-counting schemes miss many important concepts; librarians must enrich algorithms with background knowledge to escape basic problems such as polysemy and synonymy. I have developed a script that uses Wikipedia as context for analyzing the subjects of nonfiction books. Though a simple method built quickly from freely available parts, it is partially successful, suggesting the promise of such an approach for future research.


As the amount of information in the world increases at an ever-more-astonishing rate, it becomes both more important to be able to sort out desirable information and more daunting to catalog every document manually. It is impossible even to keep up with all the documents in a bounded scope, such as academic journals; there were more than twenty thousand peer-reviewed academic journals in publication in 2003. (1) Therefore a scheme of reliable, automated subject classification would be of great benefit.

However, there are many barriers to such a scheme. Naive word-counting schemes isolate common words, but not necessarily important ones. Worse, the words for the most important concepts of a text may never occur in the text.

How can this problem be addressed? First, the most characteristic (not necessarily the most common) words in a text need to be identified--words that particularly distinguish it from other texts. Second, some corpus that connects words to ideas is required--in essence, a way to automatically look up ideas likely to be associated with some particular set of words. Fortunately, there is such a corpus: Wikipedia.
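The difference between common words and characteristic words can be illustrated with a minimal Python sketch. It is not the script described in this article; the function names, the tiny sample texts, and the smoothed frequency-ratio score are all assumptions chosen for illustration. A word scores highly when it is relatively frequent in the text but rare in a set of background texts.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def characteristic_words(text, background_texts, top_n=5):
    """Rank the words of `text` by how much more frequent they are
    there than in `background_texts` (hypothetical scoring scheme)."""
    doc_counts = Counter(tokenize(text))
    bg_counts = Counter()
    for other in background_texts:
        bg_counts.update(tokenize(other))

    doc_total = sum(doc_counts.values())
    bg_total = sum(bg_counts.values())
    vocab = len(set(doc_counts) | set(bg_counts))

    def score(word):
        # Add-one smoothing keeps words absent from the background
        # from producing a division by zero.
        p_doc = (doc_counts[word] + 1) / (doc_total + vocab)
        p_bg = (bg_counts[word] + 1) / (bg_total + vocab)
        return p_doc / p_bg

    # Highest score first; ties are broken alphabetically.
    return sorted(doc_counts, key=lambda w: (-score(w), w))[:top_n]


text = "the whale and the sea and the whale hunt"
background = ["the cat and the dog", "the sea and the sky and the sun"]

most_common = Counter(tokenize(text)).most_common(1)[0][0]
distinctive = characteristic_words(text, background, top_n=2)
```

In this toy example the most common word is "the", while the highest-scoring characteristic words are "whale" and "hunt". A real system would need a far larger background collection.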

What, after all, is a Wikipedia article, but an idea (its title) followed by a set of words (the article text) that characterize that title? Furthermore, the other elements of my scheme were readily available. For many books, Amazon lists Statistically Improbable Phrases (SIPs)--that is, phrases that are found "a large number of times in a particular book relative to all Search Inside! books." (2) And Google provides a way to find pages highly relevant to a given phrase. If I used Google to query Wikipedia for a book's SIPs (using the query form "site:en.wikipedia.org SIP"), would Wikipedia's page titles tell me something useful about the subject(s) of the book?
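The query step can be sketched in Python as follows. This is a hedged illustration, not the script itself: `build_query`, `tally_titles`, and the `search` callable are hypothetical names, the SIPs are assumed to have been collected already, and tallying how often each page title recurs is one assumed way of summarizing the results. No real search API is called; the caller supplies any function that maps a query string to a list of page titles.

```python
from collections import Counter


def build_query(sip):
    """Build a query of the form "site:en.wikipedia.org SIP"."""
    return f"site:en.wikipedia.org {sip}"


def tally_titles(sips, search, per_query=3):
    """Count how often each Wikipedia page title appears among the
    top results for a book's SIPs.

    `search` is a caller-supplied callable (an assumption here) that
    takes a query string and returns a list of page titles.
    """
    tally = Counter()
    for sip in sips:
        titles = search(build_query(sip))[:per_query]
        tally.update(titles)
    return tally


# Canned results standing in for a real search service.
fake_results = {
    "site:en.wikipedia.org sperm whale": ["Sperm whale", "Whaling"],
    "site:en.wikipedia.org whale oil": ["Whale oil", "Whaling"],
}

tally = tally_titles(
    ["sperm whale", "whale oil"],
    lambda query: fake_results.get(query, []),
)
top_title = tally.most_common(1)[0][0]
```

With these canned results, "Whaling" is returned for both phrases and so rises to the top of the tally, which is the kind of signal the question above asks about.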

* Background

Hanne Albrechtsen outlines three types of strategies for subject analysis: simplistic, content-oriented, and requirements-oriented. (3) In the simplistic approach, "subjects [are] absolute objective entities that can be derived as direct linguistic abstractions of documents." The content-oriented model includes an interpretive step, identifying subjects not explicitly stated in the document. Requirements-oriented approaches look at documents as instruments of communication; thus they anticipate users' potential information needs and consider the meanings that documents may derive from their context. (See, for instance, the work of Hjorland and Mai. (4)) Albrechtsen posits that only the simplistic model, which has obvious weaknesses, is amenable to automated analysis.

The difficulty in moving beyond a simplistic approach, then, lies in the ability to capture things not stated, or at least not stated in proportion to their importance. Synonymy and polysemy complicate the task. Background knowledge is needed to draw inferences from text to larger meaning. These would be insuperable barriers if computers were limited to simple word counts. However, thesauri, ontologies, and related tools can help computers as well as humans in addressing these problems; indeed, a great deal of research has been done in this area. For instance, enriching metadata with Princeton University's WordNet and the National Library of Medicine's Medical Subject Headings (MeSH) is a common tactic, (5) and the Yahoo! …
