Academic journal article Journal of Management Information and Decision Sciences

Semi-Automated Identification of Faceted Categories from Large Corpora

Article excerpt


This paper describes FFID (Fast Facet Identifier), a system that can be used to compute facets from a corpus of documents. FFID uses a fast, simplified clustering algorithm that identifies hundreds of facet clusters from a corpus of hundreds of thousands of sentences in a matter of seconds. The automatic identification of facets may be a very powerful tool for designing better information retrieval systems.

The goal of information retrieval is to support people in searching for the information they need. Given an information problem, finding relevant (let alone high quality) documents is difficult, and the sheer amount of information available online makes it harder still. The size of the web is debatable (Markoff, 2005), but by now it must be at least 12,000 million pages. If each of these web pages were printed on a standard A4 sheet of paper (21 cm wide) and the sheets were laid side by side in a straight line, the line would wrap around the Earth about 60 times. This is a lot of information.

People learn about their information problem, and about the information resource they are using, through interaction with the resource. Human-computer interaction is the crucial phenomenon of the information retrieval process. Fast algorithms, hardware for storage and processing, and data and knowledge structures are important, but they are useless if we do not understand how humans interact with machines when looking for information. All the techniques we use must first take into account whom we are doing this for: the user. Users encounter several problems when they approach an information resource:

(1) Users seldom understand their information problem. Belkin and Croft (1992) define an information need as a problematic situation in which a person cannot attain their goals because of a lack of resources or knowledge.

(2) Users cannot articulate their problem; they need help constructing and refining queries. In his classic and still relevant paper, Taylor (1968) argues that users might have a vague information need that is not clear enough for them to articulate. Belkin's "anomalous state of knowledge" (ASK) hypothesis (Belkin, 1980) proposes that when users encounter a problematic situation, the resulting cognitive uncertainty makes it difficult for them to adequately express their information need.

(3) Users do not know whether what they are looking for is in the collection they are searching (Hearst, 1999).

(4) Users may not be aware that there are other interpretations for their questions (Venkatsubramanyan and Perez-Carballo, 2007).

(5) Users may not be familiar with the information resource's user interface (UI) (Xie and Cool, 2009).

(6) Users may not be able to recognize that an item is useful or relevant even after it is presented to them (Xie and Cool, 2009).

(7) The relevant documents may be lost in a large number of results returned by the information resource (Xie and Cool, 2009).

The problems listed above make it desirable to have an interface capable of supporting browsing or exploration. Or, as one of my colleagues used to tell his students: "browsing is what you want to do when you don't really know what you want!" (James Doig Anderson, personal communication).

Several studies (Hearst, 2006; Venkatsubramanyan and Perez-Carballo, 2007) suggest that interfaces that present results organized into categories or faceted hierarchies meaningful to users may help them make sense of their information problem as well as the information system itself.

There are several open problems in the design of UIs that present organized results, including how to generate useful groupings and how to design interfaces that can use them effectively. For a survey of such interfaces, see Venkatsubramanyan and Perez-Carballo (2007).

Two common ways of generating groupings are: document clustering and facet categorization. …
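
To make the idea of faceted grouping concrete, the following is a minimal sketch in Python. It is not the FFID algorithm; it assumes a hypothetical mini-collection whose facet values (topic, year, venue) have already been assigned by hand, and the names DOCS, FACETS, facet_counts and narrow are illustrative only. It shows how a faceted interface might count results along several independent dimensions and let the user narrow a result set by picking values.

```python
from collections import Counter, defaultdict

# Hypothetical mini-collection: facet values are assumed to be hand-assigned.
DOCS = [
    {"id": 1, "topic": "retrieval", "year": 2006, "venue": "journal"},
    {"id": 2, "topic": "retrieval", "year": 2007, "venue": "conference"},
    {"id": 3, "topic": "interfaces", "year": 2007, "venue": "journal"},
    {"id": 4, "topic": "interfaces", "year": 2009, "venue": "journal"},
]
FACETS = ("topic", "year", "venue")


def facet_counts(docs, facets=FACETS):
    """Count how many documents fall under each value of each facet."""
    counts = defaultdict(Counter)
    for doc in docs:
        for facet in facets:
            counts[facet][doc[facet]] += 1
    return counts


def narrow(docs, **selected):
    """Keep only the documents that match every selected facet value."""
    return [d for d in docs if all(d[f] == v for f, v in selected.items())]


if __name__ == "__main__":
    # Counts a browsing interface could show next to each facet value.
    for facet, values in facet_counts(DOCS).items():
        print(facet, dict(values))
    # Narrowing by two facets at once.
    print([d["id"] for d in narrow(DOCS, topic="interfaces", venue="journal")])
```

Unlike a clustering of documents, which places each document in a single automatically derived group, every document here is described along all facets at once, so a user can combine selections in any order.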
