Using a Reference Corpus as a User Model for Focused Information Retrieval

ABSTRACT. We propose a method for ranking short information nuggets extracted from a text corpus, using another, reliable reference corpus as a user model. We argue that such additional corpora are commonly available and used in a number of IR tasks, and apply the method to answering a form of definition questions. The proposed ranking method substantially improves the performance of our system.

Categories and Subject Descriptors

H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing--linguistic processing; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval--information filtering, search process; H.3.4 [Information Storage and Retrieval]: Systems and Software--question-answering (fact retrieval) systems; I.2.1 [Artificial Intelligence]: Applications and Expert Systems; I.2.7 [Artificial Intelligence]: Natural Language Processing

General Terms

Information Retrieval

Keywords: Question Answering, Information Retrieval


The area of Question Answering (QA) has attracted considerable research interest lately, both in the Information Retrieval (IR) community and among computational linguists. It is seen as one of the few applications to successfully combine techniques from Natural Language Processing and IR. The QA track at the annual Text REtrieval Conferences (TREC, [20]) has become an important factor in shaping and giving direction to QA research. Introduced in 1999, the track attracts a significant number of participants each year and provides a focal point for much modern QA research.

When the QA track at TREC was introduced, it focused on so-called "factoid" questions (typically having a short named entity as an answer), such as How many people live in Tokyo? or When is the Tulip Festival in Michigan?. As the track evolved, it was argued that this type of question does not accurately model the needs of real users of QA technology. In addition to named entities as answers, users often search for definitions of concepts, or for summaries of important information about them. As a result, in 2003 TREC introduced definition questions: questions for which the answer is not a single named entity but a list of information nuggets [19]. The TREC 2004 QA track took this a step further. Questions were now clustered in small groups organized around the same topic. For example, the topic Concorde included questions such as How many seats are in the cabin of a Concorde? and What airlines have Concordes in their fleets?. Finally, for every topic, the track guidelines required participants to supply "additional important information found in the corpus about the target, that was not explicitly asked." This last requirement has been dubbed "other" questions [20].

In our view, the task presented at the TREC 2004 QA track, and the introduction of the "other" questions, make a significant step towards more realistic user scenarios. According to our own analysis of web query logs, users ask far more "knowledge gathering" questions than factoid questions about specific facts. (1)

This new type of "other" questions puts more emphasis on the user aspect of the QA process, an issue that has mostly been neglected in the QA community. The TREC criteria for what counts as a good answer to a given question have so far been rather vague, but QA systems dealt with this vagueness fairly effectively for factoid questions. With "other" questions, where systems are required to return only important information, there is an implicitly assumed user model that can discriminate between important and unimportant facts about a topic. For example, for the topic Clinton, his birthday might be considered important, while the day of the week on which he left Mexico probably is not. In order to give reasonable responses to "other" questions, a QA system needs to model such preferences.

We present an approach for answering "other" questions using an explicit user model. …
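The excerpt does not spell out the scoring function, but the core idea of using a reference corpus as a user model can be sketched as follows: rank each candidate nugget by how well its terms fit a unigram language model estimated from the reference corpus. This is a minimal illustration, not the paper's actual implementation; the function name, the smoothing scheme, and the example data are all assumptions made for the sketch.

```python
from collections import Counter
import math

def rank_nuggets(nuggets, reference_docs):
    """Rank candidate nuggets by fit to a reference-corpus user model.

    Illustrative sketch only: builds a smoothed unigram model from
    reference_docs and sorts nuggets by average per-term log-probability,
    so nuggets whose vocabulary matches the reference corpus rank higher.
    """
    # Build a unigram term-count model of the reference corpus.
    ref_counts = Counter()
    for doc in reference_docs:
        ref_counts.update(doc.lower().split())
    ref_total = sum(ref_counts.values())
    vocab = len(ref_counts)

    def score(nugget):
        terms = nugget.lower().split()
        if not terms:
            return float("-inf")
        # Add-one smoothing: unseen terms are penalized, not zeroed out.
        logp = sum(
            math.log((ref_counts[t] + 1) / (ref_total + vocab + 1))
            for t in terms
        )
        # Normalize by length so short nuggets are not trivially favored.
        return logp / len(terms)

    return sorted(nuggets, key=score, reverse=True)
```

For instance, with a few reference sentences about the topic Concorde, a nugget sharing the reference vocabulary would outrank an off-topic one. A real system would of course use a stronger language model, stopword handling, and better smoothing; the point here is only the shape of the corpus-as-user-model ranking step.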