Academic journal article Journal of Digital Information Management

Hierarchical Topic Detection in Large Digital News Archives: Exploring a Sample Based Approach

ABSTRACT. Hierarchical topic detection (HTD) is a new task in the TDT 2004 evaluation program. It aims to organize a collection of unstructured news data in a directed acyclic graph (DAG) structure that reflects the topics discussed in the collection, ranging from rather coarse category-like nodes to fine singular events. The HTD task poses interesting challenges because its evaluation metric combines a travel cost component, reflecting the time needed to find the node of interest starting from the top node, and a quality cost component, determined by the quality of the selected node. We present a scalable architecture for HTD and compare several alternative choices for agglomerative clustering and DAG optimization in order to minimize the HTD cost metric. The alternatives are evaluated on the TDT3 and TDT5 test collections.

Categories and Subject Descriptors

H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval--Clustering; I.5.3 [Pattern Recognition]: Clustering--Algorithms, Similarity measures

Keywords

Information Retrieval, Hierarchical Topic Detection, TDT

1. INTRODUCTION

The Topic Detection and Tracking (TDT) project is an annual evaluation study organized by the National Institute of Standards and Technology (NIST).

TDT has included a Topic Detection task since its inception in 1996. In this task, systems are required to organize news stories into clusters corresponding to the topics discussed. The result can be regarded as a partition of the corpus, in which each news item is assigned to exactly one cluster representing a topic.
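The partition property described above can be checked mechanically. The following is a minimal sketch, assuming stories are represented by hashable identifiers and clusters by collections of those identifiers; the function name `is_flat_partition` and this data layout are illustrative assumptions, not part of the TDT evaluation tooling.

```python
from collections import Counter


def is_flat_partition(clusters, corpus):
    """Return True if every story in `corpus` appears in exactly one cluster.

    `clusters` is an iterable of collections of story ids (hypothetical layout);
    `corpus` is the collection of all story ids.
    """
    # Count how often each story id occurs across all clusters.
    counts = Counter(story for cluster in clusters for story in cluster)
    # Every corpus story must be covered, no unknown story may appear,
    # and no story may occur more than once.
    covers_corpus = set(counts) == set(corpus)
    no_duplicates = all(n == 1 for n in counts.values())
    return covers_corpus and no_duplicates
```

A clustering in which a story appears in two clusters, or in none, fails this check, which is exactly the restriction the flat task imposes.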

The systems are scored by comparing the system result to a manually composed ground truth. The cost of a (cluster) structure defines the 'distance' to the ground truth; a better structure has a lower cost. The ground truth is composed by annotators of the Linguistic Data Consortium and consists of manually labelled clusters containing news stories that discuss a particular topic. A topic is defined as an event or activity, along with all directly related events and activities. The topics are selected from a random sample of documents from the corpus. The annotation is search guided, i.e. the related stories are found using a search engine. Notably, the annotation of the most recently published corpus, TDT5, is incomplete: there is no guarantee that every story on each topic has been located [5]. The search for stories related to a particular topic was stopped after 3 hours, whereas in previous annotations the annotators decided when all on-topic stories had been found.

The Task Definition and Evaluation Plan of TDT 2004 [6] gives two reasons for introducing a new Hierarchical Topic Detection (HTD) task, both of which are shortcomings of the flat structure. First, a flat partitioned structure does not allow a single news item to belong to multiple topics.

Second, a flat structure does not allow multiple levels of granularity, i.e. topics cannot be introduced at various levels of detail.

The new HTD task enables stories to be assigned to multiple clusters. Furthermore, clusters may be subsets of, or overlap with, other clusters. The resulting structure must be characterizable as a DAG with a single root node. The root node represents the complete document collection, whereas child clusters further down the DAG represent more specific subsets comprising more finely detailed topics. For this initial trial evaluation, the task simplifies the treatment of time: it is treated as retrospective search, i.e. the documents may be processed in any order, in contrast to the old task, in which the items had to be processed in the order in which they were published [6].
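The structural requirements just described (a single root, no cycles, a root that covers the whole collection) can be illustrated with a small validity check. This is a sketch under stated assumptions: the hierarchy is given as a hypothetical `children` mapping from node name to child node names and a `members` mapping from node name to a set of story ids. Neither the names nor the representation come from the TDT 2004 evaluation plan.

```python
def validate_htd_structure(children, members, corpus):
    """Check that a cluster hierarchy is a single-rooted DAG over `corpus`.

    children: dict mapping node -> list of child nodes (assumed layout)
    members:  dict mapping node -> set of story ids in that cluster
    corpus:   collection of all story ids
    """
    child_nodes = {c for kids in children.values() for c in kids}
    nodes = set(children) | child_nodes

    # A single root: exactly one node that is nobody's child.
    roots = nodes - child_nodes
    if len(roots) != 1:
        return False
    root = next(iter(roots))

    # Depth-first search with three states to detect cycles.
    UNSEEN, ACTIVE, DONE = 0, 1, 2
    state = {n: UNSEEN for n in nodes}

    def visit(node):
        state[node] = ACTIVE
        for child in children.get(node, []):
            if state[child] == ACTIVE:  # back edge: a cycle
                return False
            if state[child] == UNSEEN and not visit(child):
                return False
        state[node] = DONE
        return True

    if not visit(root):
        return False
    # Every node must be reachable from the root.
    if any(s != DONE for s in state.values()):
        return False
    # The root represents the complete document collection.
    return members.get(root, set()) == set(corpus)
```

Note that a node with two parents passes the check, which is what lets clusters overlap and stories belong to several topics; only cycles, multiple roots, unreachable nodes, or an incomplete root are rejected.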

The metric used for the old Topic Detection task is not suitable for this new task. Allan et al. [1] discuss various methods for evaluating hierarchical cluster structures. The TDT 2004 HTD task is evaluated using the minimal cost method described in that paper. …
