Empirical Methods in Information Extraction

Article excerpt

Most corpus-based methods in natural language processing (NLP) were developed to provide an arbitrary text-understanding application with one or more general-purpose linguistic capabilities, as evidenced by the articles in this issue of AI Magazine. Eugene Charniak, Ng Hwee Tou, and John Zelle, for example, describe techniques for part-of-speech tagging, parsing, and word-sense disambiguation. These techniques were created with no specific domain or high-level language-processing task in mind. In contrast, my article surveys the use of empirical methods for a particular natural language-understanding task that is inherently domain specific. The task is information extraction. Generally, an information-extraction system takes as input an unrestricted text and "summarizes" the text with respect to a prespecified topic or domain of interest: It finds useful information about the domain and encodes that information in a structured form, suitable for populating databases. In contrast to in-depth natural language-understanding tasks, information-extraction systems effectively skim a text to find relevant sections and then focus only on these sections in subsequent processing. The information-extraction system in figure 1, for example, summarizes stories about natural disasters, extracting for each such event the type of disaster, the date and time that it occurred, and data on any property damage or human injury caused by the event.

[Figure 1 ILLUSTRATION OMITTED]
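To give a rough sense of the structured output such a system produces, the short Python sketch below models a filled template for the natural-disaster domain of figure 1. The class and slot names are hypothetical illustrations, not the slots of any actual MUC-style system; in practice the slots are fixed by the domain guidelines.

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class DisasterTemplate:
        """One filled template ("summary") for a natural-disaster story.
        Slot names here are illustrative only."""
        disaster_type: Optional[str] = None      # e.g., "earthquake", "flood"
        date_time: Optional[str] = None          # when the event occurred
        property_damage: List[str] = field(default_factory=list)
        human_injuries: List[str] = field(default_factory=list)

    # The extractor skims the story and fills only the slots it finds support for.
    summary = DisasterTemplate(
        disaster_type="earthquake",
        date_time="January 17, 1995",
        property_damage=["thousands of buildings damaged"],
    )
    print(summary)

A collection of such filled templates is what gets compared against the human-generated answer keys during evaluation.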

Information extraction has figured prominently in the field of empirical NLP: The first large-scale, head-to-head evaluations of NLP systems on the same text-understanding tasks were the Defense Advanced Research Projects Agency-sponsored Message-Understanding Conference (MUC) performance evaluations of information-extraction systems (Chinchor, Hirschman, and Lewis 1993; Lehnert and Sundheim 1991). Prior to each evaluation, all participating sites receive a corpus of texts from a predefined domain as well as the corresponding answer keys to use for system development. The answer keys are manually encoded templates--much like that of figure 1--that capture all information from the corresponding source text that is relevant to the domain, as specified in a set of written guidelines. After a short development phase,(1) the NLP systems are evaluated by comparing the summaries each produces with the summaries generated by human experts for the same test set of previously unseen texts. The comparison is performed using an automated scoring program that rates each system according to measures of recall and precision. Recall measures the amount of the relevant information that the NLP system correctly extracts from the test collection; precision measures the reliability of the information extracted:

recall = (# correct slot fillers in output templates) / (# slot fillers in answer keys)

precision = (# correct slot fillers in output templates) / (# slot fillers in output templates)
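To make the scoring concrete, the sketch below computes these two measures over sets of (slot, filler) pairs. The function and the data are hypothetical; the official scoring program additionally aligns output templates with key templates, handles optional slots, and awards partial credit for near matches.

    def score(output_fillers, key_fillers):
        """Compute recall and precision over (slot, filler) pairs (a sketch only)."""
        output = set(output_fillers)
        key = set(key_fillers)
        correct = output & key                       # fillers that match the answer key
        recall = len(correct) / len(key) if key else 0.0
        precision = len(correct) / len(output) if output else 0.0
        return recall, precision

    # Hypothetical example: the system recovers 2 of 3 key fillers and adds 1 spurious one.
    key = {("disaster_type", "earthquake"),
           ("date", "1995-01-17"),
           ("damage", "buildings destroyed")}
    out = {("disaster_type", "earthquake"),
           ("date", "1995-01-17"),
           ("damage", "bridge collapsed")}
    print(score(out, key))   # recall = 2/3, precision = 2/3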

As a result of MUC and other information-extraction efforts, information extraction has become an increasingly viable technology for real-world text-processing applications. For example, there are currently information-extraction systems that (1) support underwriters in analyzing life insurance applications (Glasgow et al. 1997); (2) summarize medical patient records by extracting diagnoses, symptoms, physical findings, test results, and therapeutic treatments to assist health-care providers or support insurance processing (Soderland, Aronow, et al. 1995); (3) analyze news wires and transcripts of radio and television broadcasts to find and summarize descriptions of terrorist activities (MUC-4 1992; MUC-3 1991); (4) monitor technical articles describing microelectronic chip fabrication to capture information on chip sales, manufacturing advances, and the development or use of chip-processing technologies (MUC-5 1994); (5) analyze newspaper articles with the goal of finding and summarizing business joint ventures (MUC-5 1994); and (6) support the automatic classification of legal documents (Holowczak and Adam 1997). …