Empirical Methods in Information Extraction

By Cardie, Claire | AI Magazine, Winter 1997

Most corpus-based methods in natural language processing (NLP) were developed to provide an arbitrary text-understanding application with one or more general-purpose linguistic capabilities, as evidenced by the articles in this issue of AI Magazine. Author Eugene Charniak and coauthors Ng Hwee Tou and John Zelle, for example, describe techniques for part-of-speech tagging, parsing, and word-sense disambiguation. These techniques were created with no specific domain or high-level language-processing task in mind. In contrast, my article surveys the use of empirical methods for a particular natural language-understanding task that is inherently domain specific. The task is information extraction. Generally, an information-extraction system takes as input an unrestricted text and "summarizes" the text with respect to a prespecified topic or domain of interest: It finds useful information about the domain and encodes the information in a structured form, suitable for populating databases. In contrast to in-depth natural language-understanding tasks, information-extraction systems effectively skim a text to find relevant sections and then focus only on these sections in subsequent processing. The information-extraction system in figure 1, for example, summarizes stories about natural disasters, extracting for each such event the type of disaster, the date and time that it occurred, and data on any property damage or human injury caused by the event.

[Figure 1 ILLUSTRATION OMITTED]

Information extraction has figured prominently in the field of empirical NLP: The first large-scale, head-to-head evaluations of NLP systems on the same text-understanding tasks were the Defense Advanced Research Projects Agency-sponsored Message-Understanding Conference (MUC) performance evaluations of information-extraction systems (Chinchor, Hirschman, and Lewis 1993; Lehnert and Sundheim 1991). Prior to each evaluation, all participating sites receive a corpus of texts from a predefined domain as well as the corresponding answer keys to use for system development. The answer keys are manually encoded templates--much like that of figure 1--that capture all information from the corresponding source text that is relevant to the domain, as specified in a set of written guidelines. After a short development phase,(1) the NLP systems are evaluated by comparing the summaries each produces with the summaries generated by human experts for the same test set of previously unseen texts. The comparison is performed using an automated scoring program that rates each system according to measures of recall and precision. Recall measures the amount of the relevant information that the NLP system correctly extracts from the test collection; precision measures the reliability of the information extracted:

recall = (# correct slot fillers in output templates) / (# slot fillers in answer keys)

precision = (# correct slot fillers in output templates) / (# slot fillers in output templates)
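
As a rough illustration of these two measures, the following Python sketch scores a set of output templates against answer keys. It is only a minimal sketch under stated assumptions, not the MUC scoring program: templates are represented here as dictionaries mapping a slot name to a set of fillers, output template i is assumed to correspond to answer key i, and a filler counts as correct only on an exact string match. The function name `score`, the slot names, and the example fillers are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Assumed representation: one template maps each slot name to a set of fillers.
Template = Dict[str, Set[str]]


def count_fillers(templates: List[Template]) -> int:
    """Total number of slot fillers across a list of templates."""
    return sum(len(fillers) for t in templates for fillers in t.values())


def score(outputs: List[Template], keys: List[Template]) -> Tuple[float, float]:
    """Return (recall, precision) for output templates against answer keys.

    Assumes outputs[i] is paired with keys[i] and that only exact string
    matches count as correct (both are simplifications).
    """
    correct = 0
    for out, key in zip(outputs, keys):
        for slot, fillers in out.items():
            # Fillers that appear in the same slot of the paired answer key.
            correct += len(fillers & key.get(slot, set()))

    n_key = count_fillers(keys)        # denominator for recall
    n_output = count_fillers(outputs)  # denominator for precision
    recall = correct / n_key if n_key else 0.0
    precision = correct / n_output if n_output else 0.0
    return recall, precision


# Hypothetical example values, invented for illustration only.
answer_keys = [{
    "disaster_type": {"tornado"},
    "date": {"4/3/97"},
    "damage": {"mobile homes", "Texaco station"},
}]
system_output = [{
    "disaster_type": {"tornado"},
    "damage": {"mobile homes", "barn"},
}]

recall, precision = score(system_output, answer_keys)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

In this example the system produces three fillers, two of which appear in the answer key's four fillers, so recall is 2/4 and precision is 2/3. The missing date lowers recall, and the spurious "barn" lowers precision.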

As a result of MUC and other information-extraction efforts, information extraction has become an increasingly viable technology for real-world text-processing applications. For example, there are currently information-extraction systems that (1) support underwriters in analyzing life insurance applications (Glasgow et al. 1997); (2) summarize medical patient records by extracting diagnoses, symptoms, physical findings, test results, and therapeutic treatments to assist health-care providers or support insurance processing (Soderland, Aronow, et al. 1995); (3) analyze news wires and transcripts of radio and television broadcasts to find and summarize descriptions of terrorist activities (MUC-4 1992; MUC-3 1991); (4) monitor technical articles describing microelectronic chip fabrication to capture information on chip sales, manufacturing advances, and the development or use of chip-processing technologies (MUC-5 1994); (5) analyze newspaper articles with the goal of finding and summarizing business joint ventures (MUC-5 1994); and (6) support the automatic classification of legal documents (Holowczak and Adam 1997). …
