An Overview of Empirical Natural Language Processing

In recent years, there has been a resurgence in research on empirical methods in natural language processing. These methods employ learning techniques to automatically extract linguistic knowledge from natural language corpora rather than require the system developer to manually encode the requisite knowledge. The current special issue reviews recent research in empirical methods in speech recognition, syntactic parsing, semantic processing, information extraction, and machine translation. This article presents an introduction to the series of specialized articles on these topics and attempts to describe and explain the growing interest in using learning methods to aid the development of natural language processing systems.

One of the biggest challenges in natural language processing is how to provide a computer with the linguistic sophistication necessary for it to successfully perform language-based tasks. This special issue presents a machine-learning solution to the linguistic knowledge-acquisition problem: Rather than have a person explicitly provide the computer with information about a language, the computer teaches itself from online text resources.

A Brief History of Natural Language Research

Since its inception, one of the primary goals of AI has been the development of computational methods for natural language understanding. Early research in machine translation illustrated the difficulties of this task with sample problems such as translating the word pen appropriately in "The box is in the pen" versus "The pen is in the box" (Bar-Hillel 1964). It was quickly discovered that understanding language required not only lexical and grammatical information but also semantic, pragmatic, and general world knowledge. Nevertheless, during the 1970s, AI systems were developed that demonstrated interesting aspects of language understanding in restricted domains, such as the blocks world (Winograd 1972) or answering questions about a database of information on moon rocks (Woods 1977) or airplane maintenance (Waltz 1978). During the 1980s, there was continuing progress on developing natural language systems using hand-coded symbolic grammars and knowledge bases (Allen 1987). However, developing these systems remained difficult, requiring a great deal of domain-specific knowledge engineering. In addition, the systems were brittle and could not function adequately outside the restricted tasks for which they were designed.

Partially in reaction to these problems, recent years have seen a paradigm shift in natural language research. The focus has moved from rationalist methods, based on hand-coded rules derived largely through introspection, to empirical, or corpus-based, methods, in which development is much more data driven and is at least partially automated by using statistical or machine-learning methods to train systems on large amounts of real language data. These two approaches are characterized in figures 1 and 2.

Empirical and statistical analyses of natural language were previously popular in the 1950s, when behaviorism was thriving in psychology (Skinner 1957) and information theory was newly introduced in electrical engineering (Shannon 1951). Within linguistics, researchers studied methods for automatically learning lexical and syntactic information from corpora, the goal being to derive an algorithmic and unbiased methodology for deducing the structure of a language. The main insight was to use distributional information, such as the environments in which a word can appear, as the tool for language study. By clustering words and phrases based on the similarity of their distributional behavior, a great deal could be learned about a language (for example, Kiss [1973], Stolz [1965], Harris [1962], Chatman [1955], Harris [1951], and Wells [1947]). Although the goal of this research was primarily to gain insight into the structure of different languages, this framework parallels that of modern empirical natural language processing: Given a collection of naturally occurring sentences as input, algorithmically acquire useful linguistic information about the language. …
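To make the distributional idea concrete, the following minimal sketch (an illustration of the general technique, not code from any of the cited work) characterizes each word by the neighboring words it co-occurs with in a tiny invented corpus and then compares words by the cosine similarity of their context vectors. The corpus, the one-word window, and the choice of similarity measure are all assumptions made for brevity.

import math
from collections import Counter, defaultdict

# A tiny, invented corpus (an assumption for illustration only).
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "a cat chased a mouse".split(),
    "a dog chased a ball".split(),
]

# For each word, count the words appearing within a one-word window around it.
contexts = defaultdict(Counter)
for sentence in corpus:
    for i, word in enumerate(sentence):
        for j in (i - 1, i + 1):
            if 0 <= j < len(sentence):
                contexts[word][sentence[j]] += 1

def cosine(u, v):
    # Cosine similarity between two sparse count vectors stored as Counters.
    dot = sum(u[key] * v[key] for key in u if key in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Distributionally similar words score high; dissimilar ones score low.
print(cosine(contexts["cat"], contexts["dog"]))  # 1.0: identical environments here
print(cosine(contexts["cat"], contexts["sat"]))  # 0.0: no shared environments here

Words that occur in similar environments, such as cat and dog above, receive high similarity scores and would be grouped together by any reasonable clustering procedure; this is the corpus-driven, introspection-free style of analysis that the early distributional work pioneered and that modern empirical methods scale up to large corpora.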