An Overview of Empirical Natural Language Processing
Brill, Eric, Mooney, Raymond J., AI Magazine
One of the biggest challenges in natural language processing is how to provide a computer with the linguistic sophistication necessary for it to successfully perform language-based tasks. This special issue presents a machine-learning solution to the linguistic knowledge-acquisition problem: Rather than have a person explicitly provide the computer with information about a language, the computer teaches itself from online text resources.
A Brief History of Natural Language Research
Since its inception, one of the primary goals of AI has been the development of computational methods for natural language understanding. Early research in machine translation illustrated the difficulties of this task with sample problems such as translating the word pen appropriately in "The box is in the pen" versus "The pen is in the box" (Bar-Hillel 1964). It was quickly discovered that understanding language required not only lexical and grammatical information but also semantic, pragmatic, and general world knowledge. Nevertheless, during the 1970s, AI systems were developed that demonstrated interesting aspects of language understanding in restricted domains such as the blocks world (Winograd 1972) or answers to questions about a database of information on moon rocks (Woods 1977) or airplane maintenance (Waltz 1978). During the 1980s, there was continuing progress on developing natural language systems using hand-coded symbolic grammars and knowledge bases (Allen 1987). However, developing these systems remained difficult, requiring a great deal of domain-specific knowledge engineering. In addition, the systems were brittle and could not function adequately outside the restricted tasks for which they were designed. Partially in reaction to these problems, in recent years, there has been a paradigm shift in natural language research. The focus has shifted from rationalist methods, based on hand-coded rules derived to a large extent through introspection, to empirical, or corpus-based, methods, in which development is much more data driven and is at least partially automated by using statistical or machine-learning methods to train systems on large amounts of real language data. These two approaches are characterized in figures 1 and 2.
[Figures 1 and 2. ILLUSTRATION OMITTED]
Empirical and statistical analyses of natural language were previously popular in the 1950s, when behaviorism was thriving in psychology (Skinner 1957) and information theory was newly introduced in electrical engineering (Shannon 1951). Within linguistics, researchers studied methods for automatically learning lexical and syntactic information from corpora, the goal being to derive an algorithmic and unbiased methodology for deducing the structure of a language. The main insight was to use distributional information, such as the environment a word can appear in, as the tool for language study. By clustering words and phrases based on the similarity of their distributional behavior, a great deal could be learned about a language (for example, Kiss, Stolz, Harris, Chatman, Harris, and Wells). Although the goal of this research was primarily to gain insight into the structure of different languages, this framework parallels that of modern empirical natural language processing: Given a collection of naturally occurring sentences as input, algorithmically acquire useful linguistic information about the language.
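The distributional idea can be made concrete with a small sketch. The Python below is an illustration only, not the procedure used in any of the studies cited above: it assumes that a word's "environment" is simply its immediate left and right neighbors, and it compares words by the cosine similarity of their neighbor counts. The function names (context_vectors, cosine, most_similar) and the toy corpus are hypothetical, chosen for this example.

```python
from collections import Counter, defaultdict
from math import sqrt


def context_vectors(sentences):
    """Map each word to counts of its immediate left and right neighbors."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        # Pad with boundary markers so every word has two neighbors.
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for i in range(1, len(tokens) - 1):
            vectors[tokens[i]][("L", tokens[i - 1])] += 1
            vectors[tokens[i]][("R", tokens[i + 1])] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * v[feature] for feature, count in u.items())
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def most_similar(word, vectors):
    """Return the other word whose environments most resemble those of `word`."""
    others = [w for w in vectors if w != word]
    return max(others, key=lambda w: cosine(vectors[word], vectors[w]))


# Toy corpus (an assumption for illustration): "cat" and "dog" occur in the
# same environments, so their context vectors coincide.
corpus = [
    "the cat sleeps",
    "the dog sleeps",
    "the cat eats",
    "the dog eats",
    "a cat runs",
    "a dog runs",
]
vectors = context_vectors(corpus)
nearest_to_cat = most_similar("cat", vectors)
```

On this toy corpus, "cat" and "dog" share every environment and so come out as each other's nearest neighbor, while "cat" and "sleeps" share none. Grouping words whose vectors are close is one simple way to approximate the word classes the distributional linguists were after; a realistic study would need a far larger corpus and a wider notion of context.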
Distributional linguistics research began to wane after Chomsky's (1959, 1957) influential work dramatically redefined the goals of linguistics. First, Chomsky made the point that a linguist should not merely be descriptive, discovering morphological, lexical, and syntactic rules for a language, but should turn instead to what he saw as more interesting problems, such as how language is learned by children and what features all languages share. These phenomena are far from apparent on the surface and, therefore, not amenable to a shallow corpus-based study. …