Getting computer systems to understand natural language input is a tremendously difficult problem and remains a largely unsolved goal of Al. In recent years, there has been a flurry of research into empirical, corpus-based learning approaches to natural language processing (NLP). Whereas traditional NLP has focused on developing hand-coded rules and algorithms to process natural language input, corpus-based approaches use automated learning techniques over corpora of natural language examples in an attempt to automatically induce suitable language-processing models. Traditional work in natural language systems breaks the process of understanding into broad areas of syntactic processing, semantic interpretation, and discourse pragmatics. Most empirical NLP work to date has focused on using statistical or other learning techniques to automate relatively low-level language processing such as part-of-speech tagging, segmenting text, and syntactic parsing. The success of these approaches, following on the heels of the success of similar techniques in speech-recognition research, has stimulated research in using empirical learning techniques in other facets of NLP, including semantic analysis--uncovering the meaning of an utterance.
In the area of semantic interpretation, there have been a number of interesting uses of corpus-based techniques. Some researchers have used empirical techniques to address a difficult subtask of semantic interpretation, that of developing accurate rules to select the proper meaning, or sense, of a semantically ambiguous word. These rules can then be incorporated as part of a larger system performing semantic analysis. Other research has considered whether, at least for limited domains, virtually the entire process of semantic interpretation might yield to an empirical approach, producing a sort of semantic parser that generates appropriate machine-oriented meaning representations from natural language input. This article is an introduction to some of the emerging research in the application of corpus-based, learning techniques to problems in semantic interpretation.
The task of word-sense disambiguation (WSD) is to identify the correct meaning, or sense, of a word in context. The input to a WSD program consists of real-world natural language sentences. Typically, a separate phase prior to WSD to identify the correct part of speech of the words in the sentence is assumed (that is, whether a word is a noun, verb, and so on). In the output, each word occurrence w is tagged with its correct sense, in the form of a sense number i, where i corresponds to the i-th sense definition of w in its assigned part of speech. The sense definitions are those specified in some dictionary. For example, consider the following sentence: In the interest of stimulating the economy, the government lowered the interest rate.
Suppose a separate part-of-speech tagger has determined that the two occurrences of interest in the sentence are nouns. The various sense definitions of the noun interest, as given in the Longman Dictionary of Contemporary English (LDOCE) (Bruce and Wiebe 1994; Procter 1978), are listed in table 1. In this sentence, the first occurrence of the noun interest is in sense 4, but the second occurrence is in sense 6. Another wide-coverage dictionary commonly used in WSD research is WORDNET (Miller 1990), which is a public-domain dictionary containing about 95,000 English word forms, with a rather refined sense distinction for words.
WSD is a long-standing problem in NLP. To achieve any semblance of understanding natural language, it is crucial to figure out what each individual word in a sentence means. Words in natural language are known to be highly ambiguous, which is especially true for the frequently occurring words of a language. For example, in the WORDNET dictionary, the average number of senses for each noun for the most frequent 121 nouns in English is 7. …