Academic journal article International Forum of Teaching and Studies

Corpus-Based Machine Translation: Its Current Development and Perspectives

Article excerpt

A Review of Corpus-based Machine Translation

Corpus and Machine Translation

A corpus is a large-scale database of linguistic material collected from real use and made available for computer retrieval in research. The first corpus was established at Brown University, U.S.A., in the late 1960s (Zhang & Zhang, 2010, p. 55). Much progress has been made in corpus research and application in the decades since. Current studies of parallel or multi-language corpora fall into three areas: the first is alignment technology for parallel language material, for which scholars have proposed various strategies and approaches and developed a number of alignment programs and tools; the second is the application of parallel language material, such as statistical machine translation, example-based machine translation, and the compilation of parallel-language dictionaries; the third concerns the design and management of parallel corpora, including material collection and coding (Chang, et al., 2003, p. 28).

The use of corpora in translation is one focus of corpus application research. Machine translation (MT) is a technology that translates one natural language, in written or spoken form, into another by means of computer programs (Zhao & Liu, 2010, p. 36). MT research began in the 1950s and entered a prosperous period at the end of the 1980s, marked by the practical use of many translation systems in various fields. A typical example is TAUM-METEO, an English-French translation system developed by the University of Montreal, Canada, in 1976, which provides high-quality translation of weather forecasts (Shao, 2010, p. 28). A typical MT system adopts a transfer-based translation strategy, which consists of three procedures: 1) analyzing the source language and forming a representation of it; 2) transforming the source-language representation into a target-language representation; 3) generating the target-language translation from the target-language representation (He, 2007, p. 191). Traditional MT has two defects. First, it treats the word as its basic translation unit: the machine segments source-language sentences into words, converts them into target-language words, and then connects those words according to the grammatical rules of the target language. Second, it pays little attention to context. Studies of corpus-based MT attempt to overcome these defects of traditional MT systems and to improve their efficiency and accuracy.
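The three transfer-based procedures described above can be illustrated with a minimal sketch. It is not taken from He (2007) or from any real MT system: the three-word lexicon, the part-of-speech tags, and the single adjective-noun reordering rule are hypothetical and chosen only to make the analysis, transfer, and generation stages visible. A word-for-word function is included for comparison with the first defect of traditional MT noted above.

```python
# Toy transfer-based pipeline: analysis -> transfer -> generation.
# LEXICON, TAGS, and the reordering rule are hypothetical illustrations,
# not part of any real MT system.

LEXICON = {"the": "le", "cold": "froid", "wind": "vent"}
TAGS = {"the": "DET", "cold": "ADJ", "wind": "NOUN"}


def analyze(sentence):
    """Procedure 1: build a source representation as (word, tag) pairs."""
    tokens = sentence.lower().split()
    return [(word, TAGS.get(word, "UNK")) for word in tokens]


def transfer(source_repr):
    """Procedure 2: convert the source representation to a target one."""
    # Lexical transfer: replace each source word with its target word.
    target = [(LEXICON.get(word, word), tag) for word, tag in source_repr]
    # Structural transfer: English ADJ + NOUN becomes French NOUN + ADJ.
    i = 0
    while i < len(target) - 1:
        if target[i][1] == "ADJ" and target[i + 1][1] == "NOUN":
            target[i], target[i + 1] = target[i + 1], target[i]
            i += 2
        else:
            i += 1
    return target


def generate(target_repr):
    """Procedure 3: produce the target sentence from its representation."""
    return " ".join(word for word, _ in target_repr)


def translate(sentence):
    """Run the three procedures in order."""
    return generate(transfer(analyze(sentence)))


def word_for_word(sentence):
    """Word-level substitution with no structural transfer, for comparison."""
    return " ".join(LEXICON.get(w, w) for w in sentence.lower().split())


if __name__ == "__main__":
    print(translate("the cold wind"))      # le vent froid
    print(word_for_word("the cold wind"))  # le froid vent
```

In this toy case the word-for-word output keeps the English word order, while the transfer step reorders the adjective and noun before generation.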

After more than 50 years of development, MT systems perform useful functions in some fields. However, current systems have not yet reached the expected translation quality. In its earlier period, MT research was conducted from the viewpoint of linguistics, producing MT systems based on linguistic rules such as lexical rules, syntactic analysis rules, transformation rules, and target-language generation rules. Because these rules were summarized and developed from the experience of linguists, they have some deficiencies. For instance, writing the rules by hand demands a heavy workload, and the rules are too subjective to remain consistent (Wang, 2003, p. 33). Since 1989, MT has entered a new stage in which corpus methods have been introduced alongside rule-based technology, including statistical and example-based methods and the conversion of corpora into linguistic knowledge banks through language material processing (Feng, 2010, p. 28). Recent years have seen prominent achievements in MT systems.

Corpus Used in Machine Translation

Three kinds of corpus are relevant to MT: the parallel corpus, the multi-language corpus, and the comparable corpus. A parallel corpus collects original texts in one language together with their translations in another language. …
