Academic journal article K@ta

Construction of the Chinese Learners' Parallel Corpus of Japanese and Its Preliminary Analysis

Article excerpt

Abstract: This study introduces the project to construct a Chinese learners' corpus (LC) of Japanese at Dalian University of Technology (DUT), and details the LC construction, the development of the DUT Corpus Linguistics Tools, and their contribution to the teaching of Japanese as a second language. The distinguishing characteristic of the LC is its parallel form, pairing learners' Japanese texts with their Chinese translations, which enables a comprehensive analysis of the influence of Chinese (L1) on Japanese (L2). We have also made a preliminary analysis of the errors the corpus contains.

Key words: Chinese learners' corpus of Japanese; parallel corpus; tagging tool; Sino-Japanese

Recognizing the significance of specifying learners' first languages as well as the target languages in constructing LC, researchers have built several L1- and L2-specified LC. In China in particular, several projects to construct Chinese learners' corpora of English (Gui & Yang, 2003), including the Chinese portion of the International Corpus of Learner English, the HKUST Corpus, and others, are under way (Yang, 2002). Meanwhile, learners' corpora of Japanese have been constructed mainly in Japan (Ooso & Takizawa, 2003), but the first languages of their learners are fairly diverse. Against this background, this study focuses on a learners' corpus of Japanese written by Chinese students in order to clarify aspects of L1-to-L2 interference.


Japanese compositions written by 412 Chinese university students in their third or fourth year were collected; they contain about 13,800 sentences. The compositions were of two styles, narrative and expository. Each student selected one of four topics: (1) Japanese for Me, (2) My Hometown in Mind, (3) Computers and Language Learning, and (4) Economic Development and Environmental Problems. The students wrote the Japanese sentences first and then translated them into Chinese, so that a parallel learners' corpus could be built. From a pedagogical point of view, writing the Japanese first and translating into Chinese afterwards is important for avoiding unintended transfer from L1. The translation also helps avoid misunderstanding the meaning the learner intended in each sentence, and only this point was explained to the students as the reason for the translation.

All the compositions were handwritten so that we could capture data on character errors, especially the use of Simplified Chinese characters (jiantizi) or other character forms that are not used in Japanese. The data were digitized using the following procedure: (1) separate the whole body of each composition into sentences; (2) number each sentence with the initial 'J' for Japanese or 'C' for Chinese, giving equivalent sentences the same number; and (3) save each composition as a text file (.txt).
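The numbering scheme in this procedure can be illustrated with a minimal Python sketch. It is not the project's actual workflow, which started from handwritten texts. The function names, the three-digit ID format, the tab separator, and the assumption that sentences end in 。, ！ or ？ are all hypothetical choices made for illustration.

```python
import re

# Assumption: a sentence is a run of text ending in 。, ！ or ？
# (or the unterminated remainder of the text).
_SENTENCE = re.compile(r"[^。！？]+[。！？]?")


def split_sentences(text):
    """Return the sentences in text, stripped of surrounding whitespace."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]


def number_parallel(japanese_text, chinese_text):
    """Pair Japanese and Chinese sentences and give each pair one shared number.

    Returns lines such as 'J001<TAB>...' and 'C001<TAB>...'. The ID format
    is a hypothetical choice; the source only specifies the 'J'/'C' initials
    and a shared number for equivalent sentences.
    """
    ja = split_sentences(japanese_text)
    zh = split_sentences(chinese_text)
    if len(ja) != len(zh):
        raise ValueError(
            f"sentence counts differ: {len(ja)} Japanese vs {len(zh)} Chinese"
        )
    lines = []
    for number, (j, c) in enumerate(zip(ja, zh), start=1):
        lines.append(f"J{number:03d}\t{j}")
        lines.append(f"C{number:03d}\t{c}")
    return lines


def save_composition(path, lines):
    """Write the numbered lines to a UTF-8 text file (.txt)."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")
```

Raising an error when the sentence counts differ is a deliberate choice in this sketch: a mismatch means the one-to-one alignment has to be checked by hand rather than guessed.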

Development of DUT Corpus Linguistics Tools

The next step in constructing the LC is tagging, which attaches the learners' background information and the error information to each sentence. An original set of tools (DUT Corpus Linguistics Tools) was developed to tag the error information and conduct a preliminary statistical analysis. In the tagging window (Fig. 1), after selecting the composition files, we can enter the learners' basic background information and the data for each error and store them in the database file (.mdb). Once these steps are complete, XML files (.xml) describing all the information are generated automatically for each composition.
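To make the idea of a per-composition XML file concrete, the following Python sketch builds one from learner background data and tagged sentences. The excerpt does not give the schema produced by the DUT tools, so every element name, attribute name, and field used here (`composition`, `learner`, `sentence`, `error`, `type`) is a hypothetical placeholder.

```python
import xml.etree.ElementTree as ET


def build_composition_xml(learner, sentences):
    """Serialize one composition to an XML string.

    learner:   dict of background attributes (hypothetical keys).
    sentences: list of dicts with 'id', 'japanese', 'chinese' and an
               optional 'errors' list of {'type': ..., 'text': ...}.
    The element and attribute names are illustrative only; they are not
    the schema used by the DUT Corpus Linguistics Tools.
    """
    root = ET.Element("composition")
    info = ET.SubElement(root, "learner")
    for key, value in learner.items():
        info.set(key, str(value))
    for sent in sentences:
        node = ET.SubElement(root, "sentence", id=sent["id"])
        ET.SubElement(node, "japanese").text = sent["japanese"]
        ET.SubElement(node, "chinese").text = sent["chinese"]
        for err in sent.get("errors", []):
            # Each error keeps its tag and the erroneous span of text.
            ET.SubElement(node, "error", type=err["type"]).text = err["text"]
    return ET.tostring(root, encoding="unicode")
```

Keeping the Japanese sentence, its Chinese counterpart, and the error annotations under one `sentence` element means the parallel pair stays together when the file is read back.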

Based on the information stored in the database file, a preliminary statistical analysis (Fig. 2) and simple queries of errors can be carried out with these tools. This analysis reveals the general tendency of errors and makes it easy to retrieve samples of each error type.
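The two operations described here, counting errors by type and retrieving samples of one type, can be sketched in Python as follows. This is an illustration under stated assumptions, not the tools' implementation: an in-memory SQLite database stands in for the .mdb file, and the table layout, column names, and any error-type labels passed in are hypothetical.

```python
import sqlite3


def load_errors(rows):
    """Create an in-memory database holding (sentence_id, error_type, text) rows.

    SQLite is a stand-in for the project's .mdb database file, and the
    table and column names are hypothetical.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE errors (sentence_id TEXT, error_type TEXT, text TEXT)"
    )
    conn.executemany("INSERT INTO errors VALUES (?, ?, ?)", rows)
    return conn


def error_frequencies(conn):
    """Return (error_type, count) pairs, most frequent first."""
    cursor = conn.execute(
        "SELECT error_type, COUNT(*) AS n FROM errors "
        "GROUP BY error_type ORDER BY n DESC, error_type"
    )
    return cursor.fetchall()


def samples_of(conn, error_type):
    """Return (sentence_id, text) pairs tagged with the given error type."""
    cursor = conn.execute(
        "SELECT sentence_id, text FROM errors "
        "WHERE error_type = ? ORDER BY sentence_id",
        (error_type,),
    )
    return cursor.fetchall()
```

The frequency query gives the general tendency of errors, and the sample query returns the sentence IDs needed to look up the corresponding 'J' and 'C' sentences.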


To carry out the tagging process, we have constructed a preliminary error tagset for Chinese learners' Japanese compositions (Error Tagset for DUT CJLC, ver. …
