Academic journal article English Language Teaching

A Computational Investigation of Cohesion and Lexical Network Density in L2 Writing


This study used a new computational linguistics tool, the Coh-Metrix, to investigate and measure the differences in cohesion and lexical network density between native speaker and non-native speaker writing, as well as to investigate L2 proficiency level differences in cohesion and lexical network density. This study analyzed data from three corpora with the Coh-Metrix: the International Corpus of Learner English (ICLE) as an L2 higher proficiency group, the Louvain Corpus of Native English Essays (LOCNESS) as a native speaker baseline, and a collected EFL corpus from Indonesia for the L2 lower proficiency data.

Statistical investigation of the Coh-Metrix results revealed that five out of six Coh-Metrix variables used in this study did not detect proficiency level differences in L2, but that the tool was consistently able to distinguish between L2 and native speaker writing. Specifically, L2 writing contains more argument overlap, more semantic overlap, more frequent content words, fewer abstract verb hyponyms and less causal content than native speaker writing.

Keywords: cohesion, NLP, second language writing, corpus linguistics, computational linguistics

1. Introduction

Learning to write extended discourse in a second language is a difficult skill for second language learners. It is also a fundamentally important skill for many non-native speakers who need to develop a command of written English for academic and professional success (Silva, 1993). Language mechanics such as orthography, punctuation and lexical selection have long been established as areas of difficulty for L2 writers, and as salient features that mark L2 writing as non-native-like (White, 1987). Regarding advanced academic prose, Cumming (2001), in a review of twenty years of empirical studies on second language writing, identified the most difficult developmental areas as the complex syntax, rhetorical strategies and specificity of vocabulary needed for the academic register.

For teachers and assessors, L1-L2 differences in language mechanics are relatively easy to discern and measure objectively (Bardovi-Harlig & Bofman, 1989). It is more difficult, however, to investigate and quantitatively measure L1-L2 differences in textual cohesion and lexical network density, and it is unclear how these develop in non-native speakers as proficiency increases. These areas need research attention. Cohesion is a crucial skill that L2 writers need for academic success (Mirzapour & Ahmadi, 2011), as without the ability to create cohesion through the appropriate use of language, texts are rendered difficult to follow (Halliday & Hasan, 1976). Being able to define how native and non-native speakers differ with regard to cohesion and lexical network density, as well as knowing how these features differ across L2 proficiency levels, would be beneficial for understanding L2 writing development, for designing instruction, and for validating writing tests.

1.1 Computational Tools

Computational tools such as ETS's eRater (Attali & Burstein, 2006) and other corpus linguistics software are able to investigate L2 lexical differences and mechanics; however, until recently no computational system was able to specifically analyse cohesion and lexical network density. Advances in computational linguistics and natural language processing (NLP) have made available a comprehensive new software tool, the Coh-Metrix, which has the potential for a deep-level quantitative investigation of textual cohesion and lexical network density in second language writing. The system draws together research from a variety of disciplines including discourse analysis, psycholinguistics, corpus linguistics and natural language processing, making use of previous computational systems by incorporating WordNet (Miller et al., 1995), the CELEX database (Baayen, Piepenbrock, & Gulikers, 1995), the MRC psycholinguistics database (Coltheart, 1981), Latent Semantic Analysis, as well as a range of other part-of-speech taggers, lexicons, and semantic interpreters. …
