Behind the Word Clouds: Electronic Text, Machine Reading and Corpus Linguistics: Tim Shortis Argues That Corpus Linguistics Is Changing Knowledge about Language, and Explains the Theory Behind It and Its Potential for the Classroom

Article excerpt

A revolution in knowledge about language

Hyperbole comes easily in the excited discourse around the impact of ICTs and their ongoing penultimate promises. So it is with caution that I suggest that a quiet revolution is under way in what counts as knowledge about language and meaningful reading, and that little of this has so far permeated what is done in school English lessons. This may be about to change as we come face to face with ever-larger collections of electronically mediated text and with examples of new methods for reading them.

The new and ever-larger collections of resources are apparent; the means of reading them less so, although recent higher education research may point a way. UK public library memberships now offer free home access to the full Oxford English Dictionary, along with digital archives of contemporary and historical newspapers. Agencies such as the National Archives, the Old Bailey, JISC and the British Library have all put significant collections of searchable text online. Some of these are plain text, some facsimiles, and some both. For example, the Oxford University/JISC collaboration's magnificent World War 1 collection offers 5,000 textual artefacts, mainly in facsimile form but with searchable words-only transcripts. All these collections involve engaging with a different order of textual scale and will require kinds of literacy different from that of being curled up in your chair reading a book under the light of an Anglepoise, although that will, of course, remain important. The question, then, is how we, as English teachers and as a professional community of practice specialising in the learning of literacy, are to respond to these changes in the representational resources of the written word. What is our responsibility as such archives become available to students and future citizens? That responsibility includes our role in equipping the people in our care to understand and resist the abuse of the associated technologies of machine reading in its aggressive forms: data mining for commercial and political exploitation, for example, with its infringements of privacy.

The data-driven study of very large collections of electronic text, assisted by the machine-reading capacities of computers, has become known as corpus linguistics. It has transformed understanding of core domains of language study, and even of the concepts thought to be required to study them. Linguists have reconsidered the relationship of speech to writing (Biber 1991); the actual nature of informal spoken interaction, including its 'grammar' and routine creativity (see Carter and McCarthy 1997, Carter 2004); gendered patterns in computer-mediated communication (Herring 1996, or her homepage); the relationship of text messaging to spoken language (Caroline Tagg, forthcoming); and the histories of languages, including the actual levels of standardisation over time. The same approach has informed forensic work, such as the identification of the Unabomber (see Coulthard and Johnson 2007). Corpus linguistics has tested core concepts about language to the point of their destruction, sometimes having to reach for new terms and concepts to make sense of what is being found.


While the juggernaut of the National Strategy has been 'rolling out' the grammatical terminology of the 1960s in schools in England (Alexander 2007), linguists have been asking first questions about the role of grammar and its structuring. This has even led to 'a completely new theory of language based on how words are used in the real world': lexical priming (Hoey 2005). Lexical priming argues, counter to the received wisdom of countless linguistic authorities, that 'vocabulary is complexly and systematically structured and that grammar is an outcome of this lexical structure'. Machine reading has enabled linguists to perceive such deeper patterns in language, among them 'collocation': the property of language whereby two or more words seem to appear frequently in each other's company (e. …