What is corpus linguistics?
Corpus analysis is now well established in higher education as a way of working with language. Working with language is at the heart of our business as English teachers, whether we prefer to find the stuff in literary or everyday contexts, so it follows that we might want to know something about it. Add the fact that this approach makes powerful and meaningful use of ICT for learning (not just typing), and was developed specifically within and for our subject, and we must surely be onto something good. But what is it? Corpus is a Latin loan word meaning 'body', connected with the word 'corpse'. Though 'corpus analysis' might consequently sound like a Year 9 dramatic adaptation of Robert Louis Stevenson's The Body Snatcher, in HE linguistics it is applied in a much less literal way, to mean a body of data. Too much blood in the Year 9 discourse, too little sense of human life, love and literature in the linguistics discourse: what does this mean?
A body of data is a collection of 'text' of any sort: conventional written texts of any kind, or transcripts of any variety of spoken language. It could be a corpus of transcripts of domestic conversations; the complete works of Shakespeare; 19th-century letters; or newspaper articles from the year 2008. Major 'corpora' such as the British National Corpus (BNC) or the Cambridge and Nottingham Corpus of Discourse in English (CANCODE) consist of millions of words: CANCODE weighs in at five million words, the BNC at 100 million. However, size is not a defining feature: corpus studies have also been carried out with much smaller bodies of data, for example a corpus of twenty Christmas charity fundraising appeals (Blake and Shortis, 2007). Unsurprisingly, the appropriate size depends on the purpose of the investigation.
Next, analysis: corpus analysis is a set of methods for interrogating the chosen body of data. Because the body of data can run to millions of words, this method uses ICT to perform kinds of very rapid processing that would otherwise be too time-consuming and laborious to be worthwhile. So the corpus consists of machine-readable electronic versions of the texts, and a software program is then used to carry out various kinds of search and retrieval. These searches can be for specific words, for phrases, or for grammatical 'chunks'.
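To make the idea of search and retrieval concrete, here is a minimal sketch in Python of the kind of processing such software performs: finding every occurrence of a word in a corpus and displaying it in context (a 'keyword in context', or KWIC, concordance). The tiny corpus and the search word are invented for illustration; real corpus tools work on the same principle, only at the scale of millions of words.

```python
# A toy keyword-in-context (KWIC) concordance: find every occurrence
# of a word in a (tiny, invented) corpus and show it with a few words
# of context on either side.

corpus = (
    "the cat sat on the mat and the dog watched the cat "
    "while the cat ignored the dog entirely"
)

def concordance(text, keyword, window=3):
    """Return each hit with `window` words of context on either side."""
    words = text.split()
    hits = []
    for i, w in enumerate(words):
        if w == keyword:
            left = " ".join(words[max(0, i - window):i])
            right = " ".join(words[i + 1:i + 1 + window])
            hits.append(f"{left} [{w}] {right}")
    return hits

for line in concordance(corpus, "cat"):
    print(line)
```

Lining up every hit of a word with its immediate neighbours like this is what lets corpus linguists spot patterns of collocation that no human reader could reliably notice across millions of words.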
As an approach to language study, the idea of corpus analysis dates back to the 1930s and the work of J.R. Firth (see Tim Shortis's article). However, it did not (indeed could not) take off until the widespread availability of personal computers and faster processing speeds made it technically possible in the 1980s and increasingly practicable in the 1990s. To give an idea of how widespread the approach now is in higher education, 20% of the individual papers given at the 2007 conference of the British Association for Applied Linguistics referred to corpus methods. Corpus linguistics is available as a discrete module in some language/linguistics degree programmes, for example at the universities of Lancaster and Liverpool. Given that some language/linguistics graduates go on to become English teachers, there will be teachers who are familiar with all this, though they may be keeping it under their hats, uncertain how it might relate to school English.
Corpus in the classroom
The relationship between corpus analysis as an HE research method and as a classroom pedagogical practice is not yet a developed one. It has been mobilised more effectively in Teaching English as a Foreign Language, a context in which commercial imperatives and the engagement of applied linguists have provided more resources for research and development than has been the case in mainstream secondary education. There, corpus thinking has led a quiet revolution in the making of dictionaries and grammars, which now draw routinely on large corpora of everyday spoken and written English. …