The Corpus as a Research Domain
The design and development of machine-readable corpora and tools for their analysis has been a major preoccupation of corpus linguistics for more than three decades. During this period there have been massive changes in the capacity and speed of computers, an increasing use of microcomputers with CD-ROM as the basis for storage, the development of new and faster means of text capture through optical scanning, and the development of more sophisticated software packages for the analysis of corpora. At the same time there have been continuing issues in the design and use of corpora. How big should a corpus be to provide a valid and reliable picture of how a language is structured and used? What aspects of a language can be validly and reliably described using a corpus of a particular size? Can a corpus be designed to be a representative sample of a language 'as a whole'? Should particular genres be represented in a corpus and, if so, which genres? What are the respective roles of automatic and manual analysis in corpus-based research?
As Quirk (1992) has noted, machine-readable corpora have grown in size from the one-million-word standard of the Survey of English Usage (SEU) Corpus and the Brown Corpus of the early 1960s to the 100-million-word British National Corpus (BNC) of the 1990s. The most recent developments, including vast monitor corpora of potentially unlimited size and the International Corpus of English (ICE), promise to open up new directions in the use of corpora.
Although there have been numerous corpus-based studies of English completed since 1960, the changes in technology mentioned above and issues in the design and development of corpora have in a sense been necessary prerequisites for systematic corpus-based research. It is just such systematic research, involving comprehensive lexical and grammatical description and comparisons across genres, registers, and major regional varieties, which the ICE project will encourage and facilitate. The purpose of this paper is to outline some matters which might be considered part of a research agenda for the ICE corpus.
To a considerable extent, the size, nature, and structure of machine-readable corpora and the associated software determine the kind of linguistic research which can be undertaken. Obviously, for example, however big a corpus may be, if it includes only written texts, it cannot reasonably be used as a basis for the description of the 'language as a whole' nor for lexical, grammatical, or discourse characteristics of