The Design of the Corpus
The International Corpus of English is at present envisaged to comprise eighteen national or regional corpora compiled in countries where English is spoken as a first language or as a second official language (Greenbaum, 1992). While each component corpus can exist independently as a valuable resource for investigations into individual national or regional varieties, the value of the corpora is enhanced by their compatibility with each other. They have been compiled according to a common design, using the same criteria for text selection and the same time-frame. This level of standardization ensures that the corpora can be used for direct comparative studies of varieties of English throughout the world. The design of the corpus emerged after extensive discussions among participants in the project, which centred on the range of text categories to be included and on the textual and social variables to be taken into account in the selection and documentation of samples (Leitner, 1992; Peters, 1991; Schmied, 1990).
Each national or regional corpus consists of 500 texts of approximately 2,000 words each, giving a total of approximately one million words. Many of the texts are composite: they consist of two or more different samples of the same type, combined to make up 2,000 words. Because almost all the 2,000-word texts in some categories, such as business letters and press news reports, are composite, every corpus contains more than 500 samples. As far as possible, the texts are self-contained. Extracts from printed works are taken from the beginning of a chapter, and spoken extracts begin at the start of a topic and at a new speaker turn. The end of a text coincides with a paragraph ending in printed texts, and with the end of an exchange or topic in spoken ones. We have omitted texts which contain large numbers of quotations, foreign words, or mathematical formulae. We have also avoided certain types of creative writing in which the use of English is intentionally idiosyncratic. For the same reason, poetry has been excluded from the corpus. Before looking at the categories of texts in the corpus, I will discuss the general criteria for text selection.