Developing the ICE Corpus Utility Program
NICK PORTER and AKIVA QUINN
Soon after the Survey of English Usage initiated the project, it became clear that the International Corpus of English needed a general corpus processing and analysis tool. In its broadest conception, this tool was to cover the central requirements for corpus preparation and study: corpus annotation, markup conversion and filtering, searching, concordancing, statistical analysis, subcorpus information, and subcorpus selection. The system would primarily target ICE corpora, but would also provide a range of general corpus utilities equally applicable to other corpora.
The requirements for ICECUP were determined by the design and content of ICE and by the corpus utilities that were to be provided. The International Corpus exists primarily to allow comparison between national and regional varieties of English. In addition, each component corpus is structured according to medium (spoken, manuscript, and printed) and genre (news, business, natural sciences, novels, and so on). Describing and quantifying linguistic features within national, medium, and genre categories is therefore a key requirement.
ICE corpora use Standard Generalized Markup Language (SGML) to encode typographic and content features of a text, as well as word-classes, syntactic structures, and functions. Searches and concordances should be able to include any combination of these markup symbols in their search arguments, together with lexical items, punctuation, and wildcards. The citations or concordance lines shown should be able to focus attention on relevant annotations by selectively filtering out markup symbols. Besides providing facilities to support linguistic research, ICECUP has to be easy to use and accessible to potential users. Making functions easy to find, and providing help to explain the options at any point in the program, are therefore also key parts of the specification.
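The search and filtering requirements just described can be illustrated with a minimal Python sketch. It is not ICECUP's implementation: the sample tags, the tokenizer, and the function names `tokenize`, `search`, and `concordance_line` are all assumptions made for illustration, and the shell-style wildcards of Python's `fnmatch` module stand in for whatever wildcard syntax the program adopts.

```python
import re
from fnmatch import fnmatchcase

# Assumed toy tokenizer: SGML-style tags, words, and punctuation marks
# each become one token. Real ICE markup is richer than this.
TOKEN_RE = re.compile(r"<[^>]+>|\w+(?:'\w+)?|[^\w\s]")


def tokenize(text):
    """Split a marked-up line into tag, word, and punctuation tokens."""
    return TOKEN_RE.findall(text)


def is_markup(token):
    """True for tokens that look like SGML tags, e.g. <bold> or </bold>."""
    return token.startswith("<") and token.endswith(">")


def search(tokens, pattern):
    """Return start indices where consecutive tokens match the pattern.

    Each pattern item is matched against one token with shell-style
    wildcards ('*' for any run of characters, '?' for one character),
    so a literal question mark must be written as '[?]'.
    """
    n = len(pattern)
    return [
        i
        for i in range(len(tokens) - n + 1)
        if all(fnmatchcase(tok, pat) for tok, pat in zip(tokens[i:i + n], pattern))
    ]


def concordance_line(tokens, start, length, width=3, hide_markup=True):
    """Format one hit with up to `width` context tokens on each side.

    With hide_markup=True, tags are filtered out of the displayed line
    (and do not count towards the context width).
    """
    def visible(seq):
        return [t for t in seq if not (hide_markup and is_markup(t))]

    left = visible(tokens[:start])[-width:] if width else []
    match = visible(tokens[start:start + length])
    right = visible(tokens[start + length:])[:width]
    return f"{' '.join(left)} [{' '.join(match)}] {' '.join(right)}".strip()


# Hypothetical marked-up sentence used only for demonstration.
sample = "<p> The <bold> corpus </bold> is large , but the corpora are larger . </p>"
tokens = tokenize(sample)

word_hits = search(tokens, ["corp*"])              # lexical item with wildcard
tagged_hits = search(tokens, ["<bold>", "corp*"])  # markup symbol plus wildcard
punct_hits = search(tokens, ["large*", ","])       # wildcard plus punctuation
```

Calling `concordance_line(tokens, tagged_hits[0], 2)` displays the hit with the tags suppressed, while passing `hide_markup=False` keeps them in the line. Treating tags as ordinary tokens is the design choice that lets one search routine handle markup, words, and punctuation uniformly; filtering then becomes purely a display step.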
A modular programming language that is widely used, and available on a popular platform, would enhance the program's flexibility and reusability. Languages such