ICE Annotation Tools
AKIVA QUINN and NICK PORTER
A key aspect of the International Corpus of English (ICE) is the detailed linguistic annotation it contains. Annotating textual features, word classes, syntactic categories, and functions allows comparisons to be made along many axes. Transforming a raw input file into a properly annotated text can involve a great deal of work, so software tools have been developed to keep this effort to a minimum. This paper describes four programs produced by the Survey of English Usage for machine-assisted annotation. The ICE Markup Assistant automates the insertion of textual markup, generating ICE markup symbols at a single key press and ensuring that every opened markup symbol is closed. The ICE Tag Selection System automates selection among the alternative word-class tags generated by an automatic word-class tagger. The ICE Syntactic Marking System automates the addition of syntactic markers to texts before they are processed by an automatic parser. The ICE Syntactic Tree Annotator complements automatic parsing by providing a graphical environment for the manual editing of syntactic analyses.
The ICE Markup Assistant reduces the number of key presses needed to insert the standard set of ICE markup symbols used throughout the project. ICE uses Standard Generalised Markup Language (SGML) to encode, in a machine-independent manner, a range of typographic and content features, as well as the structure of a text. The Assistant is implemented as a set of macros under WordPerfect. Text unit markup is inserted automatically at probable sentence boundaries, that is, after each full stop, question mark, or exclamation mark that is followed by a space or an end of line. Having most text units inserted correctly saves time, and the spurious text units inserted after abbreviations can easily be deleted. Shortened key sequences are provided for all the standard ICE markup symbols. Most markup types require an opening and a closing symbol around each sequence that forms a paragraph, appears in boldface, and so on. For instance, the following sequence represents two words in boldface: