Markup is the first level of annotation applied to the component corpora in ICE. It may be divided into two distinct types: textual markup, which is added to the texts themselves, and bibliographical and biographical markup, which is stored externally in the form of a file header for each text. The system for textual markup is based on a proposal by Rosta (1990) and is fully described in two manuals, one each for spoken and written texts (Nelson, 1991a, 1991b). The system for encoding bibliographical and biographical information is described in Nelson (1991c). In this paper I will discuss both markup types in turn, giving examples from the British ICE corpus (ICE-GB). Finally, I will discuss some of the ways in which markup is used in text retrieval.
Textual markup encodes features of the original text that are lost when it is converted into a computerized text file. The texts are stored as plain ASCII files, so in written texts, for example, typographic features such as boldface, italics, and underlining are lost during computerization. In spoken texts, the transcription must be marked up to indicate such features as pauses, speaker turns, and overlapping segments. These textual features are encoded by adding markup symbols to the text. All markup symbols are enclosed within angled brackets. In most cases they appear in pairs, with an opening symbol 〈symbol〉 and a closing symbol 〈/symbol〉. For example, if the word 'every' appears in boldface in the original printed text, then it will appear as 〈bold〉every〈/bold〉 in the corpus. Similarly, headings are enclosed within 〈h〉 and 〈/h〉, while paragraphs are enclosed within 〈p〉 and 〈/p〉. The markup symbols are inserted manually, but the process is partially automated by the Markup Assistant program. This is a set of WordPerfect macros which assigns whole markup symbols to single keys. When the markup has been applied, the CHECKMUP program in ICECUBE checks that all the symbols are valid and that every opening symbol has a corresponding closing one. A complete list of the ICE markup