AUTASYS: Grammatical Tagging and Cross-Tagset Mapping
ALEX CHENGYU FANG
Ever since the advent of the first computerized linguistic corpus in the 1960s, linguists and computer programmers have been working on the annotation of material thus stored. Word-class tagging, the assignment of an unambiguous indication of the grammatical word class to each word in a text, has been in great demand, not only in lexicographical and grammatical studies, but also in natural language processing (NLP), an area where the corpus-based, or more specifically, probabilistic approach is becoming increasingly popular. Taggers have flourished, and the past twenty years or so have witnessed TAGGIT (Greene and Rubin, 1971), CLAWS (Marshall, 1983; Garside et al., 1987), VOLSUNGA (DeRose, 1988), AGTS (Huang, 1991), and TOSCA (Oostdijk, 1991), to name just a few. Tagsets that differ in various respects have also come into being, with Brown (Francis, 1980), LOB (Johansson et al., 1986), and Lund (Svartvik, 1987) as the best known. Most recently, a tagset has been designed at the Survey of English Usage (SEU), University College London (Greenbaum and Ni, 1994; Greenbaum, 1995), which has been used to annotate the one-million-word British component of the International Corpus of English (ICE-GB, cf. Greenbaum, 1992).
This has created an intriguing situation in corpus annotation. On the one hand, compilers of corpora vary in what they intend as the primary uses of their corpora. Grammarians, lexicographers, language teachers, and NLP researchers naturally want different information from corpus annotation: grammatical, morphological, discoursal, statistical, semantic, pragmatic, or prosodic. On the other hand, no single annotation scheme has yet met all these requirements. Corpora annotated according to different schemes have thus become 'isolated islands', rendering cross-corpus studies virtually impossible. Consequently, it is desirable either that a standard annotation scheme be agreed upon in this field, or that flexible systems be designed that can readily adapt themselves to different annotation schemes.
The tagger described in this chapter, AUTASYS, was designed by Alex Chengyu