Academic journal article Information Technology and Libraries

The Structure and Content of MARC 21 Records in the Unicode Environment

Article excerpt

MARC 21 records may be encoded in individual character sets (including ASCII and ANSEL) or in Unicode (as UTF-8). This paper considers the effect of the use of Unicode without any constraints on the structure and data content of MARC 21 records. The case of Model A records where Latin is the preferred script is examined in particular detail.
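As a brief illustration of how software can tell the two encodings apart, the sketch below reads the character coding scheme from the record leader. It relies on the MARC 21 convention that Leader position 09 is blank for the individual character sets (the MARC-8 environment) and "a" for UCS/Unicode. The function name and the sample leaders are hypothetical and made up for illustration; they are not taken from any real record.

```python
def coding_scheme(leader: str) -> str:
    """Return the character coding scheme named by Leader position 09."""
    if len(leader) != 24:
        raise ValueError("a MARC 21 leader is exactly 24 characters long")
    flag = leader[9]  # zero-based index, i.e. Leader/09
    if flag == "a":
        return "UCS/Unicode"
    if flag == " ":
        return "MARC-8"
    raise ValueError(f"unexpected Leader/09 value: {flag!r}")


# Hypothetical leaders, invented for this example only.
unicode_leader = "00714cam a2200205 a 4500"
marc8_leader = "00714cam  2200205 a 4500"

print(coding_scheme(unicode_leader))
print(coding_scheme(marc8_leader))
```

A real system would also have to decide what to do when the flag and the actual byte content disagree; this sketch only reads the flag.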

Over time, the number of individual character sets that may be used in MARC 21 records has grown to twelve (see table 1). (1) Most of these character sets encode a single script, together with punctuation marks, symbols, and so on. Latin, Arabic, and Cyrillic scripts each comprise a basic and an extended character set. East Asian ideographs, Japanese katakana and hiragana, and Korean hangul are encoded in a single multi-byte character set, the East Asian Character Code (EACC). (2) Unicode equivalents have been specified for all of the characters in the individual MARC 21 character sets. (3)
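One practical difference between the individual character sets and UTF-8 is that UTF-8 uses a variable number of bytes per character, which matters wherever lengths are counted in bytes rather than in characters. The following minimal sketch, using arbitrarily chosen sample characters and a hypothetical helper name, shows the byte counts for one character from each of several scripts.

```python
# Sample characters chosen arbitrarily for illustration.
samples = {
    "Latin (ASCII)": "\u0061",         # a
    "Latin with diacritic": "\u00e9",  # e with acute, precomposed
    "Cyrillic": "\u0434",              # Cyrillic small letter de
    "East Asian ideograph": "\u4e2d",  # CJK ideograph U+4E2D
}


def utf8_length(text: str) -> int:
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


for label, ch in samples.items():
    print(f"{label}: U+{ord(ch):04X} -> {utf8_length(ch)} byte(s)")
```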

Alternatively, a limited subset of Unicode characters corresponding to characters in the individual MARC 21 character sets may be used. (4) Canadian Aboriginal Syllabics may also be used in MARC 21 records encoded in Unicode. (5) The Machine-Readable Bibliographic Information Committee (MARBI), a joint committee of the Association for Library Collections & Technical Services, the Library and Information Technology Association, and the Reference and User Services Association, continues to work on the technical requirements for the use of Unicode in MARC 21. A second part of the report, "Assessment of Options for Handling Full Unicode in Character Encodings in MARC 21," was discussed by MARBI at its June 2005 meeting. (6) Annex A of the report incorporates the concepts set forth in this paper, which were originally presented in June 2004 at the ALA Annual Conference. (7)

This paper examines the effect of a greatly expanded character repertoire on specific parts of MARC 21 records. What should the structure of the record be when a record is encoded in Unicode rather than in the individual character sets? In particular, how will MARC records accommodate a repertoire that includes not only additional non-Roman scripts but also many more Latin script characters? Are there limits on where these additional characters can be used?

* Models for multi-script records

Historically, libraries in English-speaking areas have distinguished between Latin script and all other scripts, collectively termed "non-Roman" (or, alternatively, "non-Latin"). English is written in Latin script as are many other languages, predominantly those of Europe. Where typographical facilities for non-Roman scripts are unavailable, non-Roman text is usually rendered in Latin letter equivalents, a process called romanization.

To accommodate different needs worldwide, MARC 21 is flexible with respect to the structure of records containing multi-script data. Two record models, designated A and B, are specified by MARC 21. (8) A record can have data in any script in regular fields (Model B), or a record in the preferred script can be augmented with specially designated fields holding other scripts (Model A). For a particular implementation environment, Model A or Model B is chosen.
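The difference between the two models can be sketched with a simple in-memory representation. The sketch assumes the MARC 21 convention that Model A carries the other-script data in field 880 (Alternate Graphic Representation), linked to its regular field through subfield $6. The title data, the tuple layout, and the function name are hypothetical and serve only to illustrate the structure.

```python
import re

# Each field is (tag, [(subfield_code, value), ...]). All data is made up.
# Model A: romanized data in the regular field, original script in a linked 880.
model_a = [
    ("245", [("6", "880-01"), ("a", "Zhongguo li shi")]),
    ("880", [("6", "245-01/$1"), ("a", "\u4e2d\u56fd\u5386\u53f2")]),
]
# Model B: the original script sits directly in the regular field.
model_b = [
    ("245", [("a", "\u4e2d\u56fd\u5386\u53f2")]),
]

# $6 starts with the linked tag and a two-digit occurrence number.
LINK = re.compile(r"^(\d{3})-(\d{2})")


def pair_alternate_graphics(fields):
    """Pair each regular field's $a with the $a of its linked 880 field."""
    regular, alternate = {}, {}
    for tag, subfields in fields:
        sub = dict(subfields)  # simplification: ignores repeated subfields
        match = LINK.match(sub.get("6", ""))
        if match is None:
            continue  # no linkage, as in Model B
        linked_tag, occurrence = match.groups()
        if tag == "880":
            alternate[(linked_tag, occurrence)] = sub["a"]
        else:
            regular[(tag, occurrence)] = sub["a"]
    return {k: (regular[k], alternate[k]) for k in regular if k in alternate}


print(pair_alternate_graphics(model_a))
print(pair_alternate_graphics(model_b))
```

In the Model A list the pairing yields one linked 245 entry, whereas the Model B list yields none, because the regular field holds the other-script data without any linkage.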

Model A is used most widely for MARC 21 bibliographic records with Latin as the preferred script. The use of Model A for authority records is questionable because of the complex relationships in authority data. The use of Model A for holdings data, classification data, and community information has not been explored. (9)

In figures 1 and 2, the language of cataloging is English, and the language of the monograph being cataloged is Chinese. (Text in these languages is from a record in the online catalog of the Chinese University of Hong Kong. The structural features were added by the author.)

Figure 1 shows fields from a Model A record. The preferred script for the record is Latin. …
