Academic journal article Information Technology and Libraries

Library Systems and Unicode: A Review of the Current State of Development

Article excerpt

Unicode, a standard developed in 1991, defines a universal character set for encoding the characters in the scripts of the world's languages. Unicode implementation has been gaining momentum in recent years, especially in the software and computer industry. Academic libraries with collections of materials in multiple languages will want to take advantage of Unicode for display and searching of materials in non-Latin scripts such as Arabic, Hebrew, and Chinese. This article reviews Unicode and its incorporation in library systems.

**********

Unicode is a standard for a universal character set for encoding the characters in the scripts of the world's languages. It is a fully compatible 16-bit version of an international standard developed by the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC), known as ISO/IEC 10646 or the Universal Multiple-Octet Coded Character Set. The Unicode standard was developed by a consortium of interested parties, known as the Unicode Consortium, composed mainly of such computer industry giants as IBM, Apple Computer, Adobe Systems, and Microsoft. However, the library world also had a stake in its development. Research Libraries Group (RLG), Online Computer Library Center (OCLC), and several library system companies are members of the consortium.

Although the earliest version of the Unicode standard was published in 1991, it is just beginning to be fully incorporated into many systems. Its use is prevalent in the computer industry in such programming languages as Sun Microsystems's Java and such operating systems as Microsoft Windows NT and Apple Computer's Mac OS 8.5. Popular Web browsers, including Microsoft Internet Explorer, Netscape Navigator, and the latest version of Opera, also support Unicode. With Web browsers that can be set to view Unicode, switching back and forth between different character sets is no longer necessary. Taking advantage of the Unicode character set has become easier in recent years with the availability of Unicode fonts, such as Code 2000 and Code 2001, a shareware font produced by James Kass, and Microsoft's Arial Unicode MS. (1)

This standard is important to libraries that collect materials in many different languages and want to display the native scripts in their Web catalogs as well as allow users to search by typing in the native scripts. It is especially useful for displaying a record that contains multiple scripts, such as a record with both Arabic and Hebrew. The idea behind Unicode was to develop one international character set for all of the scripts of the world's languages. One unique code would represent each character, even if that character were used in multiple languages. This could replace the multiple older and sometimes incompatible character sets that are presently in use, including the American Standard Code for Information Interchange (ASCII) and the Extended Binary Coded Decimal Interchange Code (EBCDIC) used by the computer industry; ISO 8859 character sets, such as Latin 1; and other character sets used in MARC records, such as the East Asian Character Code (EACC) set. Unlike most of the older 7- and 8-bit character sets, which were limited to 256 characters or fewer, the Unicode standard is based on 16 bits, allowing more than 65,000 characters to be encoded.
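The arithmetic behind these limits, and the idea of one code per character, can be illustrated with a minimal Python sketch. It is not drawn from the article: the sample string and the variable names are assumptions chosen for illustration only.

```python
# Number of distinct values available at each code width.
capacity = {bits: 2 ** bits for bits in (7, 8, 16)}
print(capacity)  # {7: 128, 8: 256, 16: 65536}

# Illustrative sample (an assumption, not from the article): an Arabic word
# and a Hebrew word held together in a single string.
mixed = "\u0633\u0644\u0627\u0645 \u05e9\u05dc\u05d5\u05dd"

# Each character maps to exactly one code point, whatever its script.
code_points = [f"U+{ord(ch):04X}" for ch in mixed]
print(code_points)
```

Because every character carries its own code point, the Arabic and Hebrew words can sit in the same record without any switch between character sets.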

When it comes to transforming Unicode characters into bits and bytes that can be stored or transmitted by computer systems, there are three encoding forms that can be used to comply with the standard: UTF-8, UTF-16, and UTF-32. UTF-8 uses from one to four 8-bit code units or bytes. The big advantage of UTF-8 is that ASCII and its equivalent Unicode characters have the same value, making it more compatible with existing software. Most Web browsers also support UTF-8. UTF-16 uses one to two 16-bit code units, and UTF-32 uses a single 32-bit code unit. …
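The difference among the three encoding forms can be seen by counting code units for a few characters. The following is a rough Python sketch, not part of the original article; the helper `code_units` and the sample characters are hypothetical names and choices made for illustration.

```python
# Hypothetical sample characters: a Latin letter, a Hebrew letter,
# a Chinese character, and a character outside the 16-bit range.
samples = {
    "latin_a": "A",
    "hebrew_alef": "\u05d0",
    "cjk_zhong": "\u4e2d",
    "gothic_hwair": "\U00010348",
}

def code_units(ch):
    """Return the number of code units one character needs in each form."""
    return {
        "utf-8": len(ch.encode("utf-8")),            # 8-bit units (bytes)
        "utf-16": len(ch.encode("utf-16-be")) // 2,  # 16-bit units
        "utf-32": len(ch.encode("utf-32-be")) // 4,  # 32-bit units
    }

for name, ch in samples.items():
    print(name, code_units(ch))

# ASCII text is byte-for-byte identical when encoded as UTF-8.
assert "Library".encode("ascii") == "Library".encode("utf-8")
```

The big-endian codec names are used here so that Python does not prepend a byte order mark, which would otherwise inflate the counts.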
