Academic journal article Information Technology and Libraries

The Syllables in the Haystack: Technical Challenges of Non-Chinese in a Wade-Giles-to-Pinyin Conversion. (Communications)

Article excerpt

This paper describes the technical challenges of developing software to convert Wade-Giles to Pinyin in bibliographic records that are not in Chinese.

**********

The Chinese language differs from the alphabetic languages to which most Westerners are accustomed. Representing items in such a language in a bibliographic database that principally employs roman script requires converting the original to alphabetic characters, possibly with diacritics. Systems of such transliteration for Chinese date from at least 1605, but the one prevalent in the United States over the last hundred years or so is the Wade-Giles (WG) system. (1)

Recently the Library of Congress (LC) decided to discontinue use of WG and switch to the newer Pinyin form of transliteration, adopted by the People's Republic of China in the late 1950s. This meant converting to Pinyin the Chinese records in the Online Computer Library Center (OCLC) authority file and the OCLC bibliographic file.

This evolved into a consortial effort among LC, the Research Libraries Group (RLG), and OCLC that extended over three years. Earlier efforts by the OCLC Office of Research have been reported elsewhere. (2)

Background

Once requests for comments and discussions with key libraries had taken place, there were five major parts of the conversion effort to plan:

1. Conversion of LC Chinese authority records by OCLC, scheduled to take place no later than October 1, 2000 ("Day One")

2. Conversion of LC bibliographic records by RLG

3. Bibliographic records conversion by OCLC and RLG of their respective union catalogs

4. Conversion by OCLC of the non-Chinese records containing Chinese text, and later by RLG of similar records in their databases

5. Conversion efforts by OCLC and RLG of records of institutions from WG to Pinyin

Development of the Specifications

The specifications were developed cooperatively as the project progressed and can be seen at the LC Web site. (3) Some general points to keep in mind about the conversion:

1. Only fields/subfields specified by LC in the specs were to be converted.

2. Conversions made heavy use of dictionary lookups, not only to convert WG syllables to their Pinyin counterparts, but also to match phrases such as place names. The conversion sequences, which were directions for specific types of conversion such as geographic place names or Taiwan names, dealt with these special cases. The dictionaries for these sequences were generally ordered from longest to shortest entry, to allow the most complete phrase matching.
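The longest-to-shortest ordering can be illustrated with a minimal Python sketch. The two-entry phrase dictionary and the function names are hypothetical stand-ins, not taken from the LC specifications; the point is only that trying longer entries first keeps a short entry from pre-empting a fuller phrase match.

```python
import re

# Hypothetical sample entries; the real sequence dictionaries were far larger.
PLACE_NAMES = {
    "pei-ching": "Beijing",
    "pei-ching shih": "Beijing Shi",
}


def longest_first(dictionary):
    """Return (WG, Pinyin) pairs ordered from longest to shortest WG entry."""
    return sorted(dictionary.items(), key=lambda item: len(item[0]), reverse=True)


def convert_phrases(text, dictionary=PLACE_NAMES):
    """Replace dictionary phrases in text, always preferring the longest match.

    Matching is case-insensitive and must end at a space, a hyphen, or the
    end of the text. Anything that matches no entry is copied unchanged.
    """
    entries = longest_first(dictionary)
    rest = text
    out = []
    while rest:
        for wg, pinyin in entries:
            if rest.lower().startswith(wg) and (
                len(rest) == len(wg) or rest[len(wg)] in " -"
            ):
                out.append(pinyin)
                rest = rest[len(wg):]
                break
        else:
            # No phrase starts here: copy one token or one separator as is.
            token = re.match(r"[^ -]+|[ -]", rest).group()
            out.append(token)
            rest = rest[len(token):]
    return "".join(out)
```

With these sample entries, `convert_phrases("Pei-ching shih")` yields `Beijing Shi`. Had the shorter entry been tried first, it would have matched only "Pei-ching" and left "shih" behind unconverted.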

The Dictionaries

The Standard Dictionary (STD) was the complete list of the more than four hundred WG syllables and their Pinyin forms. This was searched after all other conversion sequences had a chance to do special matching.
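The search order described here can be sketched in Python as follows. The sequence names and sample entries are illustrative assumptions rather than the project's actual data; only the ordering, with the STD consulted last, comes from the text.

```python
# Hypothetical sample data. Special sequences come first, the STD comes last.
GEOGRAPHIC = {"shang-hai": "Shanghai", "hsi-an": "Xi'an"}
STD = {"hsi": "xi", "an": "an", "shang": "shang", "hai": "hai"}

SEQUENCES = [
    ("geographic", GEOGRAPHIC),
    ("standard", STD),  # always the final fallback
]


def convert_unit(unit, sequences=SEQUENCES):
    """Return (converted text, name of the sequence that matched).

    Sequences are tried in order, so the STD is reached only when no
    special sequence matched. Unknown text is returned unchanged with None.
    """
    key = unit.lower()
    for name, dictionary in sequences:
        if key in dictionary:
            return dictionary[key], name
    return unit, None
```

Returning the name of the matching sequence alongside the result is a design choice of this sketch; it makes it easy to audit which conversions came from special matching and which fell through to the STD.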

Anyone familiar with the WG and Pinyin romanization schemes knows that while diacritics and initial letters are usually enough to signal WG to the human eye, the two schemes in fact overlap considerably. In some cases a syllable's romanization is unique to WG or to Pinyin; in other cases the same syllable is romanized in just the same way in both; and in yet others, syllables spelled the same way in WG and Pinyin represent different sounds in each scheme. In testing early conversions, it quickly became evident that any automated conversion scheme needed to distinguish syllables that could only be WG from those that could also be Pinyin, whether as the same syllable or as a common spelling that stands for a different sound.

As if this were not complex enough, it became apparent that the overlap of WG Chinese with other languages could lead to erroneous conversion of other languages to Pinyin. Systems of safeguards were developed.

One safeguard in the specification broke down the STD into four subdictionaries: Unique WG, Unique Pinyin, Same syllable in both, and Common (same spelling but different sound). …
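A minimal Python sketch of the four-way split follows. The handful of sample syllables and the decision rule in `looks_like_wg` are assumptions made for illustration; the excerpt breaks off before describing how the specification actually applied the subdictionaries.

```python
from enum import Enum


class Kind(Enum):
    UNIQUE_WG = "unique WG"
    UNIQUE_PINYIN = "unique Pinyin"
    SAME = "same syllable in both"
    COMMON = "same spelling, different sound"
    UNKNOWN = "not in the STD"


# Tiny illustrative samples of each subdictionary, not the real lists.
UNIQUE_WG = {"hsi": "xi", "ch'ing": "qing"}   # WG spelling -> Pinyin
UNIQUE_PINYIN = {"xi", "qing", "zhong"}       # never valid WG spellings
SAME = {"min": "min", "wang": "wang"}         # identical in both schemes
COMMON = {"chu": "zhu", "pa": "ba"}           # read as Pinyin: WG ch'u, p'a


def classify(syllable):
    """Say which subdictionary, if any, a syllable belongs to."""
    s = syllable.lower()
    if s in UNIQUE_WG:
        return Kind.UNIQUE_WG
    if s in UNIQUE_PINYIN:
        return Kind.UNIQUE_PINYIN
    if s in SAME:
        return Kind.SAME
    if s in COMMON:
        return Kind.COMMON
    return Kind.UNKNOWN


def looks_like_wg(syllables):
    """Assumed rule: convert only when there is positive evidence of WG.

    Requires at least one unique-WG syllable and no syllable that is
    unique to Pinyin or missing from the dictionaries altogether.
    """
    kinds = [classify(s) for s in syllables]
    return (
        Kind.UNIQUE_WG in kinds
        and Kind.UNIQUE_PINYIN not in kinds
        and Kind.UNKNOWN not in kinds
    )
```

Under this assumed rule, a string such as "chu pa" is left alone because nothing in it is unambiguously WG, while "hsi min" qualifies for conversion.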
