Kevin H. Richardson
AT&T, Rm D2-2C34, 200 Laurel Avenue, Middletown, NJ 07748, USA
With the advent of automatic call handling, AT&T, and more specifically its User Interface community, has devoted a great deal of time and resources to the development, modification, coordination, and storage of recorded human speech. One solution to the burden of managing these recordings lies in Text-To-Speech (TTS) technology. TTS holds the promise not only of eliminating the need for recorded voice announcements but also of providing a common AT&T "sound". The potential benefits are obvious. However, the traditional drawbacks of TTS technology are equally obvious: it does not sound as natural as recorded human speech, it is occasionally difficult to understand, and users do not like it.
One of the keys to the "naturalness" and user acceptance of TTS-generated announcements lies in the technology's ability to adequately model the prosodic, or stylistic, elements of human speech (e.g., stress patterns, intonation, rate, and rhythm). The psychological literature has demonstrated quite clearly that prosodic information plays an important role both in how natural a phrase sounds and in how easy it is to recognize, understand, and remember (Diehl, Souther, & Convis, 1980; Martin, 1972; Miller, 1981). Current research at AT&T Labs has resulted in much more understandable and natural-sounding "next generation" TTS algorithms (see Beutnagel, Conkie, Schroeter, Stylianou, & Syrdal, 1998). To determine whether TTS can successfully replace recorded announcements, two studies were conducted to examine the prosodic differences among traditional TTS, AT&T's "NextGen" TTS, and recorded human speech.
The first study determined how successful researchers at AT&T Labs have been in bridging the "prosody gap" between Text-To-Speech and human speech. Thirty-nine participants between the ages of 16 and 70 with no reported speech