Audio and Visual Cues in a Two-Talker Divided Attention Speech-Monitoring Task

Human Factors

INTRODUCTION

In recent years, improvements in data transmission technology have dramatically reduced the cost of telecommunications bandwidth. To this point, however, relatively little effort has been made to exploit this low-cost bandwidth in improved speech communications systems. In part, this apparent oversight reflects the fact that standard telephone-grade audio speech (with a bandwidth of roughly 3500 Hz) already produces near 100% intelligibility for typical telephone conversations involving a single talker in a quiet listening environment. There is, however, ample opportunity for higher-bandwidth speech communication systems to improve performance in complex listening tasks that involve more than one simultaneous talker. High-bandwidth multichannel speech communication systems could have a wide variety of applications, ranging from simple three-way conference calling to sophisticated command and control tasks that require listeners to monitor and respond to time-critical information that could be present in any one of a number of simultaneously presented competing speech messages.

A question of practical interest, therefore, is how additional bandwidth could best be allocated to improve the effectiveness of multichannel speech communications systems. The most obvious approaches to this problem involve the restoration of the audio and visual cues that listeners rely on to segregate speech signals in real-world multitalker listening environments, such as crowded restaurants and cocktail parties. For example, listeners in the real world rely on interaural differences between the audio signals reaching their left and right ears to help them segregate the voices of spatially separated talkers (see Bronkhorst, 2000, for a recent review of this phenomenon). When these binaural cues are restored to a speech communication signal by adding a second independent audio channel to the system, multitalker listening performance improves dramatically (Abouchacra, Tran, Besing, & Koehnke, 1997; Crispien & Ehrenberg, 1995; Ericson & McKinley, 1997; Nelson, Bolia, Ericson, & McKinley, 1999).
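To make the idea of "restoring binaural cues by adding a second audio channel" concrete, the sketch below (not from the article; a minimal illustration assuming numpy, with all parameter values, the sample rate, and the sine-tone stand-ins for recorded speech chosen arbitrarily) renders two mono talker signals to a two-channel stereo mix by imposing simple interaural time and level differences, so that each talker appears at a different apparent azimuth.

    # Illustrative sketch: spatializing two talkers over a two-channel (stereo) link
    # using interaural time differences (ITD) and interaural level differences (ILD).
    # All numeric values are assumptions for demonstration only.
    import numpy as np

    FS = 8000  # sample rate in Hz (telephone-grade audio spans roughly 3500 Hz)

    def spatialize(signal, itd_ms, ild_db):
        """Return (left, right) channels for a mono signal.

        Positive itd_ms / ild_db place the talker toward the left ear;
        negative values place it toward the right ear.
        """
        delay = int(round(abs(itd_ms) * 1e-3 * FS))           # ITD as a whole-sample delay
        gain_near, gain_far = 1.0, 10 ** (-abs(ild_db) / 20)  # ILD as a level ratio
        near = np.concatenate([signal, np.zeros(delay)]) * gain_near
        far = np.concatenate([np.zeros(delay), signal]) * gain_far
        return (near, far) if itd_ms >= 0 else (far, near)

    # Two hypothetical talkers (sine tones standing in for recorded speech).
    t = np.arange(FS) / FS
    talker_a = np.sin(2 * np.pi * 220 * t)
    talker_b = np.sin(2 * np.pi * 330 * t)

    a_left, a_right = spatialize(talker_a, itd_ms=+0.6, ild_db=6)  # talker A on the left
    b_left, b_right = spatialize(talker_b, itd_ms=-0.6, ild_db=6)  # talker B on the right

    # The second channel carries the interaural differences that a single
    # (monaural) channel cannot convey.
    stereo = np.stack([a_left + b_left, a_right + b_right], axis=1)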

Additional bandwidth could also be used to restore the visual speech cues that are normally available in face-to-face conversations. These cues make it possible to extract some information from visual-only speech stimuli (a process commonly known as speechreading; Summerfield, 1987), and they contribute substantially to audiovisual (AV) speech perception when the audio signal is distorted by the presence of noise (Sumby & Pollack, 1954) or interfering speech (Rudmann, McCarley, & Kramer, 2003).

From earlier experiments, it is clear that multitalker listening performance can be improved both by the addition of binaural spatial audio cues and by the addition of visual speech information. However, relatively little is known about how audio and visual information might interact in high-bandwidth multichannel AV speech displays. Important research issues related to this topic include the following:

Divided attention versus selective attention in AV speech perception. An essential underlying assumption in the design of a multitalker speech display is that neither the system nor the operator will have reliable a priori knowledge about which talker will provide the most important information at any given time. (Otherwise, either the system or the operator would simply turn off the uninformative talkers.) Consequently, it is important to know how well listeners are able to divide their attention across the different talkers in an AV speech stimulus in order to extract important information that might originate from any one of the competing speech signals. However, virtually all experiments that have examined AV speech perception with more than one simultaneous audio speech signal (Driver, 1996; Driver & Spence, 1994; Reisberg, 1978; Rudmann et al., 2003; Spence, Ranson, & Driver, 2000) have examined performance in a selective attention paradigm in which the participants were provided with a priori information about which talker to attend to and which talker to ignore prior to the presentation of each stimulus. …
