Intelligibility of Speech in a Virtual 3-D Environment
MacDonald, Justin A., Balakrishnan, J. D., Orosz, Michael D., Karplus, Walter J., Human Factors
The number of simultaneously active sound feeds often limits the ability of human operators to interpret and respond to auditory messages in electronic communication systems (e.g., telecommunications). Cockpits and air traffic control rooms, military and police radio communications, and teleconferencing are examples of common situations in which many different sounds might become important at one time or another and the listener must be able to selectively attend to one or more of them while entirely or partially tuning out some or all of the others. Controlling volume levels of different channels in the system, following or enforcing communication protocols, and directing attention efficiently to a given sound ensemble can be especially difficult during the chaotic interchanges typical of a mission-critical or crisis situation.
Recently, the three-dimensional (3-D) auditory displays of virtual reality have been advocated as a means of facilitating this multichannel listening process (e.g., Begault & Wenzel, 1993; Burdea, Richard, & Coiffet, 1996; Doll & Hanna, 1995; Doll, Hanna, & Russotti, 1992; King & Oldfield, 1997; Noro, Kawai, & Takao, 1996; Ricard & Meirs, 1994). Instead of piping all sound channels directly into speakers or earphones at equal volumes, a specialized sound card is used to create the illusion that the different sounds in the system originate from different locations in the space surrounding the listener. The 3-D effect is achieved by first introducing a time delay to simulate the interaural arrival time differences associated with the different distances of a sound from the left and right ear. The two ear channels are then subjected to a series of preprogrammable filters (the head-related transfer function, or HRTF) to simulate the effects of the head, pinnae, and torso on a waveform (e.g., Begault, 1994; Begault & Wenzel, 1993; Wenzel, Arruda, Kistler, & Wightman, 1993). Differences in the sound pressure level and arrival times of the waveform at the two ears provide lateral direction cues, and effects of anatomical structure (e.g., pinnae shape) provide information about elevation and position on the transverse (front/back) axis. The effects of head shadow on the intensity of stimuli at the ears become more pronounced in higher-frequency regions, whereas the interaural time difference (ITD) is more noticeable at lower frequencies.
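The two-stage process described above (an interaural delay followed by per-ear filtering) can be illustrated with a minimal sketch. This is not the sound-card implementation used in the studies cited; the function name `spatialize`, its parameters, and the sign convention for the delay are hypothetical, and the single-tap filters in the usage lines are placeholders standing in for measured head-related impulse responses.

```python
import numpy as np

def spatialize(mono, sample_rate, itd_seconds, hrir_left, hrir_right):
    """Turn a mono signal into a two-channel signal (illustrative sketch).

    Assumed convention: a positive itd_seconds means the source is to the
    listener's right, so the waveform reaches the left ear later.
    hrir_left / hrir_right are per-ear impulse responses (assumed inputs).
    """
    if len(hrir_left) != len(hrir_right):
        raise ValueError("both impulse responses must have the same length")

    # Stage 1: interaural time difference, rounded to whole samples.
    delay = int(round(abs(itd_seconds) * sample_rate))
    leading = np.concatenate([mono, np.zeros(delay)])   # near ear
    lagging = np.concatenate([np.zeros(delay), mono])   # far ear
    if itd_seconds >= 0:
        left, right = lagging, leading
    else:
        left, right = leading, lagging

    # Stage 2: filter each ear channel with its impulse response.
    left = np.convolve(left, hrir_left)
    right = np.convolve(right, hrir_right)
    return np.stack([left, right], axis=1)

# Usage: a click placed to the right; the far (left) ear is attenuated
# by a placeholder one-tap filter rather than a real HRTF.
click = np.zeros(100)
click[0] = 1.0
stereo = spatialize(click, 44100, 0.0005, np.array([0.5]), np.array([1.0]))
```

With these placeholder values the left channel is simply a delayed, quieter copy of the right channel, which captures the lateral cues (arrival time and level) but none of the elevation or front/back information that the text attributes to anatomical filtering.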
Separating sounds in both real and virtual spaces has been shown to increase the intelligibility of speech paired with a noise masker (e.g., Doll & Hanna, 1995; Ricard & Meirs, 1994; Saberi, Dostal, Sadralodabai, Bull, & Perrott, 1991) and with interfering speech (e.g., Dirks & Wilson, 1969; Ericson & McKinley, 1997; Yost, Dye, & Sheft, 1996). As one might expect, the most effective 3-D simulation is usually obtained using HRTFs that take into account the unique anatomy of the listener, including the size of the listener's head and the shape of the pinnae. In many applied situations, these individualized filters would be impractical. Fortunately, however, nearly equivalent results can be achieved with a single nonindividualized set of filters (Wenzel et al., 1993), which can be derived from a model of the human head such as the Knowles Electronics Mannequin for Acoustics Research (Knowles Electronics, Inc., Itasca, Illinois) or from a listener who is particularly good at localizing sounds (Begault & Wenzel, 1993). Begault and Wenzel showed that the performance of poor sound localizers can be increased by replacing their HRTFs with those of an exceptionally good localizer.
The main disadvantages of nonindividualized HRTFs are increased frequency of front/back reversals and more difficulty in simulating elevation (Wightman & Kistler, 1989). The same kinds of benefits for 3-D sound displays over monophonic listening, however, have been observed using both individualized and nonindividualized methods. …