Perception & Psychophysics

Similarity Structure in Visual Speech Perception and Optical Phonetic Signals

Article excerpt

A complete understanding of visual phonetic perception (lipreading) requires linking perceptual effects to physical stimulus properties. However, the talking face is a highly complex stimulus, affording innumerable possible physical measurements. In the search for isomorphism between stimulus properties and phonetic effects, second-order isomorphism was examined between the perceptual similarities of video-recorded, perceptually identified speech syllables and the physical similarities among the stimuli. Four talkers produced the stimulus syllables, each comprising one of 23 initial consonants followed by one of three vowels. Six normal-hearing participants identified the syllables in a visual-only condition. Perceptual stimulus dissimilarity was quantified using the Euclidean distances between stimuli in perceptual spaces obtained via multidimensional scaling. Physical stimulus dissimilarity was quantified using face points recorded in three dimensions by an optical motion capture system. The variance accounted for in the relationship between the perceptual and the physical dissimilarities was evaluated using both the raw dissimilarities and the weighted dissimilarities. With weighting and the full set of 3-D optical data, the variance accounted for ranged between 46% and 66% across talkers and between 49% and 64% across vowels. The robust second-order relationship between the sparse 3-D point representation of visible speech and the perceptual effects suggests that the 3-D point representation is a viable basis for controlled studies of first-order relationships between visual phonetic perception and physical stimulus attributes.
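To make the analysis pipeline concrete, the sketch below walks through the second-order comparison on toy data: multidimensional scaling of a perceptual dissimilarity matrix, pairwise Euclidean distances in both the perceptual embedding and the 3-D optical point space, and the squared correlation between the two distance vectors as the variance accounted for. Everything here, the synthetic data, the array shapes, the three-dimensional embedding, and the unweighted correlation, is an illustrative assumption rather than the authors' actual pipeline.

```python
# Illustrative sketch of the second-order comparison: MDS of a
# perceptual dissimilarity matrix, pairwise distances on both sides,
# and squared correlation as "variance accounted for".
# Data, shapes, and names are hypothetical, not the authors' pipeline.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import pearsonr
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
n_syllables = 23  # one stimulus per initial consonant, as in the study

# Perceptual side: a symmetric dissimilarity matrix, e.g. derived from
# visual-only identification confusions (random here for illustration).
d = rng.random((n_syllables, n_syllables))
perceptual_dissim = (d + d.T) / 2
np.fill_diagonal(perceptual_dissim, 0.0)

# Embed the stimuli in a low-dimensional perceptual space, then take
# Euclidean distances between the embedded points.
mds = MDS(n_components=3, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(perceptual_dissim)
perceptual_dist = pdist(coords)  # condensed vector of pairwise distances

# Physical side: 3-D optical face points per syllable, assumed already
# time-aligned and resampled to a common number of frames.
n_frames, n_points = 30, 17
optical = rng.random((n_syllables, n_frames, n_points, 3))
physical_dist = pdist(optical.reshape(n_syllables, -1))

# Second-order isomorphism: how much perceptual dissimilarity variance
# the physical dissimilarities account for.
r, _ = pearsonr(physical_dist, perceptual_dist)
print(f"variance accounted for: {r ** 2:.1%}")
```

The article additionally evaluates weighted dissimilarities; a weighted variant would scale each pairwise term before the correlation, but the weighting scheme itself is not specified in this excerpt.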

Speech production biomechanics generate optical phonetic as well as acoustic phonetic signals, and humans typically integrate the information afforded by both. A growing list of audiovisual phenomena demonstrates the influence of visual speech stimuli on speech perception. The well-known McGurk (McGurk & MacDonald, 1976) and ventriloquist (De Gelder & Bertelson, 2003) effects demonstrate audiovisual integration. Being able to see a talker produces substantial gains in comprehending acoustic speech in noise (MacLeod & Summerfield, 1987; Sumby & Pollack, 1954), improvements in comprehending difficult messages even under good listening conditions (Arnold & Hill, 2001; Reisberg, McLean, & Goldfield, 1987), enhanced speech detection under adverse acoustic signal-to-noise conditions (Bernstein, Auer, & Takayanagi, 2004; Grant, 2001; Grant & Seitz, 2000), compensation for auditory speech information that is reduced by filtering out various frequency bands (Grant & Walden, 1996) or by hearing loss (Erber, 1975; Grant, Walden, & Seitz, 1998), and superadditive levels of speech perception from combinations of extremely minimal auditory speech information and visible speech (Breeuwer & Plomp, 1986; Iverson, Bernstein, & Auer, 1998; Kishon-Rabin, Boothroyd, & Hanin, 1996; Moody-Antonio et al., 2005).

Visual speech stimuli alone afford reduced phonetic information, relative to auditory speech stimuli that are presented under good listening conditions. Speech production activities are partially occluded from view by the lips, cheeks, and neck (e.g., hidden from view is vocal fold vibration, related to the phonological voicing distinction; partially hidden is the type of vocal tract closure made by the tongue, related to the phonological manner distinctions; and hidden is the state of the velum, related to nasality). As a result, fairly systematic, although far from invariant, clusters of confusions (e.g., /m, b, p/) among visual speech segments are regularly observed (cf. Kricos & Lesner, 1982; Owens & Blazek, 1985; Walden, Prosek, Montgomery, Scherr, & Jones, 1977). Fisher (1968) coined the term viseme to capture this sort of perceptual similarity among groups of phonemes. Visemes are sometimes regarded as unitary perceptual categories, having no internal perceptual structure that conveys additional phonemic information (e.g., …
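As an illustration of how viseme-like groupings such as /m, b, p/ can be recovered from identification data, the hedged sketch below hierarchically clusters a small, made-up visual-only confusion matrix; the matrix values, the linkage method, and the cut threshold are all hypothetical, not taken from the article or the studies it cites.

```python
# Illustrative sketch: recover viseme-like groups by hierarchically
# clustering a visual-only consonant confusion matrix. The matrix
# values, linkage method, and cut threshold are all made up.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

consonants = ["m", "b", "p", "f", "v", "w"]
# Row i, column j: proportion of trials on which consonant i was
# identified as consonant j (rows sum to 1).
confusions = np.array([
    [0.50, 0.25, 0.20, 0.01, 0.02, 0.02],
    [0.24, 0.48, 0.24, 0.01, 0.01, 0.02],
    [0.22, 0.26, 0.47, 0.02, 0.01, 0.02],
    [0.01, 0.01, 0.02, 0.55, 0.39, 0.02],
    [0.02, 0.01, 0.01, 0.41, 0.53, 0.02],
    [0.02, 0.02, 0.02, 0.02, 0.02, 0.90],
])

# Symmetrize mutual confusability and convert it to a dissimilarity.
similarity = (confusions + confusions.T) / 2
dissimilarity = 1.0 - similarity
np.fill_diagonal(dissimilarity, 0.0)

# Average-linkage clustering; the cut height is purely illustrative.
tree = linkage(squareform(dissimilarity, checks=False), method="average")
labels = fcluster(tree, t=0.85, criterion="distance")
for cluster in sorted(set(labels)):
    group = [c for c, lab in zip(consonants, labels) if lab == cluster]
    print("viseme group: /" + ", ".join(group) + "/")
```

On this toy matrix the cut yields /m, b, p/, /f, v/, and /w/ as separate groups, mirroring the kind of confusion clusters the lipreading literature reports.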
