Speech Recognition, Understanding, Not Same


Odd things are afoot. IBM is starting a major research push to make computers understand ("recognize," actually) speech better than people do by 2010. That's ambitious. They may do it, though.

It's important here to understand the difference between "speech recognition," which is what IBM is up to, and the "understanding" of speech. Speech recognition means knowing what words were spoken, nothing more. If you say to a speech-recognizer, "What is art?" it will print out exactly that. It won't have the foggiest idea what it means. Making a computer understand speech, in the sense of analyzing grammar and handling context, is a bear of a problem. Recognizing speech is a different and more tractable problem.

Pretty good speech-recognition software is available for PCs now, at least for recognizing clear English in a quiet environment. The leaders are Naturally Speaking, from Dragon Systems, and ViaVoice, from IBM. They work - pretty well, anyway.

I've used Naturally Speaking. After installing it, you read passages it displays on-screen into a microphone, so that it can learn your voice and inflections. It makes occasional mistakes, how often depending on a number of factors, but it sure enough prints what I say. Mostly.

But it is not speaker-independent: It works only for people on whose voices it has been trained. Speaker-independence - meaning that it will quickly adapt itself to anyone's voice - is an important goal. IBM (and everybody else in the field) is chasing it.

Another big problem is noisy environments. Today's software is thrown off by smallish amounts of background noise. The trading floor at the New York Stock Exchange would be hopeless. But that's the kind of problem the project is tackling. How to make it work?

I talked to David Nahamoo, who is honcho of IBM's effort. He said that one promising approach is to combine lip reading with acoustic analysis. Today computers with cameras are quite able to watch the movements of a speaker's mouth. (Remember the flap over face-recognizing cameras at football games.) As an artificial example, suppose a speaker in a noisy environment said, "Pool." The acoustic-analysis software might miss part of it, and not know whether the speaker had said "pool" or "ghoul." Because the lip positions differ, the camera would allow the computer to make the distinction. …
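For the curious, here is a rough sketch of how two such sources of evidence might be combined. It is not IBM's method, which the article does not describe in detail; it simply multiplies the probability each source assigns to a candidate word and renormalizes. The function name `fuse` and all of the numbers are made up for illustration.

```python
def fuse(acoustic, visual):
    """Combine two per-word probability tables (hypothetical inputs).

    Multiplies the probabilities each source gives a word, then
    renormalizes so the combined scores sum to 1.
    """
    words = acoustic.keys() & visual.keys()
    scores = {w: acoustic[w] * visual[w] for w in words}
    total = sum(scores.values())
    if total == 0:
        raise ValueError("the two sources share no plausible word")
    return {w: s / total for w, s in scores.items()}


# Invented numbers: in a noisy room the sound alone cannot tell
# the two words apart...
acoustic = {"pool": 0.5, "ghoul": 0.5}
# ...but the camera sees the lips close for the "p".
visual = {"pool": 0.9, "ghoul": 0.1}

fused = fuse(acoustic, visual)
best = max(fused, key=fused.get)
print(best, fused)
```

With these invented figures the sound is a coin toss, so the camera's evidence settles the matter in favor of "pool."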