Academic journal article Communications of the IIMA

A Comparative Analysis of Speech Recognition Platforms

INTRODUCTION

Speech recognition (also known as automatic speech recognition) converts spoken words to text (Jurafsky & Martin, 2000). The term is broad: it covers systems that can recognize almost anybody's speech, such as a call centre system designed to recognize many voices. The performance of speech recognition systems is usually specified in terms of accuracy and speed (Allen, 1995). Accuracy is usually rated with word error rate (WER), whereas speed is measured with the real-time factor. Substantial efforts have been devoted in the last decade to the testing and evaluation of speech recognition in fighter aircraft and in the training of military and civilian air traffic controllers (ATC). Speech recognition is now commonplace in telephony and is becoming more widespread in computer gaming and simulation (Flach, 2004). People with disabilities are another part of the population that benefits from using speech recognition programs.
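To make the two measures concrete, the following is a minimal sketch of how they are commonly computed; it is not taken from the article. WER is taken here as the word-level edit distance (substitutions, deletions and insertions) divided by the number of words in the reference transcript, and the real-time factor as processing time divided by audio duration. The function names and the simple whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()   # assumption: whitespace tokenization
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent recognizing / duration of the audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds
```

A real-time factor below 1 means the recognizer runs faster than the audio plays. Note that WER can exceed 1 when the hypothesis contains many inserted words.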

In telecommunications, Interactive Voice Response (IVR) systems allow customers to access a company's database via a telephone touchtone keypad or by speech recognition. They can respond with pre-recorded or dynamically generated audio to further direct users on how to proceed. Often they are used to control functions where the interface can be broken down into a series of simple menu choices. In telecommunications applications, such as customer support lines, IVR systems generally scale well to handle large call volumes. However, the use of such systems is significantly affected by several extraneous factors, among which are noise, accent, casual speech styles and the medium of speaking (handset, headset, speakerphone, or cell phone). Even so, there have been significant advances in recent years. Automatic speech recognition capabilities now permit us to use speech as an interface for dictation and for information access. In such applications, the system must be interactive enough that the user experience is as good as dealing with a human; otherwise users will drift away from it. IVR technology is also being introduced into automobile systems for hands-free operation. Current deployment in automobiles revolves around satellite navigation, audio and mobile phone systems. This paper attempts to compare three different platforms in terms of their ability to adequately recognize the utterances made by callers to an airline agency and to classify them correctly. Essentially, we are trying to determine how well the collected utterances are recognized and properly classified into their stipulated categories.

Consider an airline agency in which there are six (6) possible options available to the caller as shown in Table 1 below.

Table 1: Table of caller options.

1: Reservations    To make reservations for a flight

2: Flight Status   To obtain information about a flight

3: Reconfirmation  To reconfirm a flight

4: Agent           To speak to an agent

5: Seat Selection  To select a seat on a flight with an existing
                   reservation

6: Other           Any other utterance

For each of these options, a grammar base is developed to accommodate the different ways in which a caller may indicate their intent, so as to avoid re-prompts. An utterance like "Make a Reservation" will have several alternative forms that are deemed to be synonymous caller inputs, as shown in Table 2 below:

Table 2: Confidence thresholds for utterance classification.

Platform     High confidence   Medium confidence   Low confidence
             threshold         threshold           threshold

Platform A   0.4-1.00          0.3-0.4             0-0.3
Platform B   0.3-1.00          0.25-0.3            0-0.25
Platform C   0.4-1.00          0.25-0.3            0-0.25
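As an illustration of how such thresholds might be applied, the sketch below maps a recognizer confidence score to a band using the ranges transcribed from the table. Two choices here are assumptions, not statements from the article: a score that falls exactly on a shared boundary is assigned to the higher band, and a score that falls in none of the listed ranges (the table as printed leaves 0.3-0.4 uncovered for Platform C) returns `None`. The names `THRESHOLDS` and `confidence_band` are hypothetical.

```python
from typing import Optional

# Ranges transcribed from the table as (lower, upper) per band.
THRESHOLDS = {
    "Platform A": {"high": (0.4, 1.0), "medium": (0.3, 0.4), "low": (0.0, 0.3)},
    "Platform B": {"high": (0.3, 1.0), "medium": (0.25, 0.3), "low": (0.0, 0.25)},
    "Platform C": {"high": (0.4, 1.0), "medium": (0.25, 0.3), "low": (0.0, 0.25)},
}


def confidence_band(platform: str, score: float) -> Optional[str]:
    """Return 'high', 'medium' or 'low', or None if no range covers the score."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    bands = THRESHOLDS[platform]
    # Checking from high to low sends boundary values to the higher band
    # (an assumption; the table itself lists overlapping end points).
    for name in ("high", "medium", "low"):
        lower, upper = bands[name]
        if lower <= score <= upper:
            return name
    return None
```

For example, `confidence_band("Platform A", 0.35)` falls in the medium band, while the same score on Platform C is not covered by any range in the table.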

The quality of the recognizer is enhanced by a large grammar base. The larger the grammar base, the more efficient the recognizer, resulting in a higher probability that recognition will occur on the first attempt, thereby reducing the number of re-prompts. …
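The idea of a grammar base that maps synonymous utterances to one caller option can be sketched as a simple lookup. This is a deliberate simplification for illustration: deployed platforms typically use dedicated grammar formats and statistical matching, not exact string comparison. Apart from "Make a Reservation", every phrase below is hypothetical and not drawn from the study's data; the names `GRAMMAR`, `normalize` and `classify` are likewise assumptions.

```python
import re

# Category names follow Table 1. All phrases except "make a reservation"
# are hypothetical examples, not the study's actual grammar.
GRAMMAR = {
    "Reservations": ["make a reservation", "book a flight", "i want to reserve a seat on a flight"],
    "Flight Status": ["flight status", "is my flight on time"],
    "Reconfirmation": ["reconfirm my flight", "reconfirmation"],
    "Agent": ["speak to an agent", "agent please"],
    "Seat Selection": ["select a seat", "change my seat"],
}


def normalize(utterance: str) -> str:
    """Lower-case, replace punctuation with spaces and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9\s]", " ", utterance.lower()).split())


# Flatten the grammar into a phrase -> category index for constant-time lookup.
PHRASE_TO_CATEGORY = {
    normalize(phrase): category
    for category, phrases in GRAMMAR.items()
    for phrase in phrases
}


def classify(utterance: str) -> str:
    """Return the caller option for an utterance, or 'Other' if unmatched."""
    return PHRASE_TO_CATEGORY.get(normalize(utterance), "Other")
```

Under this sketch, adding more synonymous phrases to a category directly raises the chance that an utterance matches on the first attempt; anything unmatched falls into the "Other" option.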
