the feature calculation stage. The speech segment is then extracted by observing the speech intensity, i.e., the volume. From the extracted speech, feature parameters are
computed and arranged as the output of the feature extraction stage. This output is
fed into the emotion recognition part. There, a combination of several neural networks, each of which is designed and trained to recognize
a specific emotion in speech, receives the feature parameters and carries out the
recognition process (Fig. 2).
Fig. 2 Configuration of the emotion recognition part.
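The parallel arrangement of per-emotion sub-networks can be sketched as follows. The feature dimensionality, the network sizes, and the scoring rule are illustrative assumptions; the chapter does not specify them, and the weights here are random stand-ins for trained ones.

```python
import numpy as np

# Seven of the eight target emotions named in the text; one sub-network per emotion.
EMOTIONS = ["neutrality", "anger", "sadness", "happiness",
            "fear", "surprise", "disgust"]

N_FEATURES = 16  # assumed dimensionality of the speech feature parameters

rng = np.random.default_rng(0)

class SubNetwork:
    """A small feed-forward network trained to score one specific emotion."""
    def __init__(self, n_in, n_hidden=8):
        # Random weights stand in for the trained parameters.
        self.w1 = rng.standard_normal((n_in, n_hidden)) * 0.1
        self.b1 = np.zeros(n_hidden)
        self.w2 = rng.standard_normal(n_hidden) * 0.1
        self.b2 = 0.0

    def score(self, x):
        h = np.tanh(x @ self.w1 + self.b1)                      # hidden layer
        return 1.0 / (1.0 + np.exp(-(h @ self.w2 + self.b2)))   # score in (0, 1)

# One sub-network per emotion, as in Fig. 2.
ensemble = {e: SubNetwork(N_FEATURES) for e in EMOTIONS}

def recognize(features):
    """Feed the feature parameters to every sub-network; pick the best score."""
    scores = {e: net.score(features) for e, net in ensemble.items()}
    return max(scores, key=scores.get), scores

features = rng.standard_normal(N_FEATURES)  # stand-in for real feature parameters
emotion, scores = recognize(features)
print(emotion)
```

Because each sub-network is trained only for its own emotion, the networks can be trained independently, and the combination step reduces to comparing their output scores.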
2.2 Emotion recognition experiment
1. Speech database collection
It is necessary to train each of the sub-networks for the recognition of emotions.
The most important and most difficult issue for neural network training is how to
collect a large amount of speech data containing emotions. As our target is content-independent emotion recognition, we adopted one hundred phoneme-balanced
words as the training word set. Since it is difficult for ordinary people to intentionally utter these words with emotions, we adopted the following strategy.
A. First, we ask a radio actor to utter the one hundred words with each of the eight
emotions. As a professional, he is used to speaking various kinds of words, phrases,
and sentences with intentional emotions.
B. Then, we ask speakers to listen to each of these utterances and mimic the tone
of each utterance. We record the utterances spoken by these ordinary people.
Since our target is speaker-independent as well as content-independent emotion
recognition, the following utterances spoken by many speakers were prepared for
the training process.
100 phoneme-balanced words
fifty male speakers and fifty female speakers
eight emotions: neutrality, anger, sadness, happiness, fear, surprise, disgust, …
Each speaker uttered the 100 words eight times, using different emotions.
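Under this collection plan, the size of the training corpus follows directly; the totals below assume every speaker completed all recordings.

```python
words = 100            # phoneme-balanced words
speakers = 50 + 50     # fifty male and fifty female speakers
emotions = 8           # each word uttered once per emotion

per_speaker = words * emotions
utterances = per_speaker * speakers
print(per_speaker, utterances)  # 800 utterances per speaker, 80000 in total
```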
2. Training and recognition experiment
We used thirty speakers for training out of the fifty speakers used for data collection. To study the effect of the number of training speakers, we
carried out five neural network training runs using from one to thirty speakers. In
addition, we carried out two types of recognition experiments to evaluate the performance of the resulting neural networks.
Source: Hans-Jörg Bullinger and Jürgen Ziegler (Eds.), Human-Computer Interaction: Communication, Cooperation, and Application Design. Lawrence Erlbaum Associates, Mahwah, NJ, 1999, p. 143.