Speech Corpus Generation from DVDs of Movies and TV Series

By Kepuska, Veton Z.; Rojanasthien, Pattarapong | Journal of International Technology and Information Management, January 2011


INTRODUCTION

The term speech corpus refers to a database of speech data that includes audio files and their corresponding text transcriptions. The idea of generating a corpus from DVDs of movies and TV series was inspired by studies on the prosodic analysis of Wake-Up-Word technology (Kepuska & Klein, 2009; Kepuska & Shih, 2010; Kepuska, Gurbuz, Rodriguez, Fiore, Carstens, Converse, & Metcalf, 2008). Those studies showed that prosodic features extracted from the pitch of the WUW-II speech corpus did not yield the expected results. The cause was the nature of the corpus: it was created by having people read transcripts while being recorded over the telephone (landline, speaker-phone, mobile, etc.), which made it less natural than speech obtained from actual conversations between people (Kepuska & Shih, 2010).

Obtaining speech corpora by hiring a professional company or by buying them (e.g., from the Linguistic Data Consortium: http://www.ldc.upenn.edu/) costs a significant amount of money, as shown in Figure 1 (LDC, University of Pennsylvania, 2009). Creating a corpus by recording conversations (Aiken, 2009) and then writing the corresponding text transcriptions is also very time consuming and depends on the quality of the recordings (Musa, 2010). Neither method is typically feasible for researchers with a small budget and limited time for data collection. For them, generating corpora from DVDs of movies and TV series is a better option. Not only are the utterances in such corpora natural speech, but creating a large set of speech corpora also costs nothing, assuming the DVDs have already been bought.

Although most DVDs of movies and TV series are likely to be protected by copyright law (Lippert, 2007), this research should not violate that law, because 1) there is no public distribution of the viewing content from the DVDs, and 2) the corpora generated from the DVDs will be used only for "in-house" research on speech recognition systems (Hemming & Lassi, 2010). The proposed approach should not cause copyright holders to lose profit from selling their DVDs. On the contrary, this corpus generation should increase sales, since more DVDs will be bought to meet the need for creating corpora from them (see Table 1).

The goal of this work is to explore the potential of collecting data from DVDs of movies and TV series, and to develop an application that puts this concept to use. Each component of the data collection system is described, starting from extracting the data from a DVD and ending with organizing the generated corpus into a proper structure. The Data Collection Toolkit, a C# .NET application developed to carry out this task, is also described.

Some examples are provided of how the Data Collection Toolkit and the speech corpora generated by this system can be used, namely: browsing the TIMIT corpus for study, creating features for the prosodic analysis of Wake-Up-Word's context detection, and training CMU Sphinx's acoustic model.

To verify the quality of the speech corpora generated by this system, the TIMIT corpus is used to evaluate the time markers of words (where they start and end) in the text transcriptions generated by the forced alignment process. Utterances from the TIMIT corpus are also used to evaluate the error rates of acoustic models trained on corpora generated from DVDs of movies and TV series.
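
The time-marker check can be illustrated with a short Python sketch. It is not the authors' evaluation procedure: the metric (mean absolute deviation of word start and end times), the function names `parse_wrd` and `boundary_errors`, and the sample values are all assumptions made for illustration. The only fixed points are that TIMIT word files list `start_sample end_sample word` per line and that TIMIT audio is sampled at 16 kHz.

```python
TIMIT_SAMPLE_RATE = 16000  # TIMIT audio is sampled at 16 kHz


def parse_wrd(wrd_text, sample_rate=TIMIT_SAMPLE_RATE):
    """Parse TIMIT .wrd content into (word, start_seconds, end_seconds) tuples."""
    words = []
    for line in wrd_text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        start, end, word = int(parts[0]), int(parts[1]), parts[2]
        words.append((word, start / sample_rate, end / sample_rate))
    return words


def boundary_errors(reference, hypothesis):
    """Return mean absolute start and end deviations in seconds.

    Hypothetical metric: both inputs must contain the same word sequence.
    """
    if [w for w, _, _ in reference] != [w for w, _, _ in hypothesis]:
        raise ValueError("word sequences differ; align the transcripts first")
    if not reference:
        raise ValueError("no words to compare")
    n = len(reference)
    start_err = sum(abs(r[1] - h[1]) for r, h in zip(reference, hypothesis)) / n
    end_err = sum(abs(r[2] - h[2]) for r, h in zip(reference, hypothesis)) / n
    return start_err, end_err


# Illustrative values only, not taken from the TIMIT corpus.
REFERENCE_WRD = """1600 4800 she
4800 8000 had
"""

# Assumed shape of a forced aligner's output: (word, start_s, end_s).
HYPOTHESIS = [("she", 0.11, 0.29), ("had", 0.29, 0.52)]

reference = parse_wrd(REFERENCE_WRD)
start_err, end_err = boundary_errors(reference, HYPOTHESIS)
print(f"mean start deviation: {start_err:.3f} s, mean end deviation: {end_err:.3f} s")
```

Smaller deviations would suggest that the forced alignment places word boundaries close to the hand-labeled TIMIT markers.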

COMPONENTS OF DATA COLLECTION SYSTEM

The data collection system consists of five components. The process begins with selecting a DVD of a movie or TV series. The audio is then extracted from the video on the DVD. The text of the subtitles is converted into text transcriptions. The subtitles are also used to cut the extracted audio file into short utterances (audio segmentation). …
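
The subtitle-driven segmentation step can be sketched as follows. This is a minimal Python illustration, not the Data Collection Toolkit's C# implementation. It assumes the subtitles are already available as SubRip-style (.srt) text and that the extracted audio is sampled at 16 kHz; the names `Segment`, `parse_srt`, and `sample_range` are hypothetical. Each subtitle cue yields one transcription line and one sample range for cutting the audio.

```python
import re
from dataclasses import dataclass

# SubRip timing line, e.g. "00:01:02,500 --> 00:01:04,000"
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


@dataclass
class Segment:
    start_s: float  # cue start time in seconds
    end_s: float    # cue end time in seconds
    text: str       # transcription text for the utterance


def to_seconds(h, m, s, ms):
    """Convert hour, minute, second, millisecond strings to seconds."""
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0


def parse_srt(srt_text):
    """Turn SubRip text into a list of Segment objects."""
    segments = []
    # Cues are separated by blank lines.
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        for i, line in enumerate(lines):
            match = TIMING.search(line)
            if match:
                g = match.groups()
                # Everything after the timing line is the spoken text.
                text = " ".join(lines[i + 1:])
                text = re.sub(r"<[^>]+>", "", text).strip()  # drop tags like <i>
                if text:
                    segments.append(
                        Segment(to_seconds(*g[:4]), to_seconds(*g[4:]), text)
                    )
                break
    return segments


def sample_range(segment, sample_rate=16000, pad_s=0.0):
    """Map a segment to (start, end) sample indices, with optional padding."""
    start = max(0, int(round((segment.start_s - pad_s) * sample_rate)))
    end = int(round((segment.end_s + pad_s) * sample_rate))
    return start, end


EXAMPLE = """1
00:00:01,000 --> 00:00:02,500
<i>Hello there.</i>

2
00:00:03,250 --> 00:00:05,000
How are
you today?
"""

segments = parse_srt(EXAMPLE)
ranges = [sample_range(s) for s in segments]
for seg, (start, end) in zip(segments, ranges):
    print(f"samples {start}-{end}: {seg.text}")
```

The `pad_s` parameter is included because subtitle timings are written for on-screen display and may not match speech onsets exactly; a small margin is one possible way to avoid clipping words, offered here as a design choice rather than something the article prescribes.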
