Interactions in natural language dialogues are an essential part of human social exchanges, ranging from social conventions such as greetings, to simple question-answer pairs, to task-based dialogues for coordinating activities, topic-based discussions, and all kinds of more open-ended conversations. As a result, the ability of future social and service robots to interact with humans in natural ways (Scheutz et al. 2007) will critically depend on developing capabilities for humanlike, dialogue-based natural language processing (NLP) in robotic architectures. Unlike other NLP contexts such as story understanding or machine translation, however, natural language processing on robots has at least the following six properties: real-time, parallel, spoken, embodied, situated, and dialogue-based.
Real-time means that all processing must occur within the time frame of human processing, at the level of both comprehension and production. It also means that constraints will have to be incorporated incrementally as they occur, analogous to human language processing.
Parallel means that all stages of language processing must operate concurrently to mutually constrain possible meaning interpretations and to allow for the generation of responses (such as acknowledgements) while an ongoing utterance is being processed.
Spoken means that language processing necessarily operates on imperfect acoustic signals with varying quality that depends on the speaker and the background noise. In addition to handling prosodic variations, this includes typical features of spontaneous speech such as various types of disfluencies, slips of the tongue, or other types of errors that are usually not found in written texts.
Embodied means that robots have to be able to process multimodal linguistic cues such as deictic terms accompanied by bodily movements, or other gestures that constrain possible interpretations of linguistic expressions. It also means that the robot will have to be able to produce similar gestures that are expected by human interlocutors to accompany certain linguistic constructs.
Situated means that, because speaker and listener are located in an environment, each has a unique perspective from which they perceive and experience events, which, in turn, affects how sentences are constructed and interpreted. This includes the incremental integration of perceivable context in the interpretation of referential phrases, as well as sensitivity to nonlinguistic coordination processes such as the establishment of joint attention.
Dialogue-based means that information flow is not unidirectional but includes bidirectional exchanges between interlocutors based on different dialogue schemes that constrain the possible dialogue moves participants can make at any given point.
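As a rough illustration of how a dialogue scheme can constrain the moves available at a given point, the following minimal sketch models a scheme as a table from the previous move to the set of moves permitted next. The move names, the `SCHEME` table, and both helper functions are hypothetical and chosen only for illustration; they are not taken from any particular robotic architecture.

```python
from typing import Dict, Iterable, Set

# Hypothetical dialogue scheme (an assumption for illustration):
# maps the previous dialogue move to the set of moves allowed next.
SCHEME: Dict[str, Set[str]] = {
    "start": {"greeting", "command", "question"},
    "greeting": {"greeting", "command", "question"},
    "question": {"answer", "clarification_request"},
    "command": {"acknowledgement", "clarification_request"},
    "clarification_request": {"clarification"},
    "clarification": {"answer", "acknowledgement", "clarification_request"},
    "answer": {"acknowledgement", "question", "command"},
    "acknowledgement": {"command", "question", "greeting"},
}


def allowed_moves(last_move: str) -> Set[str]:
    """Return the moves the scheme permits after `last_move`."""
    return SCHEME.get(last_move, set())


def is_valid_dialogue(moves: Iterable[str]) -> bool:
    """Check that every move in the sequence is licensed by the one before it."""
    last = "start"
    for move in moves:
        if move not in allowed_moves(last):
            return False
        last = move
    return True


if __name__ == "__main__":
    # A command may be followed by a clarification exchange before it is acknowledged.
    print(is_valid_dialogue(
        ["command", "clarification_request", "clarification", "acknowledgement"]
    ))
    # In this toy scheme, a question cannot be directly acknowledged.
    print(is_valid_dialogue(["question", "acknowledgement"]))
```

A flat table like this ignores which interlocutor holds the turn and any nesting of subdialogues; it is meant only to show that the set of admissible next moves depends on the dialogue state.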
While these six aspects present significant challenges for the development of robotic architectures with dialogue capabilities, natural language processing on robots also has several advantages that other NLP contexts lack. For example, spoken natural language exchanges typically consist of shorter sentences with simpler grammatical constructions than written language (thus making parsing easier and more efficient). Moreover, the vocabulary employed is much smaller, and the distribution of sentence types is different (including more commands and acknowledgements, and fewer declarative sentences, than written language). Also, unlike with written texts, perceptual context can be used to disambiguate expressions, and most importantly, ambiguities or misunderstandings in general can often be resolved through subsequent clarifying dialogue. The option to request clarification also allows interlocutors to handle new, unknown expressions naturally.
Because there are many different forms of dialogue, each with its own rules and conventions based on social norms and etiquette (such as small talk, interviews, counseling talks, and others), and because these might, moreover, require tracking various nonlinguistic aspects (such as contextual information, interlocutor eye gaze, and affective as well as other mental states), we focus on task-based dialogues in this article. …