The air was crisp, the days sunny. The scene was Rockefeller University in its private park-like setting just yards from the hurly-burly traffic and noise of New York City. The occasion: RIAO 94 (Organized by Centre de Hautes Etudes Internationales D'Informatique Documentaire [C.I.D., France] and Centre for Advanced Study of Information Systems, Inc. [C.A.S.I.S., U.S.]) where researchers and practitioners from many countries gathered to discuss "Intelligent Multimedia Information Retrieval Systems and Management"--prototypes, operating systems, and applications. The dates were October 11-13.
Ever since my chapter on Electronic Image Information appeared in the Annual Review of Information Science and Technology (ARIST) in 1987, I have been looking for people who are retrieving images by their content rather than by keywords or ID numbers. And here they were at RIAO. In fact, one whole afternoon was given to multimedia information representation and retrieval. (Much of the conference dealt with text.)
Content Based Image/Video Retrieval
Several presentations focused on the extraction and representation of the content of video clips and images. The content was captured well enough for a computer to select material that met the needs of a wide range of users and purposes.
Images and multimedia present a multitude of challenges in storage and retrieval. One approach to solving these problems was described by Alex Pentland, The Media Lab, M.I.T. The lab's Photobook System is a set of image tools that codes specific sorts of things appearing in video--people, textures, cars, etc. This procedure is "similar to model-based or semantic coding," Pentland explained, "but with the additional constraint that items with similar appearance have similarly coded representation."
Key frames can be sorted depending on their content, and motion can be analyzed. Detectors are tuned to find specific things in the video like camera motion, foreground, background, and people's faces.
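The constraint Pentland described, that items with similar appearance receive similar codes, suggests that browsing by content can be treated as a search for nearby codes. The following is only a minimal sketch of that idea, not Photobook's actual coding: the three-number codes, the item names, and the function names are all hypothetical, and plain Euclidean distance stands in for whatever similarity measure the real system uses.

```python
import math

def distance(a, b):
    # Euclidean distance between two equal-length codes (an assumed measure).
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_by_similarity(query_code, database):
    # Return item names ordered from most to least similar to the query.
    return sorted(database, key=lambda name: distance(query_code, database[name]))

# Hypothetical codes: similar-looking items were given nearby values by hand.
codes = {
    "face_01": [0.9, 0.1, 0.2],
    "face_02": [0.8, 0.2, 0.1],
    "car_01": [0.1, 0.9, 0.7],
    "texture_01": [0.2, 0.3, 0.9],
}

ranked = rank_by_similarity(codes["face_01"], codes)
print(ranked)
```

With these made-up values the second face ranks directly after the query face, which is the behavior the constraint is meant to guarantee.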
The project is part of The Media Lab's research with video and images, funded by BT (British Telecom). Some practical applications: in the video services, a tool allows travel agents to search and present information to customers about potential vacation spots. In the image services, a tool allows customers like department store buyers, police departments, or dating services to browse large image databases. A third service is models from video--a tool that allows architects, city planners, and film makers to construct realistic computer models from videotape.
To retrieve video sequences in the travel project, for example, the computer must have some way of knowing what is in the video. The researchers used stream-based annotation to describe what is happening in the video over time. Media Streams, a system developed by M.I.T.'s Marc Davis and Kenneth Haase, uses an iconic annotation language to represent knowledge about the video content.
The system includes Power Assisted Annotation and Power Assisted Presentation. In the former, the user can bring up frames and annotate them or indicate what is not wanted. It might be the color of the sky that's unacceptable. By combining stream-based annotation of video content with memory-based representation, the researchers can capture the semantic structure of the video. Tools that understand enough about the video content help with the annotation process.
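One way to picture stream-based annotation is as a set of labels, each attached to a time interval of the video rather than to a fixed clip. The sketch below is an illustration under that assumption only; it does not reproduce the Media Streams iconic language, and the `Annotation` class, the labels, and the timings are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Annotation:
    label: str
    start: float  # seconds from the beginning of the video
    end: float    # exclusive end time in seconds

def labels_at(annotations, t):
    # All labels whose interval covers time t.
    return {a.label for a in annotations if a.start <= t < a.end}

def overlap(annotations, label_a, label_b):
    # Intervals during which both labels are active at once.
    result = []
    for a in annotations:
        if a.label != label_a:
            continue
        for b in annotations:
            if b.label != label_b:
                continue
            start, end = max(a.start, b.start), min(a.end, b.end)
            if start < end:
                result.append((start, end))
    return sorted(result)

# Hypothetical annotations layered over one travel video.
stream = [
    Annotation("beach", 0.0, 12.0),
    Annotation("person", 4.0, 9.0),
    Annotation("camera-pan", 8.0, 15.0),
]

print(labels_at(stream, 8.5))
print(overlap(stream, "beach", "person"))
```

Because the annotations overlap freely in time, a segment can be cut out at retrieval time to match the query instead of being fixed in advance.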
Power Assisted Presentation includes retrieval of multimedia material but not in the manner of retrieval in a standard database. Rather, it is a process of storytelling or of creating a presentation. In storytelling, temporal media convey meaning by sequencing video and sound elements to tell the story. Here the goal is to make it possible for computers to build coherent video narratives using story models supplied by humans.
Video presentation can be created, Pentland explained, by filling out a story template with video whose semantics satisfy the constraints imposed by the template. …
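Filling a story template can be sketched as a small matching problem: each slot in the template names the content it requires, and a clip qualifies if its annotations include everything the slot asks for. The example below is a simplified illustration under that assumption, not the method presented at the conference; the function name, the clip names, and the greedy first-match strategy are all hypothetical.

```python
def fill_template(template, clips):
    # template: ordered list of sets of required labels, one set per slot.
    # clips: dict mapping a clip name to the set of labels annotating it.
    used = set()
    sequence = []
    for required in template:
        choice = None
        for name, labels in clips.items():
            # A clip fits when it has every required label and is still unused.
            if name not in used and required <= labels:
                choice = name
                break
        if choice is not None:
            used.add(choice)
        sequence.append(choice)  # None marks a slot no clip could fill
    return sequence

# Hypothetical annotated clips for a travel presentation.
clips = {
    "clip_a": {"beach", "wide-shot"},
    "clip_b": {"hotel", "exterior"},
    "clip_c": {"beach", "person", "close-up"},
}

# Hypothetical story template: arrive at the hotel, see the beach, meet people.
template = [{"hotel"}, {"beach", "wide-shot"}, {"beach", "person"}]

story = fill_template(template, clips)
print(story)
```

A greedy pass like this takes the first clip that fits each slot; a fuller system would presumably also weigh how well neighboring clips fit together.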