Magazine article Online

Text Retrieval Technologies for Image (and Other) Databases

Article excerpt

My last column discussed some of the basics of imaging technology. We saw that databases containing bitmapped images are not searchable; they must be converted to text (ASCII) format first by using optical character recognition (OCR) techniques. I ended the column with a caveat that the result of such a process may be a mass of image and textual data with little structure. In this column I will consider how to deal with that problem.

Documents are increasingly being imaged and then converted to text files by OCR. "No problem," you say, "just mount the text file on an online retrieval system as a private database and use standard text searching techniques on it." Unfortunately, it's not that simple because:

Full text searching is difficult for information professionals, let alone novice users, as Carol Tenopir and others have pointed out so well [1].

Many users do not wish to lose control of their databases by exporting them to an unfamiliar system and then being required to access their data using dial-up lines.

For cost or proprietary reasons, users may not wish to use an external system to maintain their data.

Many text-based information retrieval systems cannot handle images or hyperlinks between text and images.

These considerations have spurred the appearance of an array of text management systems; many of them are specifically equipped to interface with imaging systems and handle OCR output.

TEXT MANAGEMENT

Even if a database contains only images and is not suitable for OCR conversion, an associated text database and text management system are usually still necessary. Whatever the database contains - graphics, pictures, video clips, or even voice segments (as in a full-fledged multimedia system) - an accompanying text database is the entry vehicle. Text, not images, is the medium we use to describe the contents of the database and create and deliver the indexes. Indexing considerations are often neglected in the design of document imaging systems. In an excellent article entitled "The Dark Side of Document Image Processing," Locke [3] eloquently presents the case for text management as an integral part of any imaging system involving documents. (Locke's article is part of a BYTE theme issue on imaging. I recommend the entire series of papers as worthwhile reading for anyone interested in learning more about document imaging systems.) Several points from Locke's article are well worth quoting here; they make a strong case for text management systems.

... the subject analysis that librarians perform to create categories and relationships is strongly akin to what the AI literature calls knowledge engineering.

Apart from OCR, manual indexing is the only way to tie these pictures to meaningful concepts by which they might be retrieved.

Unless you have already developed a comprehensive controlled vocabulary ... you are either leaving a big hole in your imaging budget or designing an information system that will stonewall everyone who uses it.

What documents are about simply can't be captured by casually jotting a few words in a database header or even by dumping all the discrete words the documents contain into an inverted file index.

Effective retrieval is the only rational reason to build text/image systems.

TEXT RETRIEVAL TECHNOLOGIES

Text management systems share some of the capabilities of the online retrieval systems commonly in use in many libraries and information centers today. However, they are specifically designed to handle large volumes of text (often on a real-time basis as with a newswire feed) and be used by novices, not information professionals. To understand the concepts upon which text management systems are based, it is helpful to consider four major text retrieval technologies (Table 1).

Word searching (sometimes called "keyword searching") is the simplest form of retrieval; the user enters the character string being sought, and the computer finds every occurrence of that string. …
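As a rough illustration of word searching in its plainest form, the following Python sketch scans a text for every occurrence of a character string and reports where each one starts. The function name find_occurrences and the sample text are hypothetical, chosen only for this example. Matching here is exact and case-sensitive, which is an assumption of the sketch rather than a description of any particular retrieval system.

```python
def find_occurrences(text, term):
    """Return the starting offset of every occurrence of term in text.

    Assumption for this sketch: matching is exact and case-sensitive,
    and overlapping occurrences are all reported.
    """
    if not term:
        return []  # an empty search string matches nothing useful
    positions = []
    start = text.find(term)
    while start != -1:
        positions.append(start)
        # Resume one character later so overlapping matches are found too.
        start = text.find(term, start + 1)
    return positions


sample = "Images are stored as image files; each image is indexed."
print(find_occurrences(sample, "image"))  # prints [21, 39]
```

Because the match is a literal string comparison, the capitalized "Images" at the start of the sample is not found, while the lowercase "image" is found twice.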
