Search across Different Media: Numeric Data Sets and Text Files

Digital technology encourages the hope of searching across and between different media forms (text, sound, image, numeric data). Topic searches are described for two different media, text files and socioeconomic numeric databases, and also for transverse searching, whereby retrieved text is used to find topically related numeric data and vice versa. Direct transverse searching across different media is impossible. Descriptive metadata provide enabling infrastructure, but usually require mappings between different vocabularies and a search-term recommender system. Statistical association techniques and natural-language processing can help. Searches in socioeconomic numeric databases ordinarily require that place and time be specified.


A hope for libraries is that new technology will support searching across an increasing range of resources in a growing digital landscape. The rise of the Internet provides a technological basis for shared access to a very wide range of resources. The reality is that network-accessible resources, like the contents of a well-stocked reference library, are quite heterogeneous, especially in the variety of indexing, classification, categorization, and other forms of metadata. However, the use of digital technology implies a degree of technical compatibility between different media, sometimes referred to as "media convergence," and these developments encourage the prospect of being able to search across and between different media forms--notably text, images, sound, and numeric data sets--for different kinds of material relating to the same topic. To examine the practical problems involved, the authors undertook to demonstrate searching between and across two different media forms: text files and socioeconomic numeric data sets. (1)

Two kinds of search are needed. First, it should be possible to do a topical search in multiple media resources, so that one can find, for example, both pertinent factual numeric data and relevant discussion. (One difficulty is that the vocabulary used to classify the numeric data is ordinarily quite different from the subject headings used for books, magazine articles, and newspaper stories about the same topic.) Second, when intriguing data values are encountered, one would like to move directly to topically relevant texts. Likewise, when a questionable statement is read, one would like to be able to find relevant statistical evidence. Search support is therefore needed to facilitate such transverse searching among resources by establishing connections, transferring data, and invoking appropriate utilities in a helpful way.

Both problems were addressed through the design and demonstration of a gateway providing search support for both text and socioeconomic numeric databases. First, the gateway should help users conduct searches in databases of different media forms by accepting a query in the searcher's own terms and then suggesting the specialized categorization terms to search for in the selected resource. Second, if something interesting is found in a socioeconomic database, the gateway should help the searcher to find documents on the same topic in a text database, and vice versa. Selection of the best search terms in target databases is supported by the use of indexes to the categories (entries, headings, class numbers) in the system to be searched. These search-term recommender systems (also known as "entry vocabulary indexes") resemble Dewey's "Relativ Index," but are created using statistical association techniques. (2)
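To make the idea of a search-term recommender concrete, the following is a minimal sketch, not the authors' implementation. It assumes a small training set of already-categorized records, and it uses pointwise mutual information as one possible statistical association measure; the text does not say which measure the gateway used. The function names, sample records, and category labels (such as TRADE-VEHICLES) are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep purely alphabetic tokens."""
    return [w for w in text.lower().split() if w.isalpha()]


def build_index(records):
    """Build a word -> {category: weight} index from (text, categories) pairs.

    Weights are pointwise mutual information scores (an assumed measure);
    only positively associated word/category pairs are kept.
    """
    word_count = Counter()
    cat_count = Counter()
    pair_count = defaultdict(Counter)
    n = 0
    for text, cats in records:
        n += 1
        words = set(tokenize(text))
        cats = set(cats)
        word_count.update(words)
        cat_count.update(cats)
        for w in words:
            for c in cats:
                pair_count[w][c] += 1

    index = defaultdict(dict)
    for w, cats in pair_count.items():
        for c, joint in cats.items():
            pmi = math.log((joint * n) / (word_count[w] * cat_count[c]))
            if pmi > 0:
                index[w][c] = pmi
    return index


def recommend(index, query, top_k=3):
    """Rank categories by the summed association weights of the query words."""
    scores = Counter()
    for w in tokenize(query):
        for c, weight in index.get(w, {}).items():
            scores[c] += weight
    return [c for c, _ in scores.most_common(top_k)]


# Hypothetical training records: free text paired with assigned categories.
RECORDS = [
    ("automobile exports to japan", ["TRADE-VEHICLES"]),
    ("car imports and tariffs", ["TRADE-VEHICLES"]),
    ("wheat harvest yields", ["AGRI-CEREALS"]),
    ("grain prices and wheat exports", ["AGRI-CEREALS"]),
]

INDEX = build_index(RECORDS)
suggested = recommend(INDEX, "wheat exports")
```

In this sketch a word that occurs equally often under every category, such as "exports" in the sample records, carries no weight, so the recommendation for a query is driven by its more distinctive words. A production index would be trained on a large body of categorized records from the target database.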

Four characteristics of this investigation need to be noted:

1. Searching independent sources: The authors were not concerned with ingesting resources from different sources into a consolidated local data repository and searching within it. The interest lay, instead, in being able to search effectively in any accessible resource as and when one wants. This implies that interoperability issues in dealing with the native query languages and metadata vocabularies of remote repositories can be solved. …

