Sophisticated Searches of the World Wide Web: Distributed Search Systems

Serious information seekers have long agreed that finding information on the Internet is a hit-or-miss proposition. Those of us accustomed to the features of major online services, or even the card catalog, are frustrated with the lack of structure and organization Internet resources usually have, as well as with the uneven quality of their contents. Particularly on the World Wide Web, resources appear and disappear with no warning, location names change, and hyperlinks get broken. Even if intact, links may provide little indication of the worth of a particular resource until users display and read it, a time-consuming activity at best.

Furthermore, while a few information-finding tools have evolved on the Internet, most of them search only the titles of documents and feel flimsy after the robust search engines professional searchers use. We pine for a solution that would magically search World Wide Web sites with a "real" search engine that handles full text. We dream of the day we could search all the World Wide Web, even all the Internet, in one fell swoop. How wonderful if we could just enter a search statement and have thousands of Web sites searched instantly!

Distributed search systems may be the answer to our dilemma. Like KRI/Dialog's OneSearch, they can search many databases at once. The difference is that OneSearch scans reasonably compatible databases in a single location, while distributed search systems can search databases at several locations, anywhere on the Internet. As with OneSearch, the results of a search come back in a single list of hits. As with KRI/Dialog's Target searches, the results appear merged and listed in order from most to least relevant, no matter which database they come from.

Distributed search systems for the World Wide Web are a relatively new development. Personal Library Software announced PLWebServer at the Internet World conference in December 1994. Dienst came out with its new, user-friendly version in February 1995. In addition, some experimental systems like Lycos at Carnegie Mellon may evolve into powerful search tools.

Let's be clear at the outset. These products do not try to index all the Internet. In that respect, they differ from Internet search tools like Archie, which keep an index of all file names on a single server. Archie automatically scouts the Internet for names of files and maintains a central index of them. Sampling the Internet is usually a month-long process, so that such indexes are never completely up to date.

In contrast, distributed search systems keep a list of all locations to search, with perhaps a general description of contents, on a central search server. They search these locations only at the moment of a search's initiation. They perform searches simultaneously, in parallel, on each Web site, and merge results at the central search server. This means that every search covers all the contents of each Web site, even those recently added. There's a good reason for this approach. Indexing the Internet is a daunting task, and with the changeability of the Web, keeping the indexing data for a huge number of sites would require enormous storage space as well as an impossible schedule for keeping that index up to date.
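The approach described above can be illustrated with a minimal Python sketch. It is not how any of the products discussed here is implemented. The site names, the in-memory `SITES` table standing in for remote Web sites, the functions `search_site` and `distributed_search`, and the term-frequency relevance score are all hypothetical, chosen only to show the shape of the technique: a central server holds a list of locations, queries each one in parallel at search time, and merges the hits into one relevance-ranked list.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for remote Web sites. A real system would send the
# query over the network to a search engine running at each location.
SITES = {
    "site-a.example": {
        "intro.html": "distributed search systems merge results",
        "news.html": "conference announcements for December",
    },
    "site-b.example": {
        "faq.html": "how to search full text on the web",
        "search.html": "search search search tips",
    },
}


def search_site(site, query):
    """Search one site's current contents; return (score, site, doc) hits."""
    terms = query.lower().split()
    hits = []
    for doc, text in SITES[site].items():
        words = text.lower().split()
        # Toy relevance score (an assumption): total occurrences of query terms.
        score = sum(words.count(term) for term in terms)
        if score > 0:
            hits.append((score, site, doc))
    return hits


def distributed_search(query, sites):
    """Query every listed site in parallel and merge hits by relevance."""
    with ThreadPoolExecutor(max_workers=max(1, len(sites))) as pool:
        per_site = list(pool.map(lambda s: search_site(s, query), sites))
    merged = [hit for hits in per_site for hit in hits]
    # Most relevant first; ties broken by site and document name.
    merged.sort(key=lambda hit: (-hit[0], hit[1], hit[2]))
    return merged


if __name__ == "__main__":
    for score, site, doc in distributed_search("search", list(SITES)):
        print(score, site, doc)
```

Because the central server stores only the list of locations and each site is searched when the query is issued, a document added to a site a moment earlier is found by the very next search. The sketch also assumes that relevance scores from different sites are directly comparable, which a real system would have to ensure.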

PLWebServer

PLWebServer is built on the popular Personal Librarian search software from Personal Library Software and has a comfortable feel to it. Searchers may be familiar with this search engine through use of other commercial services like DataTimes or Congressional Quarterly's Washington Alert Service. Another version is becoming a widely used finding tool on the Web. Last year, PLS added World Wide Web compatibility to its product and recently added the ability to search other PLServer sites anywhere on the Web. WAIS-compatible sites will be added soon.

PLWebServer creates a "logical database" by conducting parallel searches on each remote server. Then it consolidates the search results from each database into one merged list of retrieved items, ranked for relevance. …