This article examines the indexing and retrieval rates of five search tools on the World Wide Web. Character strings were planted in the Project Whistlestop Web site. Daily searches for the strings were performed for six weeks using Lycos, Infoseek, AltaVista, Yahoo, and Excite. The HTML title tag was indexed in four of the five search tools. Problems encountered included insufficient and inconsistent indexing.
In the burgeoning world of digital libraries, there is a need to understand how digital sources are made available through World Wide Web search tools. If you have ever wondered how search tools provide subject access to the millions of Web pages available, you are not alone. How can you ensure that registering your site will effectively provide access to millions of Web users? Perhaps there are methods that can be used to improve retrieval.
Within its home pages, each search tool service (Lycos, HotBot, etc.) provides an explanation of its indexing method. The authors wanted to explore selected search tools' indexing schemes, as well as the ability of these engines to harvest indexing terms from various sources within an HTML document. The authors also examined the time frame within which indexing occurs, in order to better understand and provide access to a typical World Wide Web site. Our efforts concentrated on how successfully several search tools retrieved documents with HTML title tags, textual character strings, and META descriptors planted within several layers of a sample Web site.
The Project Whistlestop (http://www.whistlestop.org) Web site was used for the experiment. Project Whistlestop, at the University of Missouri-Columbia, was designed to provide digital access to primary source materials from the Harry S. Truman Presidential Library and Museum. Using pages in this Web site, the authors planted contrived indexing terms at particular intervals within the text, and within several "layers" of pages, in order to test retrieval within certain Web search tools.
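For illustration, a planted term could appear in the three locations the study targets: the HTML title tag, a META description tag, and the visible body text. The string "xqzwhistletest" below is a hypothetical stand-in, not one of the actual strings planted in the study:

```html
<html>
<head>
  <!-- Hypothetical planted string "xqzwhistletest" in three locations -->
  <title>Truman Library Photographs xqzwhistletest</title>
  <meta name="description" content="Primary source materials xqzwhistletest">
</head>
<body>
  <p>Digitized correspondence and finding aids ... xqzwhistletest ...</p>
</body>
</html>
```

Because each search tool's robot may harvest terms from some of these locations and ignore others, planting the same string in all three makes it possible to infer which fields a given engine indexes.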
The literature on Web search tool adequacy has an abundance of reviews of multiple Web search tools. These reviews covered the usability of search tools and changes that occur in search tool operation. Data were compiled from these articles, as well as from the online help provided by each search tool producer, to be compared with the retrieval results for each engine. Several studies explored the success of certain search queries in several engines (Kimmel 1996; Courtois 1996; Randall 1995). This research, though helpful, does not attempt to judge the success of registration or the performance of robots. Scott Nicholson (1997) examines the indexing and abstracting methods of several search tools, and a new text by Maze, Moxley, and Smith (1997) titled Authoritative Guide to Web Search Tools was helpful in providing information concerning several of the search tools in this study.
Web Site Indexing
The prevalent indexing technique used by search tools relies on so-called robots (also called spiders). A robot is a program that automatically retrieves World Wide Web documents and follows the hyperlinks within those documents. Robots are often used in the process of creating an index for Web search tools. Depending on the search tool, the robot may begin by visiting pages that have been registered by individual Web site creators, sites that have numerous hyperlinks, or sites that have been identified as "most popular" (Koster 1998).
After identifying new sites, the robot records the Uniform Resource Locator (URL) in order to revisit and update the site. The new sites are then indexed. It is at this point in the process that questions arise concerning the method of information extraction used by the robot during the indexing process. Relevant information may be extracted from the title, META tags, or text of the new sites. …
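The extraction step described above can be sketched in code. The following is a minimal illustration, not any engine's actual implementation: a parser that pulls out the three candidate sources of indexing terms (the title, a META description, and visible text), plus the hyperlinks a robot would record for later visits. It uses only the Python standard library; the sample page is contrived.

```python
# Sketch of robot-style field extraction from a fetched HTML page.
# Hypothetical example only -- real search-tool robots are proprietary.
from html.parser import HTMLParser

class PageIndexer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.title = ""            # contents of the <title> tag
        self.meta_description = "" # content of <meta name="description">
        self.links = []            # hyperlinks to record for revisiting
        self.text = []             # visible text for full-text indexing
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name", "").lower() == "description":
            self.meta_description = attrs.get("content", "")
        elif tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif data.strip():
            self.text.append(data.strip())

sample = """<html><head><title>Project Whistlestop</title>
<meta name="description" content="Digital access to Truman Library sources">
</head><body><p>Primary source materials.</p>
<a href="/layer2/page.html">Next layer</a></body></html>"""

indexer = PageIndexer()
indexer.feed(sample)
print(indexer.title)             # Project Whistlestop
print(indexer.meta_description)  # Digital access to Truman Library sources
print(indexer.links)             # ['/layer2/page.html']
```

Whether a given engine actually uses each of these fields, and how quickly after registration, is precisely what the planted-string experiment was designed to reveal.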