A New Era of Search Engines: Not Just Web Pages Anymore

Article excerpt

Web search engines are entering a new era.

With traditional abstracting and indexing services, a rigorous search requires expertise in editorial policies: What kinds of documents, time frames, degree of selectivity, and exceptions should be expected? Is it journal articles, technical reports, or newspapers? Does it go back to 1987 or 1995? Does it have cover-to-cover indexing, full text, and/or abstracts? Are letters to the editor included or not?

Searchers tend not to apply the same rigor to Web search engines, assuming that searches retrieve only Web sites and Web pages. That is changing. Web search engines no longer search just plain old ordinary Web sites. Information professionals should apply the same evaluation techniques to a Web search as they have to traditional online databases. You need to think in terms of what kinds of "documents" you should expect to get. Although Web pages still predominate, consider the varying types of content now searchable on the Web.

The serious searcher should indulge in some traditional thinking about what is retrievable and expand the list of categories included as retrievable "document types." In addition to Web pages, searches can be done for news articles, PDF (and other) file types, images, audio files, video files, and some other odds and ends. With a heightened awareness along these lines, you will have a better idea of what you can get through a particular tool (engine), which engine(s) to use for a particular question, and overall, a much better idea of what is possible.

To understand content coverage, keep two things in mind:

(1) In some cases various kinds of documents will be retrieved automatically in a regular search using the search engine's main query box(es). In other cases, retrieving the various documents will require that you specify a separate database.

(2) Not only is the variety of content important, but also the searchability of that content. Can you specify that you just want, for example, PDF files, and can you specify particular characteristics of these various document types? With images, for example, can you specify file format, colorations, and file sizes?
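The difference between content being present and content being searchable can be sketched in a few lines of Python. This is only an illustrative toy, not how any particular engine works: the Doc record, the search function, the attribute names ("format", "color"), and the example.com URLs are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """One retrievable item in a hypothetical search engine database."""
    url: str
    doc_type: str                 # e.g. "html", "pdf", "image"
    text: str = ""                # text the engine can match against
    attrs: dict = field(default_factory=dict)  # e.g. {"format": "gif"}


def search(docs, term, doc_type=None, **attrs):
    """Return docs containing term, optionally limited by type and attributes."""
    term = term.lower()
    hits = []
    for d in docs:
        if term not in d.text.lower():
            continue
        if doc_type is not None and d.doc_type != doc_type:
            continue
        # Every requested attribute must match exactly.
        if any(d.attrs.get(k) != v for k, v in attrs.items()):
            continue
        hits.append(d)
    return hits


# Made-up sample records covering several document types.
docs = [
    Doc("http://example.com/a.html", "html", "Annual report on salmon stocks"),
    Doc("http://example.com/b.pdf", "pdf", "Salmon stocks technical report"),
    Doc("http://example.com/c.jpg", "image", "salmon photo",
        {"format": "jpeg", "color": "color"}),
    Doc("http://example.com/d.gif", "image", "salmon diagram",
        {"format": "gif", "color": "grayscale"}),
]

everything = search(docs, "salmon")                           # all types mixed
pdfs_only = search(docs, "salmon", doc_type="pdf")            # limit by type
gif_images = search(docs, "salmon", doc_type="image", format="gif")
```

An engine that supports only the first call has the content but not the searchability; the second and third calls correspond to being able to ask for just PDF files or just images in a given file format.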

PLAIN OLD WEB PAGES

Think of the "Web pages" category of retrievable documents as those documents written in HTML, a distinct document type from the searchability perspective. When you enter a term in the main search box of a search engine, you will retrieve pages that have your search term as text somewhere on the page (or did, that is, when the engine crawled the page).
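The "or did" caveat is easier to see in a small model. The Python sketch below is a simplified assumption about how an index relates to the live Web, not a description of any real engine: SnapshotIndex, its methods, and the example.com page are hypothetical. The index records the words on a page at crawl time, so a search reflects that snapshot until the page is crawled again.

```python
from collections import defaultdict


class SnapshotIndex:
    """Toy index that remembers page text as it was when last crawled."""

    def __init__(self):
        self.postings = defaultdict(set)   # word -> set of URLs

    def crawl(self, url, page_text):
        # Drop whatever was recorded for this URL on an earlier crawl.
        for urls in self.postings.values():
            urls.discard(url)
        # Record the words on the page as it looks right now.
        for word in page_text.lower().split():
            self.postings[word].add(url)

    def search(self, term):
        return sorted(self.postings.get(term.lower(), set()))


url = "http://example.com/news.html"       # hypothetical page
live_web = {url: "library budget approved"}

index = SnapshotIndex()
index.crawl(url, live_web[url])

# The page changes after the crawl; the index does not know yet.
live_web[url] = "new reading room opens"
stale_hits = index.search("budget")        # still lists the page

# Only a re-crawl brings the index back in line with the live page.
index.crawl(url, live_web[url])
fresh_hits = index.search("budget")
```

Between the two crawls, a search for "budget" still retrieves a page that no longer contains the word, which is the situation the parenthetical above describes.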

Rather obviously, major Web search engines, which I define as AllTheWeb, AltaVista, Google, HotBot, Lycos, Teoma, and WiseNut, primarily search and retrieve Web pages. (The engines listed are a bit arbitrary but reflect a combination of size and popularity within the professional searcher community.) The next obvious question is "how many" Web pages? Roughly, the numbers are as follows, in size order: [Table of search engine sizes not reproduced in this excerpt.]

On the "very good news" front, it is a relief to now be able to use billions rather than millions as the easiest measure with which to work when talking about Web page content. For a more precise analysis of size, be sure to take a look at Greg Notess' evaluation of search engine database sizes [searchengineshowdown.com]. Greg does an excellent job of analyzing size, taking into account factors such as duplication and dead links.

TIME FRAMES

Information professionals look at both ends of a time frame when evaluating sources: how far back in time a tool goes and how current the content is. For Web pages, the historical part is easy to answer. Web pages in the search engines can go back as far as Web pages go back, generally to somewhere in the early to mid-1990s. On the other hand, it can just as validly be said that search engines attempt to cover only "current" pages, since very little archiving is done. …