Search Engines for Library Web Sites
Beiser, Karl, Library Technology Reports
A search engine enables a user of electronic data resources to quickly locate the specific information desired from within a large volume of mostly unrelated or extraneous information. A variety of software packages have been developed to enable the administrator of a search site to gather and index online resources and to provide a search interface through which users retrieve references to them. Some products are aimed at meeting the prodigious and particular needs of Internet portal sites such as Excite and Yahoo that attempt to index a significant percentage of the Internet. Others are engineered and priced to support more modest objectives, such as indexing the pages within a single Web site for searching by visitors to that site.
Typical search engine needs for a library Internet site fall between these extremes. Usually, the indexing of a library's own site is a given. Often, inclusion of the site of the library's institutional parent organization is also a requirement. But as they develop larger and more ambitious Internet sites, librarians find themselves drawn to creation of specialized search capabilities that immediately reach beyond purely local content. Some libraries may wish to provide comprehensive search access to Internet sites pertaining to a given geographical area or governmental jurisdiction. Others may wish to focus on a particular set of subject areas that parallel the library collection. In some environments, thorough coverage of a collection of specific corporate Web sites or support for other approaches to achieving a level of competitive intelligence may hold special value.
Along the way, it may be necessary to index a wide variety of document formats beyond plain-vanilla HTML (Hypertext Markup Language) files, whether the documents are local to the library site or accessible from a remote server.
This report looks at search engine software of use in this middle ground. Included are products appropriate to creating a searchable index of anywhere from several hundred to several hundred thousand Web pages. Products that cost more than $10,000 in a basic configuration have been excluded as likely to exceed the cost limitations of most libraries. Given the cost considerations that typically attend certain high-end operating systems, only those products that run on Windows NT/98/2000, Linux, or the Macintosh have been covered in depth. A number of major products that fall outside these guidelines are listed in the Products Briefly Noted section.
Technology and Product Overview
What Is a Search Engine?
The term "search engine" carries a variety of overlapping and even contradictory meanings for Internet users, depending on their experience and perspective. Some apply it to any Internet finding aid whatsoever, so that virtually any large list or other tool for identifying an Internet resource by topic, concept, or keyword qualifies. This usage is too broad to serve any practical purpose here. A topically organized system for classifying links to Internet resources may be a valuable finding aid or search guide, but it is not a search engine.
Search engines earn their name by offering keyword searching capability, based on the indexing of text contained within a document, and by delivering a list of references with Web links that match the search words entered by the user. Publishers of Internet directory services may coincidentally also offer search engine access to the Internet at large. They may even offer a search engine that can be constrained to search only the contents of the site's own directory pages. This useful recursion notwithstanding, a search engine is a search engine and a directory is a directory, and the maintenance of a clear view of the distinctions between them will aid clear discussion of the products treated in this report.
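The keyword searching described above can be pictured with a minimal sketch in Python. It is an illustration only, not the method of any product treated in this report. The page addresses and text are invented, the names tokenize, build_index, and search are hypothetical, and the sketch assumes that a match means a page contains every word the user entered.

```python
import re
from collections import defaultdict

# Hypothetical sample pages (URL -> page text), invented for illustration.
PAGES = {
    "http://library.example.org/hours.html": "Library hours and holiday closings",
    "http://library.example.org/genealogy.html": "Genealogy resources and local history",
    "http://library.example.org/history.html": "Local history collection hours",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric keywords."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each keyword to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, query):
    """Return the sorted URLs of pages that contain every query word.

    Assumption: all search words must match (AND logic). Real products
    may rank partial matches instead.
    """
    words = tokenize(query)
    if not words:
        return []
    matches = set(index.get(words[0], set()))
    for word in words[1:]:
        matches &= index.get(word, set())
    return sorted(matches)


INDEX = build_index(PAGES)

if __name__ == "__main__":
    # Print the list of matching links for a sample query.
    for link in search(INDEX, "local history"):
        print(link)
```

A directory, by contrast, would arrange these same three links under headings chosen by an editor. Nothing in it would be derived from the text of the pages themselves.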
A more subtle distinction is called for between "search engine" as the name of a software genre, and "search engine" as a particular retrieval service and the indexed links it contains. …