Peter Jacso is associate professor of library and information science at the department of information and computer sciences at the University of Hawaii. He won the 1998 Louis Shores/Oryx Press Award from ALA's Reference and User Services Association for his discerning database reviews. His e-mail address is

How well do Google's indexing and PageRank features measure up?

Don't look for the word "Google" in your dictionary. It will not be there. Nor is it one of the words that I make up to the dismay of my editors (and my students). It is a derivative of "googol." And what is a googol? It is 10 to the 100th power, i.e., 1 with a hundred zeros after it. It was coined, according to the Merriam-Webster Dictionary, by Milton Sirotta nearly 50 years ago. So what is Google? It is a new search engine developed by talented Stanford University graduates. And anything that comes out of Stanford and has to do with the Internet will make investors and venture capitalists see a googol of money. But can you take it to the bank? I will tell you in this sequel to my previous column about another much-hyped Web search engine, Direct Hit.

What Is Google?

Google is a search engine with a twist. It crawls Web pages, as other search engines do, and indexes them. At this stage it has a relatively small collection of about 30 million Web pages; the biggest collections (Northern Light, AltaVista, and HotBot) each have at least three times as many pages.

It is the indexing that is different and unique, at least for those who have never heard of citation indexing. Each page is assigned a PageRank that is calculated from a) how many other pages refer, or link, to the page, and b) how important the linking pages are. If this idea sounds familiar, that's because it is. The concept is analogous to the one Eugene Garfield developed 30 years ago, and which the Institute for Scientific Information implemented in its citation database and in the Journal Citation Reports (JCR). PageRank is somewhat similar to (although far less sophisticated or scholarly than) the Impact Factor of journals. In other words, the importance of a linking page is itself measured by that page's own PageRank, so the score is defined recursively. It sounds good, but it has some flaws.
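To make the recursive idea concrete, here is a minimal sketch of how such a score could be computed by repeated passes over a link graph. It is an illustration only, not Google's actual implementation: the function name `pagerank`, the four-page toy graph, the damping value of 0.85, the even spreading of score from pages without outgoing links, and the fixed number of passes are all assumptions made for this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively score pages from a dict {page: [pages it links to]}.

    Hypothetical helper for illustration; the damping value and the
    handling of pages without outgoing links are assumptions.
    """
    # Collect every page that appears as a source or as a link target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start with the same score for every page.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its score on, split among the pages it cites.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Toy graph: A, B, and D all cite C; C cites only A.
toy_links = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
scores = pagerank(toy_links)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

In this toy graph C ends up on top because three pages cite it, and A outranks B and D even though only one page cites A, because that one citing page is the highly ranked C. That is the "importance of the linking pages" component at work.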

Searching Google

There is a simple input cell where you enter your search request. It may be a single word (publishing), a combination of words with an implied "and" relationship (Web database publishing), an exact phrase ("Web database publishing"), or two or more words with another word excluded (Web database publishing -Java). Google does not allow an "or" operator, nor does it allow truncation symbols. This means that one needs to formulate two or more queries to retrieve pages that include both the singular and the plural form (database or databases), and the different spellings of both (data base, data bases). This is a surprising and inconvenient limitation in 1999. A simple checkbox next to the query cell could let users enable or disable stemming.
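The workaround described above can be sketched in a few lines. This is a hypothetical helper, not a Google feature: the name `expand_queries` and the idea of listing the variants by hand are assumptions for illustration. It simply builds one implied-"and" query for every combination of word forms, so that each can be submitted separately.

```python
from itertools import product


def expand_queries(variant_groups):
    """Return one implied-AND query string per combination of variants.

    Hypothetical helper: each inner list holds the alternative forms of
    one search term, which the searcher has to enumerate manually.
    """
    return [" ".join(combo) for combo in product(*variant_groups)]


queries = expand_queries([
    ["Web"],
    # Singular, plural, and the two-word spellings as exact phrases.
    ["database", "databases", '"data base"', '"data bases"'],
    ["publishing"],
])
for query in queries:
    print(query)
```

A single term with four spellings already means four separate searches and four result lists to merge by hand; a second term with variants would multiply that number again.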

Entering "Information Today" yields 6,078 hits (Figure 1). It sounds flattering, but the overwhelming majority of hits are pages that include the query term only in strings such as "News Information; Today's Weather," or "Call for information today" (Figure 2). The entries in the result list show the PageRank of the cited page, its URL, an excerpt from the page with the matching term, a hotlink to the cached copy of the page (as it was when indexed), the size of the page, and the number of times the query term occurs on the page.

Clicking on the URL following the PageRank score takes you to the current page, and clicking on the cached hotlink takes you to the page as it was when it was indexed. Clicking on the relevance bar will list the citing pages (Figure 3). I tried the first five but none of them seemed relevant (a science summer camp, a poem by Heinrich Heine in German, a wine gourmand page, etc. …

