Magazine article Information Today

Tools for Unearthing PDF Files

Article excerpt

Google provides a superior Adobe PDF searching experience

There has been much talk about the Invisible Web in the past 12 months. Beyond the databases--even the freely accessible ones--that generate a temporary HTML output in response to a query of a collection of records stored in a proprietary format, Adobe PDF files represent a large domain of the Web that is invisible to traditional search engines, and thus, to users.

The latest innovation by Google will significantly enhance access to worthy sources of information by handling the PDF files that are available to the public. This is an important development as an increasing number of sites include PDF documents, and often without any accompanying HTML file that would at least alert users about the availability of PDF files that are relevant to their research.

But haven't there been search engines that could extract keywords from PDF files? Yes, there have been, but there is a big difference. Those are engines that can be used for your own Web site or intranet, but they are not Web-wide crawlers that are sent out to the Web to collect information about PDF documents. Seemingly, Adobe's Search PDF site came the closest to the service that Google has recently developed, but a closer look at both reveals Google's superiority.

The Adobe Search PDF Site

Adobe teamed up with Alta Vista last year to create a special site (http://searchpdf.adobe.com) that is believed to include information about more than 1 million PDF files. However, this number should be taken with a grain of salt, as all of my test searches brought up many duplicate items. You're not searching the entire text of the PDF files, but only the text of the title, abstract, and keywords as extracted by Adobe from the PDF files.

The extraction is not always successful even from text files, let alone from PowerPoint PPT files or spreadsheet files that were converted to PDF. There are often serious problems in every aspect of the extraction process. The summaries can be very cryptic if you don't know the document (see Figure 1).

The data for the number of pages and the size of the files are often mixed up. For example, Charles Bailey's classic bibliography keeps growing, to our pleasure, but not to the tune of 258,206 pages. And, of course, its size can't be 88 bytes, and it can't be downloaded in 0 seconds. The data are true in reverse: 88 pages and 258,206 bytes. On a 56 Kbps modem it would take some time to download. This is not an exceptional mistake but a very common one at this site.
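The arithmetic behind that swap is easy to check. The short Python sketch below is only an illustration: the function name and the assumption of an ideal 56,000 bits-per-second line with no protocol overhead or compression are mine, not the site's method of estimating download times.

```python
def download_seconds(size_bytes: int, line_rate_bps: int = 56_000) -> float:
    """Ideal transfer time in seconds (assumes no overhead, no compression)."""
    return size_bytes * 8 / line_rate_bps


# Values as displayed by the site for the bibliography (pages and size mixed up).
reported = {"pages": 258_206, "size_bytes": 88}

# The same two numbers with the fields swapped back.
corrected = {"pages": reported["size_bytes"], "size_bytes": reported["pages"]}

# An 88-byte file would indeed round to 0 seconds.
print(round(download_seconds(reported["size_bytes"])))      # 0
# The real 258,206-byte file needs a little over half a minute.
print(round(download_seconds(corrected["size_bytes"]), 1))  # 36.9
```

Under these assumptions the "0 seconds" estimate follows directly from the bogus 88-byte size, while the true file would take roughly 37 seconds at best.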

Also, the file dates are often wrong, and off by about three decades. You'll see a very large number of files purportedly created on the last day of 1969, when there was no such thing as PDF. This suggests that the dates are taken from the file description, not the PDF file itself. I recall that on some older computers, the file-creation date automatically reverted to 1969-12-31 if you did something that Microsoft didn't exactly encourage you to do.
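One plausible explanation, which the site itself does not confirm, is a missing timestamp stored as zero. On systems that count time in seconds from January 1, 1970 (UTC), a value of 0 displays as December 31, 1969 in any time zone west of Greenwich. The Python sketch below shows the effect; the helper name and the UTC-8 offset are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone


def render_local(epoch_seconds: int, utc_offset_hours: int) -> str:
    """Format a Unix timestamp as local time for a fixed UTC offset."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return datetime.fromtimestamp(epoch_seconds, tz).strftime("%Y-%m-%d %H:%M")


# A zero (that is, missing) timestamp shown in UTC and in a US Pacific-style offset.
print(render_local(0, 0))   # 1970-01-01 00:00
print(render_local(0, -8))  # 1969-12-31 16:00
```

If that is what happened here, the "last day of 1969" dates would simply mean that no date was recorded at all.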

Often, it isn't the title of the PDF document (such as Electronic Dissemination of Scholarly Work) that is extracted and displayed as title information, but the title of the journal (see Figure 2) or some other information extracted from the PDF page.

The site's search options are simple but sufficient; however, being taken to the wrong help file isn't exactly professional or user-friendly. The partner for this project was Alta Vista, which has Basic (or Main) Search, Advanced Search, Power Search, and Raging Search options. If you take the advice of the help file, you'll be surprised.

While the help file tells you that the query mona lisa yields the same result as +mona +lisa (see Figure 3), this isn't true (except for the Advanced Search mode, which is not available for Search PDF). The query scholarly electronic publishing finds 15,643 PDF documents. If this seems too good to be true, it is. …
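The gap between the two query forms can be sketched with a toy example. The Python code below assumes that a plain query matches documents containing any of the terms, while the + prefix requires every term. That assumption, the four sample records, and the function names are hypothetical and are not taken from the Search PDF site.

```python
# Hypothetical mini-collection standing in for extracted PDF titles.
docs = {
    "a.pdf": "scholarly electronic publishing bibliography",
    "b.pdf": "electronic circuits handbook",
    "c.pdf": "publishing schedule for the newsletter",
    "d.pdf": "annual budget report",
}


def terms_of(query: str) -> set[str]:
    """Lowercase the query and drop any leading + markers."""
    return {word.lstrip("+").lower() for word in query.split()}


def match_any(query: str) -> list[str]:
    """Documents containing at least one query term (assumed plain-query behavior)."""
    wanted = terms_of(query)
    return sorted(name for name, text in docs.items() if wanted & set(text.split()))


def match_all(query: str) -> list[str]:
    """Documents containing every query term (the +term +term form)."""
    wanted = terms_of(query)
    return sorted(name for name, text in docs.items() if wanted <= set(text.split()))


print(match_any("scholarly electronic publishing"))     # ['a.pdf', 'b.pdf', 'c.pdf']
print(match_all("+scholarly +electronic +publishing"))  # ['a.pdf']
```

Even in this tiny collection the any-term reading triples the hit count, which is the kind of inflation that makes a five-digit result look suspicious.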
