Downloading Data from the Web
Curle, David, Online
Just when you thought you had mastered the art of downloading information from online databases and using a word processor to dress it up into a nicely formatted, tidy report, along comes the World Wide Web and makes a mess of everything again.
For all its value as a source of data of all kinds, the Web differs from traditional online databases in several ways that make downloading and editing its output more difficult. The main differences are discussed first, followed by a step-by-step review of several methods for downloading Web content.
But first, a reminder: Data on the Web may be "free" in the sense that it costs nothing to access and use on most Web sites, but you are not entirely free to reproduce and re-use that data any way you please. Any such re-use must be consistent with applicable copyright law. (See articles by Stephanie C. Ardito and Robert Weiner in recent issues of ONLINE [1,2].)
THE WEB VERSUS TRADITIONAL ONLINE DATABASES
What are the problems that the Web presents that traditional online databases do not, or that we have learned to master?
Organizationally Inconsistent Collections of Documents
An online database consists of a number of records, each of which has a certain consistent structure. Even across various database producers and hosts, there are standard document types: bibliographic records, abstracts, full-text articles, company directory records, etc. That consistency makes downloading and manipulating search results predictable and relatively simple.
By contrast, on the Web you find a wild variety of document types and organizational models. A Web site can be a hierarchically organized group of documents linked from a main menu, or it can be a loosely connected collection without any discernible organizational principles. It can include its own search engine that will return documents from a large collection. A single page of HTML (HyperText Markup Language) can consist of a few lines of text or several megabytes of information. It might link to other documents located within the same Web site or on another Web site hosted on the other side of the world. No two Web sites are alike, which makes extracting information from them something of an adventure.
Multiplicity of Document Structures
The typical online database contains documents with strictly defined data elements such as a title, bibliographic elements, indexing terms, abstract, and text. Text is normally displayed flush left in a single, 80-character wide stream of text. A downloaded file of such documents provides predictable raw material for further processing, editing, and printing.
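To show why a capture file of this kind is predictable raw material, here is a minimal Python sketch. It assumes a hypothetical capture format in which each field begins with a two-letter tag (TI, AU, AB) and a colon, continuation lines are indented, and records are separated by blank lines. Real database hosts define their own tags and layouts, and the sample records are made up for illustration.

```python
import re

# Hypothetical field tags; real database hosts define their own.
TAG_LINE = re.compile(r"^([A-Z]{2}):\s*(.*)$")


def parse_capture(text):
    """Split a captured text stream into a list of {tag: value} records."""
    records = []
    current = {}
    last_tag = None
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current record.
            if current:
                records.append(current)
            current, last_tag = {}, None
            continue
        match = TAG_LINE.match(line)
        if match:
            last_tag = match.group(1)
            current[last_tag] = match.group(2).strip()
        elif last_tag is not None:
            # Indented continuation line: append it to the previous field.
            current[last_tag] += " " + line.strip()
    if current:
        records.append(current)
    return records


# Made-up sample records in the assumed format.
SAMPLE = """TI: First Sample Title
AU: Author, A.
AB: A made-up abstract that runs on
    to a second, indented line.

TI: Second Sample Title
AU: Writer, B.
"""

if __name__ == "__main__":
    for record in parse_capture(SAMPLE):
        print(record["TI"], "/", record["AU"])
```

Because every record follows the same pattern, a few lines of logic are enough to turn the whole file into structured data ready for a report.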
Each separate document in a Web site, on the other hand, consists of a specific file in HTML format. The HTML standard allows the author of each Web page to decide the size of each document, the graphical layout of the document, and organizational elements such as columns, tables, and headings. In short, if everyone is a publisher on the Web, so too is everyone an editor, layout coordinator, and art director, for better or for worse.
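Because each author controls the layout, text taken from a Web page arrives wrapped in markup. The following Python sketch shows one way to reduce a page to plain text using only the standard library's html.parser module. The class name TextExtractor, the function html_to_text, and the sample page are illustrative assumptions; real pages will usually need more careful handling of tables and line breaks.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the visible text of an HTML page, ignoring the tags."""

    # Elements whose contents are not reader-visible text.
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            # Collapse runs of whitespace left over from the page layout.
            self.chunks.append(" ".join(data.split()))


def html_to_text(html):
    """Return the page's text, one line per run of text between tags."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Made-up sample page with a heading, a table, and a paragraph.
SAMPLE_PAGE = """<html><head><title>Sample Page</title>
<style>h1 { color: red; }</style></head>
<body><h1>Quarterly   Report</h1>
<table><tr><td>Sales</td><td>100</td></tr></table>
<p>Text with an &amp; entity.</p></body></html>"""

if __name__ == "__main__":
    print(html_to_text(SAMPLE_PAGE))
```

Note that the table cells come out as separate lines: the layout the author chose is lost, which is exactly the kind of cleanup that Web output demands and a flush-left database record does not.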
In a traditional online search session, the user constructs a search that results in retrieval of one or more records from a finite database. Depending on the host system, the searcher usually has a command available that will display all of those records in a single stream of text data that can be captured and sent to a single capture file.
On the Web, however, you need to "visit" each document separately and download the documents to disk one at a time; you can't simply enter a command to download an entire Web site or all of the documents that your Web search engine has retrieved. "Searching the Web" is really a misnomer; "Searching and Browsing the Web" is a more accurate description of the process. A large percentage of the data you encounter in a Web "search" might be irrelevant. …
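The one-document-at-a-time process can be sketched as a loop. This Python example assumes you have already browsed your way to a list of page addresses; a script can automate the visits, but it still retrieves and saves each document separately. The function names fetch_url and save_pages, the file-naming scheme, and the replaceable fetch parameter are illustrative choices, not part of any tool described in this article. Retrieval uses the standard library's urllib.request.urlopen.

```python
import os
import urllib.request


def fetch_url(url):
    """Retrieve one document and return its raw bytes."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def save_pages(urls, out_dir, fetch=fetch_url):
    """Visit each URL in turn and write each page to its own file.

    Returns the list of file paths written, in the same order as urls.
    The fetch argument can be swapped for another retrieval function.
    """
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for number, url in enumerate(urls, start=1):
        data = fetch(url)
        # Hypothetical naming scheme: page001.html, page002.html, ...
        path = os.path.join(out_dir, f"page{number:03d}.html")
        with open(path, "wb") as handle:
            handle.write(data)
        saved.append(path)
    return saved
```

For example, calling save_pages(["https://example.com/"], "downloaded") would write that one page to downloaded/page001.html (example.com is a reserved placeholder domain). Each saved file still has to be reviewed for relevance afterward.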