Text, more than images, reveals the spectrums of cost, complexity, quality, and effort involved in digitization. The number of input and output formats to manage in this category is large; the number and variety of source items deserving of digitization are staggering.
As used here, text refers to any source item composed of written pages. The support medium for each page may be paper, parchment, vellum, photostat, or any photographic format. All types of writing or printing are included in this category, including handwriting. Text sources may be in any language; some may be printed in multiple languages.
Books and other text sources may be bound or unbound. They may also contain images and other nontext components (such as covers and endpapers). These multimedia sources also fall into the text digitization category.
Digitized text refers to three major genres of machine-readable data, each with spectrums of quality to achieve in production:
* Page images--digital images of each page (not searchable)
* Text or hidden text--plain text (ASCII), either keyed (transcribed) from each page, or generated from page images via optical character recognition (OCR) to yield alphanumeric data for indexing and searching, and sometimes for display
* Encoded text--text with descriptive markup (SGML, XML, HTML) to support multiple uses across applications (including navigation among different parts or features of multipage documents); transcribed or generated via OCR, frequently with correction, since encoded text is often displayed
In many library digitization workflows, both page images and either hidden or encoded text are produced as a cost-effective approach to preservation and access. Page images convey original layout and appearance; text facilitates keyword searching.
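The relationship among the three genres can be illustrated with a minimal Python sketch. It assumes a hypothetical `Page` record that pairs each page image with its hidden text; the element names `document` and `page`, the attribute names, and the file names are illustrative only and do not follow any particular encoding standard.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List


@dataclass
class Page:
    sequence: int    # position of the page within the digitized object
    image_file: str  # page image file (hypothetical name)
    text: str        # hidden text, keyed or generated via OCR


def search(pages: List[Page], term: str) -> List[int]:
    """Return the sequence numbers of pages whose hidden text contains the term."""
    needle = term.lower()
    ordered = sorted(pages, key=lambda p: p.sequence)
    return [p.sequence for p in ordered if needle in p.text.lower()]


def encode_pages(title: str, pages: List[Page]) -> str:
    """Wrap per-page text in simple descriptive markup (illustrative element names)."""
    root = ET.Element("document", {"title": title})
    for page in sorted(pages, key=lambda p: p.sequence):
        attrs = {"n": str(page.sequence), "image": page.image_file}
        element = ET.SubElement(root, "page", attrs)
        element.text = page.text
    return ET.tostring(root, encoding="unicode")


pages = [
    Page(2, "00000002.tif", "Chapter One"),
    Page(1, "00000001.tif", "Title page"),
]
hits = search(pages, "chapter")          # pages to show the reader as page images
encoded = encode_pages("Sample", pages)  # encoded text linking each page to its image
```

In this sketch the hidden text serves only to locate pages, while the page image remains what the reader sees; the encoded form carries the same text plus enough structure to navigate from page to page.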
This chapter describes the program components necessary to produce discoverable, sustainable, and usable collections of digitized text for legacy collections of books, serials, manuscripts, archives, and other multipage works, as well as single-page source material--from small printed ephemera to oversize maps--whose meaningful content is primarily text, or text and line art.
Assuming that a library has already made appropriate program investments in digital library technology (Chapter 1) and digitization program management (Chapter 2), the baseline level of service for text digitization encompasses the staff, systems, and procedures necessary to manage all production tasks--from selection to delivery--for digitization projects.
Baseline text digitization services have the capacity to create page image or text products, with associated descriptive, structural, and administrative metadata that meet the following criteria:
* The digital reproduction is appropriately cataloged and discoverable, and the descriptive metadata are stored in a well-supported system
* The digital reproduction work can be opened and rendered as a properly sequenced, navigable multipage object
* The digital reproduction is appropriately named to be identified by some type of inventory control mechanism (ranging from a printed list to a complex database)
* All files corresponding to each page are appropriately structured, stored, identified, and documented (with administrative metadata) for ongoing management
* The copy can be reliably delivered by the library's (or partner organization's) designated applications for networked access
Fulfillment of these minimum criteria--whether measured against local or community definitions of what is appropriate or good (see, for example, NISO's Framework of Guidance for Building Good Digital Collections)--is presumed to offer the potential for sustainability.
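As a rough illustration, the baseline criteria can be expressed as a simple checklist over a record describing one digitized item. The sketch below is written in Python under stated assumptions: the `DigitalObject` and `PageFile` structures, their field names, and the wording of each reported problem are hypothetical, and a real program would define "appropriate" against its own local or community standards.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PageFile:
    sequence: int                 # intended position in the multipage object
    file_name: str                # hypothetical file name
    admin_metadata: Dict[str, str] = field(default_factory=dict)


@dataclass
class DigitalObject:
    identifier: str               # name used for inventory control
    catalog_record: str           # pointer to the descriptive metadata
    pages: List[PageFile]
    delivery_url: str = ""        # where the delivery application serves the copy


def unmet_baseline_criteria(obj: DigitalObject) -> List[str]:
    """Return a list of baseline criteria that the object does not yet meet."""
    problems = []
    if not obj.catalog_record:
        problems.append("not cataloged")
    if not obj.identifier:
        problems.append("no inventory identifier")
    if not obj.pages:
        problems.append("no page files")
    # Pages should form an unbroken sequence 1..n to render as a navigable object.
    sequences = sorted(p.sequence for p in obj.pages)
    if sequences != list(range(1, len(sequences) + 1)):
        problems.append("pages not properly sequenced")
    if any(not p.admin_metadata for p in obj.pages):
        problems.append("page files lack administrative metadata")
    if not obj.delivery_url:
        problems.append("no delivery location")
    return problems
```

An empty result means only that the recorded information is complete; it says nothing about image or text quality, which the levels of service above the baseline address.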
Levels of service above the baseline are required for any project that explicitly states requirements for pictorial quality in page images or that mandates production of text of high enough quality to be displayed. …