Progress Report: The British Library and Microsoft Digitization Partnership

Microsoft made it clear that it wasn't going to let Google tackle mass book digitization exclusively when it announced a partnership with The British Library (BL) in November 2005.

The BL/Microsoft project is designed to digitize 25 million pages of 100,000 out-of-copyright titles from the BL collection related to 19th-century literature. Access will be provided via Microsoft's Live Search Books site (http://books.live.com) and the BL's Web site (www.bl.uk). Live Search Books now includes many partners: The University of California Libraries, Cornell University Library, the University of Toronto Library, The New York Public Library, and the American Museum of Veterinary Medicine have all joined, as well as more than 50 publishers.

Wider Access to Lesser-Known Authors

Kristian Jensen, head of British Early Printed Collections, reviewed the selection process. Unlike previous BL digitization projects where material had been selected on an item-by-item basis, the sheer size of this project made such selectivity impossible. Instead, the focus is on English-language material, collected by the BL during the 19th century. Jensen compared the process to mass microfilming. "Nonselectivity widens access," he said.

Being less selective creates certain advantages. First, it lessens the domination of the well-known author, or the high profile enjoyed by the "already famous." The works of virtually unknown writers will be brought to the attention of scholars as easily as material by Charles Dickens. The collection is being processed according to the same classifications used at the time of original acquisition, so classes that are unusual today, such as "19th-century female poets," become accessible as research areas. The benefits of looking at the literature as it would have been available at the time will be welcomed by educationalists, and delicate literary items will also benefit from not being overhandled in the future.

Another benefit of the selection process is that entire shelf runs can be taken for scanning at one time. After trying a couple of pilot runs to assess quality standards, Microsoft and the BL chose CCS (Content Conversion Specialists) as the scanning contractors. Richard Helle, CCS managing director, provided a tour of the digitization studio at the BL's press event in September.

The target is to scan 50,000 pages per day with a 2-year timetable for completion. However, none of the valuable material is subjected to any risk with such fast output. Helle emphasized that these "treasures" were being scanned nondestructively, and all staff involved had received careful training. Book movement pilots were run in advance to determine the amount of staff time required during the full process, from selection and retrieval to delivery, scanning, and reshelving.
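
As a rough check on these figures, the sketch below works out how many scanning days the stated daily target implies for the 25 million pages mentioned earlier. It assumes a constant daily rate and a 730-day timetable (two years, ignoring leap days); the function and constant names are illustrative only and are not part of the project.

```python
import math

# Figures taken from the article.
TOTAL_PAGES = 25_000_000   # pages to be digitized
TOTAL_TITLES = 100_000     # out-of-copyright titles
PAGES_PER_DAY = 50_000     # stated daily scanning target

# Assumption: a 2-year timetable counted as 730 calendar days.
TIMETABLE_DAYS = 2 * 365


def scanning_days(total_pages: int, pages_per_day: int) -> int:
    """Whole days of scanning needed at a constant daily rate."""
    return math.ceil(total_pages / pages_per_day)


def average_pages_per_title(total_pages: int, total_titles: int) -> float:
    """Mean number of pages per title."""
    return total_pages / total_titles


days = scanning_days(TOTAL_PAGES, PAGES_PER_DAY)
print(days)                                                # 500
print(average_pages_per_title(TOTAL_PAGES, TOTAL_TITLES))  # 250.0
print(round(days / TIMETABLE_DAYS, 2))                     # 0.68
```

Under these assumptions, hitting the target on roughly two days out of three would be enough to finish within the timetable, which leaves some room for setup, maintenance, and book handling.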

Scanning and OCR Conversion

Maximum and minimum book sizes have been set for what can be scanned at present, which excludes about 20 percent of the relevant collection from digitization. Everything is tracked with bar codes, and the condition of each book is checked to ensure that it can stand up to the scanning process. Four Kirtas Technologies BookScan machines are now being used.

These provide semiautomated scanning with an operator in place to ensure that all pages are turned accurately, to preview the quality of the images, and to adjust color settings that can vary with temperature. A separate scanner is used to handle books with fold-out pages.
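
To give a sense of the pace this implies for each machine, the following sketch divides the 50,000-page daily target across the four BookScan machines. It assumes the load is split evenly, that the machines run around the clock with no downtime, and that the separate fold-out scanner is left out of the count; the function name is hypothetical.

```python
def pages_per_machine_hour(daily_target: int, machines: int,
                           hours_per_day: float = 24.0) -> float:
    """Average pages each machine must scan per hour to meet the daily target."""
    if machines <= 0 or hours_per_day <= 0:
        raise ValueError("machines and hours_per_day must be positive")
    return daily_target / machines / hours_per_day


rate = pages_per_machine_hour(50_000, 4)
print(round(rate))            # 521 pages per machine per hour
print(round(3600 / rate, 1))  # 6.9 seconds per page
```

Any downtime or shorter operating day raises the required hourly rate; for example, passing hours_per_day=12 doubles it.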

Scanning produces high-resolution images (300 dpi) that are then transferred to a suite of 12 computers for OCR (optical character recognition) conversion. The scanners, which run 24/7, are specially tuned to deal with the spelling variations and old-fashioned typefaces used in the 1800s. The process creates multiple versions including PDFs and OCR text for display in the online services, as well as an open XML file for long-term storage and potential conversion to any new formats that may become future standards. …