Academic journal article Information Technology and Libraries

Digitization of Text Documents Using PDF/A

Article excerpt


The purpose of this article is to demonstrate a practical use case of PDF/A for digitization of text documents, following FADGI's recommendation of PDF/A as a preferred digitization file format. The authors demonstrate how to convert and combine TIFFs with their associated metadata into a single PDF/A-2b file for a document. Using real-life examples and open source software, the authors show readers how to convert TIFF images, extract associated metadata and International Color Consortium (ICC) profiles, and validate the resulting file with the newly released PDF/A validator. The generated PDF/A file is a self-contained and self-described container that accommodates all the data from digitization of textual materials, including page-level metadata and ICC profiles. Providing theoretical analysis and empirical examples, the authors show that PDF/A has many advantages over the traditionally preferred file formats, TIFF and JPEG2000, for digitization of text documents.
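To make the ICC extraction step concrete, the following is a minimal sketch and not the authors' workflow, which relies on existing open source software. It assumes a classic (non-BigTIFF) file whose embedded ICC profile is stored in TIFF tag 34675 of the first image file directory. The function names `read_first_ifd` and `extract_icc_profile` are hypothetical and chosen for illustration only.

```python
import struct

ICC_PROFILE_TAG = 34675  # TIFF tag that carries an embedded ICC profile

# Byte size of one value for each TIFF field type (1=BYTE ... 12=DOUBLE).
TYPE_SIZES = {1: 1, 2: 1, 3: 2, 4: 4, 5: 8, 6: 1,
              7: 1, 8: 2, 9: 4, 10: 8, 11: 4, 12: 8}


def read_first_ifd(data: bytes) -> dict:
    """Return {tag: raw value bytes} for the first IFD of a classic TIFF."""
    if data[:2] == b"II":
        endian = "<"   # little-endian byte order
    elif data[:2] == b"MM":
        endian = ">"   # big-endian byte order
    else:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF (BigTIFF is not handled)")
    (count,) = struct.unpack_from(endian + "H", data, ifd_offset)
    tags = {}
    for i in range(count):
        entry = ifd_offset + 2 + 12 * i          # each IFD entry is 12 bytes
        tag, ftype, n = struct.unpack_from(endian + "HHI", data, entry)
        size = TYPE_SIZES.get(ftype, 1) * n
        if size <= 4:
            start = entry + 8                    # value stored inline
        else:
            (start,) = struct.unpack_from(endian + "I", data, entry + 8)
        tags[tag] = data[start:start + size]
    return tags


def extract_icc_profile(data: bytes):
    """Return the embedded ICC profile bytes, or None if the tag is absent."""
    return read_first_ifd(data).get(ICC_PROFILE_TAG)
```

In practice, the bytes returned for a scanned page could be written to an `.icc` file and carried along with that page's metadata. A production workflow would more likely use an established imaging library than a hand-written parser.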


PDF has been used primarily as a file delivery format across many platforms and on almost every device since its initial release in 1993. PDF/A was designed to address concerns about long-term preservation of PDF files, but there has been little research on this file format and few implementations of it. Since the first standard (ISO 19005 PDF/A-1) was published in 2005, some articles have discussed the PDF/A family of standards, relevant information, and how to implement PDF/A for born-digital documents. (1)

Interest in the PDF and PDF/A standards has grown since both the US Library of Congress and the National Archives and Records Administration (NARA) joined the PDF Association in 2017. NARA joined the PDF Association because PDF files are used as electronic documents in every government and business agency. As explained in a blog post, the Library of Congress joined the PDF Association because of the benefits to libraries, including participating in the development of PDF standards, promoting best-practice use of PDF, and gaining access to global expertise in PDF technology. (2)

Few articles, if any, have been published about using this file format for preservation of digitized content. Yan Han published a related article in 2015 about theoretical research on using PDF/A for text documents. (3) In that article, Han discussed the shortcomings of the widely used TIFF and JPEG2000 as master preservation file formats and proposed using the then-emerging PDF/A as the preferred file format for digitization of text documents. Han further analyzed the requirements of digitization of text documents and discussed the advantages of PDF/A over TIFF and JPEG2000. These benefits include platform independence, smaller file size, better compression algorithms, and metadata encoding. In addition, the file format reduces workload and simplifies postdigitization processing such as quality control, adding and updating missing pages, and creating new metadata and OCR data for discovery and digital preservation. As a result, PDF/A can be used in every phase of a digital object's life in an Open Archival Information System (OAIS), for example as a Submission Information Package (SIP), Archival Information Package (AIP), or Dissemination Information Package (DIP). In summary, a PDF/A file can be a structured, self-contained, and self-described container allowing a simpler one-to-one relationship between an original physical document and its digital surrogate.

In September 2016, the Federal Agencies Digital Guidelines Initiative (FADGI) released its latest guidelines for digitization related to raster images: Technical Guidelines for Digitizing Heritage Materials. (4) Regarded as the de facto best practices for digitization, these guidelines provide guidance to federal agencies and have been used in many cultural heritage institutions. Both the PDF Association and the authors welcomed the recognition of PDF/A as the preferred master file format for digitization of text documents such as unbound documents, bound volumes, and newspapers. …
