Academic journal article Information Technology and Libraries

The Efficient Storage of Text Documents in Digital Libraries

Article excerpt

In this paper we investigate the possibility of improving the efficiency of data compression, and thus reducing storage requirements, for seven widely used text document formats. We propose an open-source text compression software library, featuring an advanced word-substitution scheme with static and semidynamic word dictionaries. The empirical results show an average storage space reduction as high as 78 percent compared to uncompressed documents, and as high as 30 percent compared to documents compressed with the free compression software gzip.


The continuing rapid growth of global information volume can hardly fail to affect digital libraries. (1) Growth in the volume of stored information means growth in storage requirements, which poses a problem in both technological and economic terms. Fortunately, digital libraries' hunger for resources can be tamed with data compression. (2)

The primary motivation for our research was to limit the data storage requirements of the student thesis electronic archive at the Institute of Information Technology in Management at the University of Szczecin. The current regulations state that every thesis should be submitted in both printed and electronic form. The latter facilitates automated processing of the documents for purposes such as plagiarism detection or statistical language analysis. With the introduction of the three-cycle higher education system (bachelor/master/doctorate), several hundred theses are added to the archive every year.

Although students are asked to submit Microsoft Word-compatible documents such as DOC, DOCX, and RTF, other popular formats such as TeX script (TEX), HTML, PS, and PDF are also accepted. This applies both to the main thesis document, which contains the thesis and any appendixes included in the printed version, and to the additional appendixes, which comprise materials left out of the printed version (such as detailed data tables, the full source code of programs, and program manuals). Some of the appendixes may be multimedia, in formats such as PNG, JPEG, or MPEG. (3) Note that this paper deals with text-document compression only. Although individual text documents are often significantly smaller than individual multimedia objects, their collective volume is large enough to make the compression effort worthwhile. The reason for focusing on text-document compression is that most multimedia formats have efficient compression schemes embedded, whereas text document formats usually either are uncompressed or use schemes far less efficient than the current state of the art in text compression.

Although the student thesis electronic archive was our motivation, we propose a solution that can be applied to any digital library containing text documents. As the recent survey by Kahl and Williams revealed, 57.5 percent of the 1,117 digital library projects examined consisted of text content, so there are numerous libraries that could benefit from implementing the proposed scheme. (4)

In this paper, we describe a state-of-the-art approach to text-document compression and present an open-source software library implementing the scheme that can be freely used in digital library projects.

In the case of text documents, improvement in compression effectiveness may be obtained in two ways: with or without regard to their format. The more nontextual content in a document (e.g., formatting instructions, structure description, or embedded images), the more it requires format-specific processing to improve its compression ratio. This is because most document formats have their own ways of describing their formatting, structure, and nontextual inclusions (plain text files have no inclusions).

For this reason, we have developed a compound scheme that consists of several subschemes that can be turned on and off or run with different parameters. …
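
As a rough illustration of the word-substitution idea mentioned in the abstract, the following Python sketch builds a per-document word dictionary in a first pass, replaces frequent words with single-character codes, and hands the result to a general-purpose compressor. It is not the authors' library or algorithm. The function names, the use of Unicode Private Use Area code points as word codes, the length and frequency thresholds, and the choice of zlib as the back-end compressor are all assumptions made for this example. "Semidynamic" is read here in the usual sense of a dictionary derived from the document itself and stored alongside the encoded text; a static dictionary would instead be fixed in advance and shared by all documents.

```python
import re
import zlib
from collections import Counter

# Assumption for this sketch: word codes are taken from the Unicode
# Private Use Area (U+E000..U+F8FF), so the input must not contain them.
PUA_START = 0xE000
PUA_SIZE = 6400


def build_dictionary(text, min_len=4, min_count=2):
    """First pass: collect repeated words, most frequent first."""
    counts = Counter(re.findall(r"\w+", text))
    words = [w for w, c in counts.most_common()
             if len(w) >= min_len and c >= min_count]
    return words[:PUA_SIZE]


def substitute(text, words):
    """Replace every dictionary word with its one-character code."""
    if any(PUA_START <= ord(ch) < PUA_START + PUA_SIZE for ch in text):
        raise ValueError("input already contains private-use characters")
    code = {w: chr(PUA_START + i) for i, w in enumerate(words)}
    # Alternating word / non-word tokens; joining them restores the text.
    tokens = re.findall(r"\w+|\W+", text)
    return "".join(code.get(t, t) for t in tokens)


def restore(encoded, words):
    """Inverse of substitute(): expand each code back into its word."""
    out = []
    for ch in encoded:
        i = ord(ch) - PUA_START
        out.append(words[i] if 0 <= i < len(words) else ch)
    return "".join(out)


def compress(text):
    """Store the dictionary as a header, then compress header + body."""
    words = build_dictionary(text)
    payload = "\n".join(words) + "\0" + substitute(text, words)
    return zlib.compress(payload.encode("utf-8"), 9)


def decompress(blob):
    payload = zlib.decompress(blob).decode("utf-8")
    header, encoded = payload.split("\0", 1)
    words = header.split("\n") if header else []
    return restore(encoded, words)
```

A call such as `decompress(compress(text)) == text` checks that the transformation is lossless. Whether the substitution step actually improves on plain zlib depends on the document and on how the codes are chosen; the sketch makes no claim about reproducing the savings reported above.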
