Academic journal article Information Technology and Libraries

Automatic Extraction of Figures from Scientific Publications in High-Energy Physics

Article excerpt


Plots and figures play an important role in the process of understanding a scientific publication, providing overviews of large amounts of data or ideas that are difficult to present intuitively using text alone. State-of-the-art digital libraries, which serve as gateways to knowledge encoded in scholarly writings, do not yet take full advantage of the graphical content of documents. Enabling machines to automatically unlock the meaning of scientific illustrations would allow immense improvements in the way scientists work and the way knowledge is processed. In this paper, we present a novel solution for the initial problem of processing graphical content: obtaining figures from scholarly publications stored in PDF. Our method relies on vector properties of documents and, as such, does not introduce additional errors, unlike methods based on raster image processing. Emphasis has been placed on correctly processing documents in high-energy physics. The described approach distinguishes different classes of objects appearing in PDF documents and uses spatial clustering techniques to group objects into larger logical entities. A number of heuristics allow the rejection of incorrect figure candidates and the extraction of different types of metadata.
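The spatial clustering idea mentioned above can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes that each graphical object has already been reduced to an axis-aligned bounding box, and it simply groups boxes whose gap along both axes is at most a chosen margin. The names `Box`, `boxes_are_close`, `cluster_boxes`, and the `margin` parameter are hypothetical and introduced only for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: (x0, y0, x1, y1) with x0 <= x1 and y0 <= y1.
Box = Tuple[float, float, float, float]


def boxes_are_close(a: Box, b: Box, margin: float) -> bool:
    """Return True if the gap between a and b is at most `margin` on both axes."""
    return (
        a[0] - margin <= b[2]
        and b[0] - margin <= a[2]
        and a[1] - margin <= b[3]
        and b[1] - margin <= a[3]
    )


def cluster_boxes(boxes: List[Box], margin: float) -> List[Box]:
    """Group nearby boxes and return one enclosing box per group.

    Uses a union-find structure over the input boxes. This is a single
    pass over the original boxes; a fuller implementation might repeat
    the merge until the enclosing boxes stop changing.
    """
    parent = list(range(len(boxes)))

    def find(i: int) -> int:
        # Follow parent links to the group representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair of boxes that lie within the margin of each other.
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if boxes_are_close(boxes[i], boxes[j], margin):
                parent[find(i)] = find(j)

    # Compute the enclosing box of each group.
    merged: Dict[int, Box] = {}
    for i, (x0, y0, x1, y1) in enumerate(boxes):
        root = find(i)
        if root in merged:
            mx0, my0, mx1, my1 = merged[root]
            merged[root] = (min(mx0, x0), min(my0, y0), max(mx1, x1), max(my1, y1))
        else:
            merged[root] = (x0, y0, x1, y1)
    return sorted(merged.values())


if __name__ == "__main__":
    # Two nearby boxes and one distant box (illustrative coordinates only).
    sample = [(0, 0, 10, 10), (12, 0, 20, 10), (100, 100, 110, 110)]
    print(cluster_boxes(sample, margin=5))
```

In this sketch, each resulting enclosing box would be a figure candidate; rejecting false candidates (for example, regions that are too small) would be a separate step and is not shown here.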


Notwithstanding the technological advances of large-scale digital libraries and novel technologies to package, store, and exchange scientific information, scientists' communication patterns have changed little in the past few decades, if not the past few centuries. The key information of scientific articles is still packaged in the form of text and, for several scientific disciplines, in the form of figures.

New semantic text-mining technologies are unlocking the information in scientific discourse, and there exist some remarkable examples of attempts to extract figures from scientific publications. (1) However, current attempts do not provide a sufficient level of generality to deal with figures from high-energy physics (HEP) and cannot be applied in a digital library like INSPIRE, which is our main point of interest. Scholarly publications in HEP tend to contain highly specific types of figures (understood here as any type of graphical content illustrating the text and referenced from it). In particular, they contain a high volume of plots, which are line-art images illustrating a dependency of a certain quantity on a parameter.

The graphical content of scholarly publications allows much more efficient access to the most important results presented in a publication. (2,3) The human brain perceives the graphical content much faster than reading an equivalent block of text. Presenting figures with the publication summary when displaying search results would allow more accurate assessment of the article content and in turn lead to a better use of researchers' time. Enabling users to search for figures describing similar quantities or phenomena could become a very powerful tool for finding publications describing similar results. Combined with additional metadata, it could provide knowledge about evolution of certain measurements or ideas over time.

These and many more applications created an incentive to research possible ways to integrate figures in INSPIRE. INSPIRE is a digital library for HEP, (4) the application field of this work. It provides a large-scale digital library service (1 million records, fifty thousand users), which is starting to explore new mechanisms of using figures in articles of the field to index, retrieve, and present information. (5,6) As a first step, direct access to graphical content before accessing the text of a publication can be provided. Second, a description of graphics ("blue-band plot," "the yellow shape region") could be used in addition to metadata or full-text queries to retrieve a piece of information. Finally, articles could be aggregated into clusters containing the same or similar plots in a possible alternative automated answer to a standing issue in information management. …
