Magazine article Information Today

Data Mining in the Humanities and Social Sciences

Article excerpt

We are all aware that the past 20 years have seen an explosion in the amount of research material and primary sources found online. However, this revolution in the availability of information naturally presents problems, not only for a range of researchers, but also for librarians as guardians and promoters of its use. How does one get the most value out of what has been collectively termed Big Data?

Data mining, or text analysis, as it is often referred to, is a concept with growing prominence that is closely associated with Big Data. Most folks in the industry have heard about it, but it is a term that can mean different things to different people.

The core concept is that computer software applies automated analytical techniques to interrogate datasets for patterns, trends, and other useful information. Carried out by hand, this kind of analysis would typically be incredibly labor-intensive to complete, or difficult even to conceive, with traditional human research.

One of the key assumptions therein is that the data is available to the software to carry this out. Indeed, the more controversial aspects of data mining revolve around the practicalities and responsibilities related to making information available to the software as well as the software "mining" that data from sources in order to analyze it. However, before addressing this in more detail, let's reflect first on the exciting possibilities that text analysis can bring.

The Benefits

Data mining has the potential to empower a new breed of humanities scholar and to change how he or she approaches research. This can range from the relatively complex, such as software that can recognize syntax to analyze literary composition, to the simpler illustration of new pathways and associations.

For example, traditional online searching of a well-cataloged criminal court record archive would return thousands of cases containing a certain crime. Hits are highlighted, but that is a matter primarily of discovery. Where does the undergraduate go from there?

Imagine if text analysis had been harnessed to immediately present all the crimes tried in order of frequency. One could focus on how conviction or sentencing rates changed based on gender, occupation, location, or social class. How did these trends change over time, and do they reflect key social and economic events such as economic depressions or a demobilized army? This type of text analysis is an end in itself, but it also brings the material to life.
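To make the idea concrete, here is a minimal sketch in Python of the two questions posed above: which crimes were tried most often, and how conviction rates differed by a chosen attribute. Everything in it is an assumption for illustration: the records are invented, the field names (`year`, `crime`, `gender`, `verdict`) are hypothetical, and a real archive would first need its text accurately transcribed and tagged before it could be counted this way.

```python
from collections import Counter

# Hypothetical, hand-made records for illustration only; not real court data.
records = [
    {"year": 1820, "crime": "theft",   "gender": "male",   "verdict": "guilty"},
    {"year": 1820, "crime": "theft",   "gender": "female", "verdict": "guilty"},
    {"year": 1821, "crime": "assault", "gender": "male",   "verdict": "not guilty"},
    {"year": 1821, "crime": "theft",   "gender": "male",   "verdict": "guilty"},
    {"year": 1822, "crime": "forgery", "gender": "female", "verdict": "not guilty"},
    {"year": 1822, "crime": "assault", "gender": "male",   "verdict": "guilty"},
]


def crimes_by_frequency(rows):
    """Return (crime, count) pairs, most frequent first."""
    return Counter(row["crime"] for row in rows).most_common()


def conviction_rate_by(rows, field):
    """Return the share of guilty verdicts for each value of `field`."""
    totals = Counter()
    guilty = Counter()
    for row in rows:
        totals[row[field]] += 1
        if row["verdict"] == "guilty":
            guilty[row[field]] += 1
    return {value: guilty[value] / totals[value] for value in totals}


if __name__ == "__main__":
    print(crimes_by_frequency(records))
    print(conviction_rate_by(records, "gender"))
    print(conviction_rate_by(records, "year"))
```

Passing a different field name, such as `"year"`, gives the change over time that the paragraph above asks about. The same grouping would work for occupation, location, or social class if the records carried those tags.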

The consistency and quality of the data are of massive importance. If you are reliant on electronic excavation of information and trends, you want the basics to be correct in order to avoid erroneous results. Full-text accuracy is critical. Tagging text and data with useful identifiers can also become invaluable when applied at scale.

Collaboration of skills in the sector is important to get the most value out of the data. There are many who are already heavily involved in the digital humanities, but software development and manipulation may not be the natural home of the arts scholar. However, this should in no way deter using text analysis. The crossover already exists in many areas, not least in the creation of online collections that have already been produced, and this relationship should be expanded.

Copyright Legislation

An Association of Research Libraries (ARL) issue brief (bit. …
