Journal of National Security Law & Policy

The Democratization of Big Data

Article excerpt

In recent years, it has become common for discussions about managing and analyzing information to reference "data scientists" using "the cloud" to analyze "big data." Indeed, these terms have become so ubiquitous in discussions of data processing that they appear in popular comic strips like Dilbert and are tracked on Gartner's Hype Cycle.1 The Harvard Business Review even labeled data scientist "the sexiest job of the 21st century."2 The goal of this paper is to demystify these terms and, in doing so, provide a sound technical basis for exploring the policy challenges of analyzing large stores of information for national security purposes.

It is worth beginning by proposing working definitions for these terms before exploring them in more detail. One can spend much time and effort developing firm definitions; it took the National Institute of Standards and Technology several years and sixteen versions to build consensus around the definition of cloud computing in NIST Special Publication 800-145.3 The purpose here is simply to provide definitions that will be useful in furthering discussions of policy implications.

Rather than defining big data in absolute terms (a task made nearly impossible by the rapid pace of advancements in computing technologies), one can define big data as a collection of data so large that it exceeds one's capacity to process it in an acceptable amount of time with available tools. This difficulty in processing can be a result of the data's volume (e.g., its size as measured in petabytes4), its velocity (e.g., the number of new data elements added each second), or its variety (e.g., the mix of different types of data, including structured and unstructured text, images, and videos).5
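
To make this working definition concrete, the short Python sketch below expresses its relative character: the same collection may or may not be "big data" depending on the capacity of the tools at hand. The datasets, tools, and thresholds in the sketch are entirely hypothetical illustrations, not figures drawn from the sources cited in this article.

    # A minimal sketch (hypothetical thresholds) of the working definition above:
    # a collection is "big data" when its volume, velocity, or variety exceeds
    # what the available tools can process in an acceptable amount of time.

    from dataclasses import dataclass

    @dataclass
    class Dataset:
        volume_petabytes: float      # total size (volume)
        new_records_per_second: int  # arrival rate (velocity)
        formats: set                 # e.g. {"text", "image", "video"} (variety)

    @dataclass
    class ToolCapacity:
        max_petabytes: float         # most the tools can scan in acceptable time
        max_records_per_second: int  # most the tools can ingest as data arrives
        supported_formats: set       # formats the tools can parse and query

    def is_big_data(data: Dataset, tools: ToolCapacity) -> bool:
        """True if the data outstrips the available tools on any of the three Vs."""
        return (
            data.volume_petabytes > tools.max_petabytes
            or data.new_records_per_second > tools.max_records_per_second
            or not data.formats <= tools.supported_formats
        )

    # Illustrative (made-up) numbers only:
    video_archive = Dataset(30.0, 50_000, {"video", "text"})
    desktop_tools = ToolCapacity(0.001, 1_000, {"text"})
    print(is_big_data(video_archive, desktop_tools))  # True

Because the capacity thresholds shift as tools improve, the same collection may cease to be "big data" over time, which is why the definition resists being stated in absolute terms.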

Examples abound in the commercial and scientific arenas of systems managing massive quantities of data. YouTube users upload over one hundred hours of video every minute,6 Wal-Mart processes more than one million transactions each hour, and Facebook stores, accesses, and analyzes more than thirty petabytes of user-generated data.7 In scientific applications, the Large Hadron Collider generates more than fifteen petabytes of data annually, which are analyzed in the search for new subatomic particles.8 Looking out into space rather than inward into particles, the Sloan Digital Sky Survey mapped more than a quarter of the sky, gathering measurements for more than 500 million stars and galaxies.9

In the national security environment, increasingly high-quality video and photo sensors on unmanned aerial vehicles (UAVs) are generating massive quantities of imagery for analysts to sift through to find and analyze targets. For homeland security, the recent Boston Marathon bombing investigation demonstrated both the challenge and the potential utility of being able to quickly sift through large volumes of video data to find a suspect of interest.

While the scale of the data being collected and analyzed might be new, the challenge of finding ways to analyze large datasets has been around for at least a century. The modern era of data processing could be considered to start with the 1890 census, where the advent of punch-card technology allowed the decennial count to be completed in one year rather than eight.10 World War II spurred the development of code-breaking and target-tracking computers, which further advanced the state of practice in rapidly analyzing and acting upon large volumes of data.11 The Cold War, along with commercial interests, further fueled demand for increasingly high-performance computers that could solve problems ranging from fluid dynamics and weather to space science, stock trading, and cryptography.

For decades the United States government has funded research to accelerate the development of high-performance computing systems that could address these challenges. During the 1970s and 1980s, this investment yielded the development and maturation of supercomputers built around specialized hardware and software (e. …
