Magazine article Information Today

The Key to Smart Big Data: Know Thy Technology

Article excerpt

Insider's Perspective gives guest columnists a chance to write about challenges and solutions in their corner of the information technology industry.

Despite the endless volume of blog posts, articles, and webinars espousing the value of nurturing Big Data, one of the most overlooked and underappreciated topics is the nature of data and why understanding it is paramount to realizing its promised value.

What do I mean by the nature of data? I'm referring to all of the aspects of a particular type of data that qualify it as Big Data:

* How much of it is there?

* Where does it live?

* How quickly is it created?

* How quickly do users need it?

* How do I get insight from it?

Answering these questions for a specific source of data forces you to think differently about how to approach managing it. For many, this means reaching outside the bounds of traditional business intelligence (BI) and warehousing infrastructure for the first time and into the world of Big Data management.

Deciding what technologies to use is a critical step because the technology or technologies must be appropriate for the nature of the data; otherwise, you're putting yourself back in the data warehousing trap of mapping data to the technology as opposed to mapping technology to the data.

Two Types of Unstructured Data

It's common to hear talk of structured and unstructured data, but it's important to separate unstructured data into two categories: unstructured data and unstructured content. The reason for making this distinction is that the approaches to getting insight from each of these sources are very different.

Unstructured data refers to machine-generated data such as system logs, web-click logs, or sensor data. Unstructured data can also refer to XML data. Generally, it refers to data that has structure to it, but it is not stored in a database because it's too big, arrives too quickly, or requires NoSQL approaches to getting insight from the source. This type of data tends to require basic parsing and lends itself to "query-time" analysis such as aggregation or more sophisticated analytics such as click-stream analysis or predictive analytics. So, the ELT (extract, load, transform) nature of Hadoop maps perfectly to this type of data.
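As a rough illustration of the "load first, parse at query time" pattern described above, here is a minimal Python sketch. The log format, the sample lines, and the function names (`parse_line`, `page_views`) are assumptions made up for this example; they are not taken from Hadoop or any particular product.

```python
from collections import Counter

# Hypothetical raw web-click log lines, stored exactly as they arrived
# (the "extract" and "load" steps). Nothing is parsed or cleaned up front.
RAW_LOG = [
    "2013-05-01T10:00:00 user=alice page=/home",
    "2013-05-01T10:00:05 user=bob page=/pricing",
    "2013-05-01T10:00:09 user=alice page=/pricing",
    "malformed line",
]


def parse_line(line):
    """Basic parsing: a timestamp followed by key=value fields.

    Returns None for lines that do not carry both a user and a page.
    """
    parts = line.split()
    fields = dict(p.split("=", 1) for p in parts[1:] if "=" in p)
    if "user" not in fields or "page" not in fields:
        return None
    fields["timestamp"] = parts[0]
    return fields


def page_views(raw_lines):
    """The "transform" step, run only when someone asks the question."""
    records = (parse_line(line) for line in raw_lines)
    return Counter(r["page"] for r in records if r is not None)


if __name__ == "__main__":
    print(page_views(RAW_LOG).most_common())
```

Because the raw lines are kept untouched, a different question (views per user, say) only needs a different query-time function, not a reload of the data.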

On the other hand, unstructured content refers to data sources in which the insight is contained in human-created text. Unstructured content includes social media, email, documents, and even text within "structured data" systems such as CRM systems. This data requires complex extraction and insight derivation (sentiment, key topics, entities, and so forth), and insight from these sources is typically derived on a per-record basis. So, this data tends to lend itself to an ETL (extract, transform, load) approach, as the processing steps (file extraction, language processing, and text analytics) required to gain insight from these sources are too expensive to perform at query time.
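To make the per-record idea concrete, here is a minimal Python sketch in which each record is processed once, before it is stored, so that queries only read the precomputed result. The keyword lists and the `enrich` function are toy stand-ins for real language processing and text analytics, and all names and sample records are hypothetical.

```python
import re

# Toy word lists standing in for a real sentiment model (an assumption
# for illustration only).
POSITIVE = {"great", "love", "excellent"}
NEGATIVE = {"slow", "broken", "terrible"}


def enrich(record):
    """Derive insight from one record's text, once, before it is stored."""
    words = re.findall(r"[a-z']+", record["text"].lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        sentiment = "positive"
    elif score < 0:
        sentiment = "negative"
    else:
        sentiment = "neutral"
    return {**record, "score": score, "sentiment": sentiment}


def ingest(records):
    """Run the expensive step per record, then keep the enriched result."""
    return [enrich(r) for r in records]


def count_by_sentiment(store, label):
    """Query time is cheap: it only reads the stored sentiment field."""
    return sum(1 for r in store if r["sentiment"] == label)


STORE = ingest([
    {"id": 1, "text": "I love the new dashboard, great work"},
    {"id": 2, "text": "Search is slow and the export is broken"},
    {"id": 3, "text": "Meeting moved to Tuesday"},
])

if __name__ == "__main__":
    for label in ("positive", "negative", "neutral"):
        print(label, count_by_sentiment(STORE, label))
```

The cost of `enrich` is paid once per record at ingest, which is the trade-off the paragraph above describes for text whose processing is too expensive to repeat on every query.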

There are also other types of data. We could make a case for "unstructured media" being a category, as different techniques such as speech-to-text and facial recognition may be required, depending on a given use case. I tend to think of these as a subclass of unstructured content, since in many cases the output of the specialized processing is text, but the distinction is very important. …
