Kim Crawley, researcher, InfoSec Institute

Big Data is a big buzzword these days. It's completely understandable why so many people in tech talk about it, even though few people completely understand it.

So, what is Big Data?

With the massive growth of data centers worldwide in the past thirty years or so, we're creating, transmitting, and storing more data than ever before.

We're well beyond terabytes, petabytes, and now even exabytes. We're quickly zooming into zettabytes in global capacity and transmission. A petabyte is about a thousand terabytes, an exabyte is about a thousand petabytes, and a zettabyte is about a thousand exabytes. It's absolutely mind-blowing.
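If you'd rather see that ladder of units as arithmetic, here is a minimal Python sketch. It assumes decimal (SI) prefixes, where each step is exactly a factor of 1,000; binary units would use 1,024 instead. The names `UNITS`, `bytes_in`, and `ratio` are made up for this illustration.

```python
# Decimal (SI) storage units, each a thousand times the previous one.
# Assumption: SI prefixes, so one terabyte is 10**12 bytes.
UNITS = ["terabyte", "petabyte", "exabyte", "zettabyte"]


def bytes_in(unit: str) -> int:
    """Return the number of bytes in one of the units listed above."""
    return 10 ** (12 + 3 * UNITS.index(unit))


def ratio(larger: str, smaller: str) -> int:
    """Return how many of the smaller unit fit into one of the larger unit."""
    return bytes_in(larger) // bytes_in(smaller)


print(ratio("petabyte", "terabyte"))   # 1000
print(ratio("zettabyte", "terabyte"))  # 1000000000
```

In other words, a single zettabyte is roughly a billion terabytes.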

If you're a visual thinker, here are some handy graphs, courtesy of the American Association for the Advancement of Science.

The global capacity to store, communicate, and compute data:

Figure: Global installed capacity

Figure: Global effective capacity to communicate data

Figure: Global capacity to broadcast data in compressed megabytes

As you can see, Big Data and its immensity include data from all kinds of sources. But when we talk about Big Data, we usually mean the data in the millions of servers around the world, both on and off the internet. In computer science, the term refers to a quantity of data, both in storage and in transit, that's very difficult for our commercial and institutional computer systems to manage, curate, and analyze in an expedient fashion.

A lot of the data that we can't effectively act upon is malicious. That includes malware and information security attacks in networks and in distributed computing clusters.

Herein lies the big security problem.

Big Data security vulnerabilities

A lot of the most popular software for managing Big Data wasn't initially designed with security in mind.

Hadoop launched in 2005. Even now in 2014, Hadoop has so many vulnerabilities that it scares me a little.

Hadoop wasn't developed with much encryption, nor with compliance with common information security policy standards in mind. Hadoop still has no encryption on nodes, nor for the data transmitted between them. The project was originally developed just to handle publicly available data, such as the web. It does use Kerberos for authentication, but most network administrators and security professionals know how difficult Kerberos is to implement.

Even popular Hadoop tools such as HBase, Pig, and Hive lack security measures in their implementation.

Consider the massive number of insecure nodes in any and all Hadoop systems. Then multiply that by how widespread Hadoop is. A lot of Microsoft Azure systems use Hadoop, as do Amazon EC2/S3 services, Yahoo!, and most of the world's most popular websites, e-commerce sites, and cloud services. On November 8th, 2012, Facebook announced that it had over 100 petabytes in its Hadoop system, growing by about half a petabyte per day. In 2013, more than half of all Fortune 50 companies said they used Hadoop.
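To get a feel for what that growth rate means, here is a rough back-of-the-envelope sketch in Python. It takes the two Facebook figures quoted above (100 petabytes, about half a petabyte per day) and assumes, purely for illustration, that growth stays constant and linear, which real systems almost certainly don't do. The function name `days_to_reach` is hypothetical.

```python
def days_to_reach(target_pb: float,
                  start_pb: float = 100.0,
                  growth_pb_per_day: float = 0.5) -> float:
    """Days of constant, linear growth needed to go from start_pb to target_pb.

    Assumption: the growth rate never changes. The defaults are the
    figures quoted in the article; everything else is illustrative.
    """
    return (target_pb - start_pb) / growth_pb_per_day


print(days_to_reach(200.0))   # 200.0 days to double from 100 PB
print(days_to_reach(1000.0))  # 1800.0 days to reach 1,000 PB (one exabyte)
```

Under that simplification, the cluster would double in well under a year, and every extra petabyte is more data sitting on nodes that need protecting.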