The problem with Big Data

Big Data is a big buzzword these days. It's completely understandable why so many people in tech talk about it, even though few people completely understand it.

So, what is Big Data?

With the massive growth of data centers worldwide in the past thirty years or so, we're creating, transmitting, and storing more data than ever before.

We're well beyond terabytes, petabytes, and now even exabytes. We're quickly zooming into zettabytes in global capacity and transmission. A petabyte is about a thousand terabytes, an exabyte is about a thousand petabytes, and a zettabyte is about a thousand exabytes. It's absolutely mind-blowing.
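To make those jumps concrete, here is a minimal sketch of the arithmetic. It assumes decimal (SI) units, where each step is exactly a factor of one thousand; the function name terabytes_in is made up for this example.

```python
# Decimal (SI) storage units, each a thousand times the previous one.
UNITS = ["terabyte", "petabyte", "exabyte", "zettabyte"]


def terabytes_in(unit: str) -> int:
    """Return how many terabytes make up one of the given unit."""
    steps = UNITS.index(unit)  # number of thousand-fold jumps above a terabyte
    return 1000 ** steps


for unit in UNITS:
    print(f"1 {unit} = {terabytes_in(unit):,} terabytes")
```

By this count, a single zettabyte is a billion terabytes.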

If you're a visual thinker, here are some handy graphs, courtesy of the American Association for the Advancement of Science.

The global capacity to store, communicate, and compute data:

[Graphs: global installed capacity; global effective capacity to communicate data; global capacity to broadcast data, in compressed megabytes]

As you can see, Big Data and its immensity include data from all kinds of sources. But when we talk about Big Data, we usually mean the data in the millions of servers around the world, both on and off the internet. In computer science, the term refers to a quantity of data, both in storage and in transit, that's very difficult for our commercial and institutional computer systems to manage, curate, and analyze quickly.

A lot of the data that we can't effectively act upon is malicious. That includes malware and information security attacks on networks and distributed computing clusters.

Herein lies the big security problem.

Big Data security vulnerabilities

Much of the most popular software for managing Big Data wasn't initially designed with security in mind.

Hadoop launched in 2005. Even now in 2014, Hadoop has so many vulnerabilities that it scares me a little.

Hadoop wasn't developed with much encryption, nor with compliance with common information security policy standards in mind. Hadoop still has no encryption on nodes, nor on the data transmitted between them. The project was originally developed just to handle publicly available data, such as the web. It does use Kerberos for authentication, but most network administrators and security professionals know how difficult Kerberos is to implement.

Even popular Hadoop tools such as HBase, Pig, and Hive lack security measures in their implementation.

Consider the massive number of insecure nodes in any and all Hadoop systems. Then multiply that by how widespread Hadoop is. A lot of Microsoft Azure systems use Hadoop, as do Amazon EC2/S3 services, Yahoo!, and most of the world's most popular websites, e-commerce sites, and cloud services. On November 8th, 2012, Facebook announced that they had over 100 petabytes in their Hadoop system, and that it was growing by about half a petabyte per day. In 2013, more than half of all Fortune 50 companies said they used Hadoop.
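As a rough illustration of what that growth rate means, the sketch below projects how long a 100 petabyte system would take to reach a full exabyte. It assumes the half-petabyte-per-day rate stays constant, which is a simplification of mine and not something Facebook claimed; the function name days_until is invented for this example.

```python
# Figures quoted from Facebook's November 2012 announcement.
START_PB = 100.0          # petabytes already stored
GROWTH_PB_PER_DAY = 0.5   # petabytes added per day

# Assumption for illustration only: growth stays linear at the quoted rate.


def days_until(target_pb: float,
               start_pb: float = START_PB,
               growth_pb_per_day: float = GROWTH_PB_PER_DAY) -> float:
    """Days needed to grow from start_pb to target_pb at a constant rate."""
    return (target_pb - start_pb) / growth_pb_per_day


days = days_until(1000.0)  # 1 exabyte is about 1,000 petabytes
print(f"{days:.0f} days, about {days / 365:.1f} years")
```

Even under that conservative assumption, a single company's cluster would cross the exabyte mark in roughly five years.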

Chances are, you interacted with a Hadoop system today, and you didn't even know it.

One alternative to Hadoop is the open source Sector/Sphere project. Sector is a file server system, and Sphere is its processing engine.

Sector supports file transfer encryption, unlike Hadoop. It's also better designed to be integrated into secure protocols, such as Lightweight Directory Access Protocol (LDAP).

Unfortunately, its use is nowhere near as widespread as Hadoop's. Hopefully, that'll change.

An even bigger issue is the very nature of Big Data. It consists of millions of servers, in countless data centers worldwide. Each is a building with all of the associated physical security vulnerabilities.

The way it's all linked together, and the speed at which it operates, result in so many software vulnerabilities and other bugs that we can barely make a dent in reporting and fixing all of them.

So, what can we do about it?

As I mentioned, when building a Big Data system, consider software that implements as many encryption and authentication measures as possible. All of the principles that pertain to hardening data centers can be extrapolated to hardening enormous data centers, which are the easiest manifestation of Big Data to conceptualize.

Physical security is often overlooked and it mustn't be. Is there any way for unauthorized individuals to enter your server rooms?

Log everything, as much as technologically possible. Don't build any data systems until you have the hardware, software, and network infrastructure necessary to log a lot more data and events than you think you'll ever have. Use the most appropriate log analysis software for each and every system and function. But also make sure that well trained human beings check those logs on a frequent schedule – don't depend on alerts.
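To give a feel for what a regular human review might start from, here is a minimal sketch that condenses authentication logs into a short summary a person can actually read. The log format, the event names AUTH_FAIL and AUTH_OK, and the function names are all hypothetical and invented for this example; real systems will each need their own parser.

```python
import re
from collections import Counter

# Hypothetical log format, invented for this example:
#   2014-03-01T12:00:00 AUTH_FAIL user=alice src=10.0.0.5
LINE_RE = re.compile(r"^(\S+) (\w+) user=(\S+) src=(\S+)$")


def failed_logins_by_source(lines):
    """Count AUTH_FAIL events per source address, skipping unparseable lines."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group(2) == "AUTH_FAIL":
            counts[match.group(4)] += 1
    return counts


def review_report(lines, top=5):
    """Return the busiest sources, most failures first, for a human to read."""
    return failed_logins_by_source(lines).most_common(top)


sample = [
    "2014-03-01T12:00:00 AUTH_FAIL user=alice src=10.0.0.5",
    "2014-03-01T12:00:02 AUTH_FAIL user=root src=10.0.0.5",
    "2014-03-01T12:00:03 AUTH_OK user=bob src=10.0.0.7",
    "2014-03-01T12:00:09 AUTH_FAIL user=root src=10.0.0.9",
]

for source, failures in review_report(sample):
    print(f"{source}: {failures} failed logins")
```

A summary like this doesn't replace alerts or proper log analysis software. It just gives the person checking the logs a short list to look at first.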

Hire third-party penetration testers at least once a year. Make sure they attack everything, including your people. People are the entry point for social engineering attacks.

When the penetration testers deliver their reports, don't overlook a single detail. Compare their findings to your network configuration and IT security policy, and make adjustments accordingly. Quite often, security hardening involves spending money on hiring more in-house IT security professionals, in addition to more hardware and networking appliances and equipment. I suspect that corporate executives' stinginess toward their IT departments is a Big Data vulnerability that's often overlooked. Far too often.

Big Data just keeps on getting bigger and bigger. It's almost like Moore's Law. And...it has a domino effect.

Our Big Data technology wasn't initially developed with security in mind, but we must work hard to correct that. It requires fixing what we have now, and constantly monitoring and fixing systems as they grow. Vigilance is a must. However overwhelming the international effort may be, if we set our minds to it, we can do it!