Everyone knows the 80-20 principle – except the data classification experts

Vilfredo Pareto, the Italian scientist, would be rolling in his grave if he knew how data is treated in a typical data loss prevention (DLP) solution.

The lion's share of data classification focuses on structured data such as credit card information or social security numbers. While important, this makes up only 20 percent of the total data pool. But unstructured data such as PDFs isn't any less worthy of protection. After all, it was the NSA presentations leaked by Edward Snowden that really piqued the public interest. And who is looking out for the real cause of most data leaks: that hard-working individual who takes additional files home but loses them on the return trip? DLP has lots to learn about Pareto's 80-20 principle.

What's with this data classification fetish?

Data classification is crucial to many DLP technologies; I am just not sure why. Classification does not always catch information in image formats such as JPG, and the drivers required for extracting text from various formats are limited. It also needs to be adjusted to the needs of multilingual offices and dual character sets. Even worse, there is a high rate of false positives, and a long time is required for fine-tuning.
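To make the false-positive problem concrete, here is a minimal sketch of content-based classification for one structured data type, credit card numbers. It is an illustration only, not how any particular DLP product works: the pattern `CARD_CANDIDATE` and the helper names are assumptions made for this example. A bare digit pattern flags any long number, so the sketch adds the standard Luhn checksum to discard some obvious non-cards.

```python
import re

# Hypothetical candidate pattern: 13 to 16 digits, optionally separated
# by single spaces or hyphens. On its own it flags order numbers,
# tracking codes and other long digit runs (false positives).
CARD_CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_numbers(text: str) -> list[str]:
    """Return digit strings that look like card numbers and pass Luhn."""
    hits = []
    for match in CARD_CANDIDATE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            hits.append(digits)
    return hits
```

Even with the checksum, this only sees text it can read. The same number inside a scanned JPG never reaches the pattern at all, and roughly one in ten random digit runs still passes Luhn, which is where the fine-tuning effort goes.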

It looks like the experts have focused on the wrong target – albeit for some of the right reasons.

There are reasons for protecting structured data: complying with legislated mandates to protect personally identifiable information (PII) is a start. But let's be honest. Complex data classification issues are a common reason why DLP projects fail to deliver what the client wants: effective data protection.

Context is just not sexy enough

Contextual analysis looks at where the data comes from or where it is created. It is enough for a basic level of data security, especially for that 80 percent of data in the unstructured category. It can provide a “light” approach without the false positive issues that have brought businesses to a halt. Contextual analysis often uses a multilayered approach to block data leaks. But while context is capable of identifying an application or a system generating the sensitive data, it often does not work when a new app is used to manipulate the data.
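A rough sketch of what a context rule might look like, and why a new app slips past it. Everything here is hypothetical: the `FileEvent` fields, the application names, the server path and the destination labels are invented for illustration and do not come from any real product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileEvent:
    source_app: str   # application that created or last saved the file
    source_path: str  # where the file came from
    destination: str  # where the user is trying to send it


# Hypothetical policy: files are treated as sensitive because of where
# they originate, not because of what they contain.
SENSITIVE_APPS = {"cad_tool.exe", "hr_suite.exe"}
SENSITIVE_PATH_PREFIXES = ("//fileserver/research/",)
RISKY_DESTINATIONS = {"usb", "webmail"}


def is_sensitive_by_context(event: FileEvent) -> bool:
    """Sensitive if it came from a listed app or a listed location."""
    return (
        event.source_app in SENSITIVE_APPS
        or event.source_path.startswith(SENSITIVE_PATH_PREFIXES)
    )


def should_block(event: FileEvent) -> bool:
    """Block sensitive files heading to a risky destination."""
    return is_sensitive_by_context(event) and event.destination in RISKY_DESTINATIONS
```

A drawing saved by `cad_tool.exe` and copied to USB is blocked without anyone inspecting its content. Open the same drawing in an unlisted editor, save a copy to the desktop, and the rule no longer recognizes it. That is the weakness described above.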

The problem is that context may just not be granular enough to satisfy lawyers demanding PII protection compliance, and it lacks the technological “wow” needed to satisfy the consultants and their enterprise customers, currently the financial mainstay of the DLP sector.

Let's talk about the “oops” factor

The industry tends to forget that most data leaks are unintentional: users do not mean to violate their organization's data handling processes and share confidential R&D plans with the entire internet. Yes, good people do bad oops.

Users can have an active influence on data protection. Some DLP systems now promote this engagement when a rule conflict arises. Users have the choice of stopping or continuing work (after explaining their actions).
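The stop-or-continue choice described above could be modeled along these lines. This is a sketch under assumptions, not a description of any vendor's workflow: the `Action` values, the `resolve_conflict` function and the audit log format are all made up for the example.

```python
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    BLOCK = "block"


def resolve_conflict(user: str, rule: str, wants_to_continue: bool,
                     justification: str, audit_log: list) -> Action:
    """Let the user stop, or continue only after explaining the action."""
    if not wants_to_continue:
        # The user backed off; nothing leaves and nothing needs review.
        return Action.BLOCK
    reason = justification.strip()
    if not reason:
        # Continuing without an explanation is treated as a stop.
        return Action.BLOCK
    # Record the override so that someone can review it later.
    audit_log.append({"user": user, "rule": rule, "justification": reason})
    return Action.ALLOW
```

The design choice worth noting is that the override is recorded rather than silently allowed, so the user stays in the loop and the security team keeps a trail to review.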

Enabling employee interaction is demanding to manage, and warning-only systems have their limitations. But on the positive side, merely including the individual user in the DLP process improves the security of data handling processes.

Does it blend?

For DLP to be effective, it's time to rethink the basics.

First of all, data security starts and stops with people. There is a need for a people-centric approach to data security that begins with detecting and preventing insecure actions, regardless of whether they are accidental or malicious. Technology wizards need to realize that there are limits to their classification efforts. Context people should look beyond just location and pay more attention to risky types of data.

Yes, there is a trend within the DLP sector for greater user involvement – and I expect this will gain speed over time. This “human factor” approach means taking a close look at what people are doing, sometimes holding end users' hands (educational efforts) and being equipped to slap these hands at other times (enforce IT policies).

The DLP sector has centered too long on enterprise-strength technology. This will have to change if the sector is to grow in the SME segment, where ease of implementation and operation are key. In the data context versus classification debate, I believe nobody really wants DLP or any other technology. They simply want their data to be secure and are looking for the most effective way to do it.

So when it comes to protecting your data “peas”, just remember that someone has to pick those pods – and that 20 percent of the pods are holding 80 percent of the peas. Think about your data – all of it – and make Pareto proud.