Many organizations plan to implement data leakage protection (DLP), and many believe it will solve all of their data exfiltration problems. They're wrong.
While some data leakage problems can be solved with DLP alone - it's pretty easy to spot Social Security numbers or credit card numbers - some of the most onerous require a lot more definition than DLP alone can give. It's a bit like expecting an IDS to identify all threats without defining what those threats consist of. Something needs to tell the DLP system what is acceptable data exfiltration and what is unacceptable leakage. That's where data classification comes in.
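The "easy" cases are easy because those identifiers have rigid structure. As an illustration only - the patterns and function names below are our own, not any particular DLP product's - a pattern-based detector might look like this, with a Luhn checksum to weed out random digit runs that merely resemble card numbers:

```python
import re

# Illustrative patterns for well-structured identifiers. A real DLP
# engine uses far more context, but the principle is the same.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")

def luhn_valid(number: str) -> bool:
    """Luhn checksum: filters digit runs that only look like card numbers."""
    digits = [int(d) for d in re.sub(r"\D", "", number)][::-1]
    total = sum(d if i % 2 == 0 else (d * 2 - 9 if d * 2 > 9 else d * 2)
                for i, d in enumerate(digits))
    return total % 10 == 0

def find_leaks(text: str) -> list[str]:
    """Return tagged matches for SSN-shaped and card-shaped strings."""
    hits = [f"SSN:{m}" for m in SSN_RE.findall(text)]
    hits += [f"CARD:{m.strip()}" for m in CARD_RE.findall(text) if luhn_valid(m)]
    return hits
```

Structured identifiers like these can be caught without any classification policy at all - which is exactly why everything less structured cannot.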
Data classification, done manually, is one horrendous slog. Manually working through tens or hundreds of thousands of documents and emails to classify them would be unacceptable to just about any organization. So most never do it. But that's only one issue.
Another important piece of the data classification puzzle is determining who owns the data. In most organizations, it is hard to make that determination for certain types of assets. If the organization has an ERP system, for example, who owns the backend data that makes it work? More important, which backend data? A big ERP system can have financial, inventory, HR and other types of data sets.
Most enterprises today use discretionary access control, and therein lies the answer. If we grant access to data based on authorization from the owner, why not classify data the same way? That may seem to take us back around to 'nobody wants to take the responsibility for owning the data,' but if we get a bit more granular we can say that anyone who creates a discrete piece of data owns it and owns the right - responsibility, really - to classify it. If we create classification guidelines, we can make that process quite easy for the users.
Now we get back to the problem of classifying legacy data. That is, as we said, a huge challenge. However, we can create policies that define touch points in a document or email that help determine its classification. If we keep the number of classifications simple - public, internal use and confidential, for example - we can teach our data classification tool to recognize and classify documents correctly.
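A minimal sketch of that touch-point idea, assuming a simple three-tier scheme - the marker phrases and level ordering here are hypothetical examples, not a recommended policy:

```python
# Rule-based classifier: each level has "touch point" phrases, and the
# most sensitive level with a hit wins. Markers below are assumptions.
LEVELS = ["public", "internal use", "confidential"]  # lowest to highest

TOUCH_POINTS = {
    "confidential": ["social security", "salary", "merger", "password"],
    "internal use": ["draft", "project plan", "org chart"],
}

def classify(text: str) -> str:
    """Return the highest classification whose touch points appear in text."""
    text = text.lower()
    for level in reversed(LEVELS):  # check most sensitive first
        for marker in TOUCH_POINTS.get(level, []):
            if marker in text:
                return level
    return "public"  # default when no touch point fires
```

Run in bulk over a legacy document store, even rules this crude give every document a starting label that users can then correct - far better than starting from nothing.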
For this month's emerging products, we looked at some of these tools and were pleased with what we saw. We consider the whole notion of automated data classification to be an emerging area of cybersecurity because it is not in widespread use - even though it should be. There are a couple of tools that have been around for a while and we see these improving every year, but we also see new players on the scene and they are doing some rather interesting things.
Functionalities we look for in these products include the ability to perform bulk data classification - that is, classifying the mass of legacy documents and emails sure to exist when you deploy the tool - ease of use for the end user classifying his or her data, and the ability to carry the classification with the document no matter where the document ends up.
We also like to see the effect of mixed classifications. For example, if we take a public email and attach a confidential document, what happens? Does the system refuse to send the email? Does it upgrade the email to confidential? Or does it raise an alert and then give the user the option? It might do a combination of these as well.
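Whatever the product's answer, the underlying rule is usually some form of "highest classification wins." A hypothetical sketch of that logic - the function and level names are ours:

```python
# An email's effective classification is the highest among its own label
# and its attachments'; an upgrade past the sender's label is flagged so
# the system can block, relabel or alert as policy dictates.
RANK = {"public": 0, "internal use": 1, "confidential": 2}

def effective_label(email_label: str, attachment_labels: list[str]) -> tuple[str, bool]:
    """Return (effective classification, whether the email was upgraded)."""
    highest = max([email_label, *attachment_labels], key=RANK.get)
    upgraded = RANK[highest] > RANK[email_label]
    return highest, upgraded
```

The interesting product-design question is what happens when the second value is true: silent upgrade, hard block, or a prompt that puts the decision back in the user's hands.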
Finally, we look at how well the tool interacts with other offerings, such as DLP systems. The ability to work together is pretty important when we are considering the close relationship between data exfiltration and data classification. Some nice-to-have features include ad hoc classification. In other words, if we create a document on a different computer and bring it to work on a thumb drive, will the system try to classify it as soon as it enters the enterprise?
In general, then, we look for the more advanced functionality when we are looking at emerging products and this batch doesn't disappoint.