Data governance: How can you protect what you don’t know exists?

Almost every day there is news of a company's security breach and, increasingly, many of these incidents originate with an employee or other internal source. The price is high, as any company that has gone through the ordeal will attest. Gartner analysts estimate that the cost of sensitive data breaches will increase 20 percent per year through 2009. Security continues to be top of mind, and companies are investing heavily in improved security processes to monitor and manage vulnerabilities and to control access.

Ironically, with all this emphasis on security, Gartner has found little or no correlation between enterprises that spend the most on security and enterprises that are the most secure.

This would seem counter-intuitive, but for experts in the data management field, this finding corresponds with one of the biggest issues companies face as they try to establish data governance – finding the data.

The reality is that data is usually distributed across systems, often misinterpreted, and easily buried within other pieces of information. Changes to business rules and applications cause data to become more obscure and convoluted over time. This makes it extremely difficult to understand where data is located and how it is transformed as it moves across the enterprise. If a company cannot locate its sensitive data, it cannot protect that data, no matter how many security applications it deploys.

Why is it so difficult to find the data?

There are many companies, such as Vontu, Vericept and Tablus (EMC), that specialize in finding sensitive information in what is called unstructured data, such as text files, Excel spreadsheets and email messages. The relatively small amount of unstructured data on a typical laptop usually contains no more than a few dozen pieces of sensitive data. Locating this type of data is usually manageable, as long as the right technologies are in place, and many companies have ongoing efforts to analyze and remediate unstructured data.

Unfortunately, most sensitive information is stored in databases and applications – what is called structured data. Corporate databases, for example, contain hundreds or thousands of tables, each with dozens or hundreds of columns and millions of rows. This is where sensitive data protection gets more complex.

In a structured world, sensitive data is elusive

Two big misconceptions about structured data discovery persist, to the detriment of enterprise security.

The first misconception is that sensitive information is easier to find in structured environments, because companies mistakenly assume that structured data is broken down into logically named and well documented tables and columns.

The second misconception is that the same technologies and methods used to find sensitive data in unstructured environments can be applied to scan databases in the structured environment.

Unfortunately, there is a set of unique challenges in the structured data world that reveals a different reality. First, there is the problem of detection:

Proximity matters: Unlike files, where related data is in close proximity, structured data offers no such guarantee. Most data is broken into the smallest discrete chunks of information, making it impossible to tell whether any single discrete value is sensitive. For example, a column containing street numbers is indistinguishable from any other numeric column, and the customer name that would confirm a customer address may be located in a different table.

Mislabeled or overloaded columns: Column names are often meaningless or misleading, and columns are frequently overloaded with different types of information. For example, the same column that contains purchase order numbers for items bought by companies may also contain credit card numbers for items bought by individuals.
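One way to spot an overloaded column is to profile its values rather than trust its name. The sketch below is only an illustration under stated assumptions: the function names and the sample purchase order values are hypothetical, and the Luhn checksum is a generic test that many payment card numbers satisfy, not a method taken from any particular discovery product.

```python
import re

def luhn_valid(digits: str) -> bool:
    """Return True if a string of ASCII digits passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:          # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def looks_like_card_number(value: str) -> bool:
    """Heuristic: 13 to 19 digits (spaces and hyphens ignored) with a valid checksum."""
    stripped = re.sub(r"[ -]", "", value)
    return bool(re.fullmatch(r"[0-9]{13,19}", stripped)) and luhn_valid(stripped)

def card_number_share(values) -> float:
    """Fraction of a column's values that look like payment card numbers."""
    values = list(values)
    if not values:
        return 0.0
    return sum(looks_like_card_number(v) for v in values) / len(values)

# Hypothetical "order reference" column that mostly holds purchase order numbers.
column = ["PO-2007-00123", "4111 1111 1111 1111", "PO-2007-00124", "PO-2007-00125"]
print(card_number_share(column))  # 0.25
```

A non-zero share in a column that is supposed to hold purchase order numbers is a signal to investigate, not proof: some ordinary identifiers pass the checksum by chance.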

Errors abound: Database columns may contain unexpected data due to human or computer errors. Mistakes can range from someone accidentally typing a social security number into a name field to a data load unwittingly storing a customer's social security number as an unprotected Web logon ID.
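Because such errors can land in any column, a scan has to look at values everywhere instead of only at columns whose names suggest sensitive content. The following is a minimal sketch, assuming a SQLite database and a simple pattern for values formatted like social security numbers; the table, column names and sample rows are hypothetical.

```python
import re
import sqlite3

# Values shaped like NNN-NN-NNNN; other formats would need their own patterns.
SSN_PATTERN = re.compile(r"\b[0-9]{3}-[0-9]{2}-[0-9]{4}\b")

def find_ssn_like_values(conn):
    """Return (table, column, rowid) for every text value containing an SSN-shaped string."""
    findings = []
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    for table in tables:
        # PRAGMA table_info returns one row per column; index 1 is the column name.
        columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
        for column in columns:
            for rowid, value in conn.execute(f'SELECT rowid, "{column}" FROM "{table}"'):
                if isinstance(value, str) and SSN_PATTERN.search(value):
                    findings.append((table, column, rowid))
    return findings

# Hypothetical data: one SSN typed into a name field, one loaded as a logon ID.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, logon_id TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [
    ("Ann Lee", "alee"),
    ("123-45-6789", "bwong"),
    ("Cara Diaz", "987-65-4321"),
])
print(find_ssn_like_values(conn))
# [('customers', 'name', 2), ('customers', 'logon_id', 3)]
```

A full scan like this reads every row, so on tables with millions of rows a real effort would more likely sample values than read them all.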

Encoded data: The majority of columns in most large normalized databases contain small enumerated types of encoded data. For example, there may be a “diseases” table that assigns a unique key to each disease. All other tables use this key instead of the actual disease name – a method that makes it easy to update data in a single record and ensures the change is automatically propagated throughout. While neither the key nor the disease table is sensitive on its own, when put together and connected to other data, they can yield highly sensitive information.
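The diseases example can be made concrete with a small sketch. It assumes a SQLite database with two hypothetical tables, "diseases" and "visits"; the names and rows are invented for illustration. Neither table looks sensitive when scanned alone, but a single join turns an integer key into a diagnosis attached to a person.

```python
import sqlite3

def build_demo():
    """Create a hypothetical lookup table and a table that references it by key."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE diseases (disease_id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE visits (patient_name TEXT, disease_id INTEGER)")
    conn.executemany("INSERT INTO diseases VALUES (?, ?)",
                     [(1, "Influenza"), (2, "Diabetes")])
    conn.executemany("INSERT INTO visits VALUES (?, ?)",
                     [("Ann Lee", 2), ("Bo Wong", 1)])
    return conn

def resolve_diagnoses(conn):
    """Join the encoded key back to its meaning, exposing who has which disease."""
    return conn.execute(
        "SELECT v.patient_name, d.name "
        "FROM visits v JOIN diseases d ON d.disease_id = v.disease_id "
        "ORDER BY v.patient_name"
    ).fetchall()

conn = build_demo()
# Scanned alone, the visits table shows only names and small integers.
print(conn.execute("SELECT * FROM visits ORDER BY patient_name").fetchall())
# [('Ann Lee', 2), ('Bo Wong', 1)]
print(resolve_diagnoses(conn))
# [('Ann Lee', 'Diabetes'), ('Bo Wong', 'Influenza')]
```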

Data integrity: Theoretically, these key relationships may be specified and enforced by a database to ensure data integrity. In practice, however, most production databases turn this capability off for performance reasons – making it difficult to reconstruct the context of key relationships and to determine whether different sensitive fields are actually related.
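When key relationships are not declared, one generic way to recover candidates is to look for inclusion: a column whose values all appear in some other column with unique values may be an undeclared foreign key. The sketch below illustrates that idea only; the function names, the in-memory table layout and the sample data are assumptions, and nothing here is drawn from a specific tool.

```python
def inclusion_ratio(child_values, parent_values) -> float:
    """Fraction of distinct non-null child values that also occur in the parent column."""
    child = {v for v in child_values if v is not None}
    if not child:
        return 0.0
    return len(child & set(parent_values)) / len(child)

def candidate_foreign_keys(tables, threshold=1.0):
    """Return (child_table, child_col, parent_table, parent_col) candidates.

    `tables` maps table name -> {column name -> list of values}.
    A parent column must hold unique non-null values, as a key would.
    """
    candidates = []
    for parent_table, parent_cols in tables.items():
        for parent_col, parent_vals in parent_cols.items():
            non_null = [v for v in parent_vals if v is not None]
            if not non_null or len(set(non_null)) != len(non_null):
                continue  # not unique, so it cannot act as a key
            for child_table, child_cols in tables.items():
                for child_col, child_vals in child_cols.items():
                    if (child_table, child_col) == (parent_table, parent_col):
                        continue
                    if inclusion_ratio(child_vals, parent_vals) >= threshold:
                        candidates.append(
                            (child_table, child_col, parent_table, parent_col))
    return candidates

# Hypothetical sample of two tables with no declared relationship.
tables = {
    "diseases": {"disease_id": [1, 2, 3], "name": ["Flu", "Diabetes", "Asthma"]},
    "visits": {"patient": ["Ann", "Bo", "Ann"], "disease_id": [2, 1, 2]},
}
print(candidate_foreign_keys(tables))
# [('visits', 'disease_id', 'diseases', 'disease_id')]
```

Inclusion alone produces false positives, since any column of small integers fits inside many ID columns, so the output is a list of candidates to confirm rather than a set of established relationships.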

Hidden sensitive data: Sometimes sensitive information can be derived from innocuous-looking fields. For example, a Web logon may contain the last four digits of a social security number, or a demographic code may reveal the age and gender of a person.
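The Web logon case can be tested directly wherever a known sensitive column sits alongside the suspect field. This is a minimal sketch, assuming rows of (logon, social security number) pairs; the function names and sample values are hypothetical.

```python
def embeds_ssn_suffix(logon: str, ssn: str) -> bool:
    """True if the logon contains the last four digits of a nine-digit SSN."""
    digits = "".join(ch for ch in ssn if ch in "0123456789")
    return len(digits) == 9 and digits[-4:] in logon

def leak_rate(rows) -> float:
    """Fraction of (logon, ssn) rows in which the logon embeds the SSN suffix."""
    rows = list(rows)
    if not rows:
        return 0.0
    return sum(embeds_ssn_suffix(logon, ssn) for logon, ssn in rows) / len(rows)

# Hypothetical rows: the first logon was built from the customer's SSN.
rows = [("alee6789", "123-45-6789"), ("bwong", "987-65-4321")]
print(leak_rate(rows))  # 0.5
```

A rate well above what coincidence would explain suggests the field is derived from the sensitive value and should be treated as sensitive itself.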

Finally, there is the problem of reporting and remediation. For unstructured data, the unit of analysis for reporting and remediation is a file. For structured data, it is a column, a row or a value. While it is usually straightforward to quarantine a file, it is very difficult to do the same for database rows and columns, since they are used by different applications in a variety of ways that are difficult to capture.

New tools hold the promise of help

The complexity of looking for sensitive data in structured data sets is staggering. Most companies are just beginning to think about protection of company data assets and are embarking on discovery efforts to identify all instances of sensitive data in their structured data systems. In the last two to three years, new data discovery tools have emerged, and are finding a receptive audience – particularly in the financial, government and healthcare sectors – where control over sensitive data is a top priority.

While security professionals are not, nor should they be, data management experts, there is a unique place for security knowledge and expertise in the data management field, and vice versa. With the onslaught of security breaches aimed at the core of a company's assets, there will be a continued integration of these two fields.