Identity theft is one of the fastest-growing crimes worldwide. In the U.S. alone, an identity theft occurs every 79 seconds. Given the technology safety nets available to mitigate the risk of unintended personal data disclosure, the continuing wave of data breaches fueling identity theft is simply unacceptable.
The breadth of the problem is also rapidly expanding: personal data breaches are no longer limited to credit card clearing firms, online banks, brokerage firms and off-shore data clearing firms.
The best of intentions
Good intentions have resulted in several serious incidents of unintentional personal data leakage at organizations not traditionally thought of as “at risk” for personal data breaches, including publicly released court documents, web search research data and a URL database in a security tool intended to reduce the risk of phishing.
First case in point: the U.S. Federal Energy Regulatory Commission (FERC) released a massive amount of information resulting from its investigation of Enron and the Western Energy Crisis. The release included 92 percent of Enron staff e-mails, along with over 85,000 records and 150,000 scanned pages of information provided to FERC during the investigation. All of it was available to anyone with an Internet connection.
AOL’s user search data released
AOL provided another example of good intentions gone bad when it released search data for 685,000 of its users in an effort to gain recognition from the academic and research community. The data was quickly mirrored across multiple servers on the web and was easily downloadable by anyone with an Internet connection.
Google’s safe browsing initiative
Google provides a free toolbar add-on that alerts users when a webpage they are visiting may be asking for personal or financial information under false pretenses.
While Google’s intention of thwarting phishing with a free product is noble, the URL updates Google distributes in support of the effort actually expose personal information themselves.
Current tech to plug personal data leakage
Fortunately, there are technologies readily available to mitigate the risk of personal data exposure. Here are three methodologies commonly in use today.
Digital rights management (DRM)
DRM-based content management is effective only in maintaining control over specified documents; it is not effective in securing data more broadly. Further, DRM provides no safety net against user error in rights assignment: a wayward or disgruntled document owner, or a user with access to an unprotected copy, could assign rights to a third party in order to pass along personal information.
In the examples given earlier, DRM could potentially have restricted access to the AOL data to the researchers for whom it was originally intended, but it would not have mitigated the exposure risk for FERC or Google, where the data was intended to be made generally available to the public.
Traditional secure content management (SCM)
The security afforded by a traditional SCM implementation rests in part on the administrative development of a data dictionary. In the simplest terms, the data dictionary contains information such as watermarks, keywords such as “password” and “user name,” and generic templates describing the format of credit card numbers, Social Security numbers, driver’s license numbers and other personal information. All content is then filtered against the data dictionary to enforce compliance.
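The data dictionary approach can be sketched as a set of keywords and format templates that all outbound content is matched against. The following is an illustrative sketch only; the category names and patterns are hypothetical examples, not any vendor’s actual dictionary:

```python
import re

# Hypothetical data dictionary an administrator might define: keyword
# entries plus generic templates for personal-data formats. Real SCM
# products ship far larger, product-specific dictionaries.
DATA_DICTIONARY = {
    "keyword": [re.compile(r"\bpassword\b", re.I),
                re.compile(r"\buser ?name\b", re.I)],
    "ssn": [re.compile(r"\b\d{3}-\d{2}-\d{4}\b")],           # Social Security number format
    "credit_card": [re.compile(r"\b(?:\d[ -]?){13,16}\b")],  # 13-16 digit card number
}

def scan(text):
    """Return the sorted dictionary categories whose patterns match the text."""
    return sorted(cat for cat, patterns in DATA_DICTIONARY.items()
                  if any(p.search(text) for p in patterns))

hits = scan("Please reset my password; my SSN is 123-45-6789.")
```

A message that triggers any category would then be blocked or quarantined by the SCM gateway. The administrative burden the article describes is visible even here: every new data type requires another hand-crafted pattern.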
Traditional SCM could have afforded effective risk mitigation in each of the three examples of data leakage given earlier. However, security would have been at the cost of high administrative burden in the development of an effective data dictionary.
Adaptive secure content management (ASCM)
ASCM provides the granular filtering capability of a traditional SCM without the administrative burden of creating an extensive data dictionary. While still using traditional pattern-matching content analysis, ASCM introduces additional capabilities that further enhance SCM risk mitigation and operational efficiency, including but not limited to:
• Fingerprinting: The fingerprinting engine decomposes a document into a series of algorithm-generated hashes, referred to as the document “fingerprint.”
• Adaptive Lexical Analysis: Documents fed into this engine are examined for lexical structures such as the frequency of words and the position of words with respect to each other.
• Clustering: The clustering engine is trained on groups of documents or data sets that are similar in nature. It then scans documents and data to determine whether they are similar to known clusters, which would indicate protected content.
• Advanced Content Filtering: Allows for searching content using “and” and “or” expressions so that multiple dictionaries and Boolean expressions can be used in combination.
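The fingerprinting capability above can be sketched as follows. This is an illustrative approximation only (hashing overlapping word windows and comparing fingerprints by set overlap), not the proprietary algorithm of any commercial ASCM engine:

```python
import hashlib

def fingerprint(text, chunk_words=5):
    """Decompose a document into overlapping word windows and hash each one.
    The set of hashes serves as the document's 'fingerprint'."""
    words = text.lower().split()
    windows = [" ".join(words[i:i + chunk_words])
               for i in range(max(1, len(words) - chunk_words + 1))]
    return {hashlib.sha256(w.encode()).hexdigest() for w in windows}

def similarity(fp_a, fp_b):
    """Jaccard overlap between two fingerprints: 1.0 means identical content."""
    return len(fp_a & fp_b) / len(fp_a | fp_b) if fp_a | fp_b else 0.0

# A protected document versus an outbound message that embeds most of it.
protected = fingerprint("the quarterly results contain customer account numbers and balances")
outbound = fingerprint("note the quarterly results contain customer account numbers and balances today")
score = similarity(protected, outbound)
```

Because matching happens on hashes of content fragments rather than administrator-written patterns, a leaked excerpt still scores high against the protected original even after minor editing, which is what removes the dictionary-building burden.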
ASCM could have afforded effective risk mitigation in each of the three examples of data leakage given earlier without the high administrative burden of traditional SCM offerings in the development of an effective data dictionary.
Organizations that would not typically be considered at risk for personal data disclosure are finding themselves inadvertently in the middle of serious data disclosure incidents. Even with the best of intentions, things can go horribly wrong when technology safety nets are not used to protect personal data.
–Paul Henry is vice president of technology evangelism at Secure Computing.