
Concerns over AI data quality give new meaning to the phrase: ‘garbage in, garbage out’


In the early 1960s, George Fuechsel, an IBM programmer and instructor, coined the phrase “garbage in, garbage out.” In the years since, it has been widely adopted by professionals in many fields, and at its core it’s about data quality: base assumptions on bad data and you’ll get bad outcomes. Now, as the AI boom takes hold and companies rush to incorporate the technology into all manner of applications, it’s worth taking a moment to heed Fuechsel’s words, review our data quality carefully, and understand how it interacts with AI.

The challenge of garbage data can come from two directions. First, sensitive data could be used unknowingly to train large language models (LLMs), which could leak a great deal of private and potentially protected information. Think of the compliance and security headaches from accidentally exposing financial information, intellectual property, and personally identifiable information.

It’s clear that AI can help improve internal workflows, accelerate data sharing across the organization, and support any number of other beneficial use cases, but these sensitive data leaks can introduce major governance, privacy, and security risks. In fact, a 2023 Gartner survey of 249 senior enterprise risk executives found that mass generative AI availability was the second most frequently cited emerging risk.

Sensitive data disclosure also ranks sixth on the list of top 10 threats for LLM applications compiled by the Open Worldwide Application Security Project (OWASP). The authors pose a scenario in which unsuspecting users interact with AI tools legitimately while a crafted set of prompts bypasses input filters and other data checks, causing the model to reveal sensitive data. “To mitigate this risk, LLM applications should perform adequate data sanitization to prevent user data from entering the training model data,” the authors wrote.
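
What that sanitization looks like in practice varies widely, but as a rough illustration, the Python sketch below redacts a few common PII patterns from text before it joins a training corpus. The patterns and function names are assumptions for illustration, not part of the OWASP guidance.

```python
import re

# Illustrative patterns only; real pipelines typically add named-entity
# recognition, checksum validation, and catalog metadata on top of regexes.
REDACTION_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "CARD": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def sanitize(text: str) -> str:
    """Replace likely PII with type tags before the text enters training data."""
    for label, pattern in REDACTION_PATTERNS.items():
        text = pattern.sub(f"[REDACTED_{label}]", text)
    return text

print(sanitize("Contact Jane at jane.doe@example.com, SSN 123-45-6789."))
# Contact Jane at [REDACTED_EMAIL], SSN [REDACTED_SSN].
```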

Readers with a sense of history might recall the data leak protection programs of the past, which scanned for Social Security numbers and other sensitive information. But LLMs can combine data from a variety of internal sources and private and public clouds, and package it in new ways that make it harder to track, source, and therefore protect. And unlike the scenarios traditional data loss prevention (DLP) products were built for, where data gets exported for particular and potentially criminal purposes, here the private data gets used to create the LLM itself.

So, what does this mean for organizations looking to embark on the LLM gold rush? They need a full understanding of where they’re storing sensitive information and who has access to it, along with the ability to track that data as it flows through the organization. They should also impose sufficient access controls to ensure the data remains within secure storage repositories.
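
One way to picture those controls is a default-deny check before any data store feeds a training pipeline. The sketch below is hypothetical (the store names, sensitivity tags, and reader lists are invented for illustration and would normally come from a data catalog or cloud security platform), but it captures the posture: unknown or restricted sources stay out unless explicitly allowed.

```python
# Hypothetical inventory; in practice this comes from a data catalog or
# cloud security platform rather than a hard-coded dictionary.
DATA_STORES = {
    "s3://finance-reports": {"sensitivity": "restricted", "readers": {"finance-etl"}},
    "s3://public-docs": {"sensitivity": "public", "readers": {"finance-etl", "llm-trainer"}},
}

def can_use_for_training(store: str, pipeline: str) -> bool:
    """Permit a pipeline to pull a store into training data only if the store
    is known and either non-restricted or the pipeline is an approved reader."""
    meta = DATA_STORES.get(store)
    if meta is None:
        return False  # unknown (shadow) data source: block by default
    if meta["sensitivity"] == "restricted" and pipeline not in meta["readers"]:
        return False
    return True

print(can_use_for_training("s3://finance-reports", "llm-trainer"))  # False
print(can_use_for_training("s3://public-docs", "llm-trainer"))      # True
```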

Garbage data doesn’t have to include anything sensitive to present risk and cause harm. It could be deliberately or inadvertently inaccurate, which means models built on top of it would lack effectiveness and, in some cases, offer dangerous or misleading guidance of the kind AI adherents euphemistically call “hallucinating.” It’s often as simple as an outdated copy of the data ending up in a training dataset, or a developer picking the wrong dataset for their models.

This isn’t a new problem: IT staffs have long fought to eliminate what has come to be called shadow data. But LLMs and AI can greatly accelerate the trend, proliferating these shadow copies across an entire enterprise and letting them quickly go stale.

For both of these issues – sensitive and shadow data – there are two practical solutions. First, we need better automation. That means finding ways to closely integrate an LLM with the existing application development environment, so there’s no need to make special copies of the data. In other words, the LLM lives natively in the data cloud, with the existing rights and access controls. Integration can also make it easier to implement security policy rules, similar to what’s done at the networking level.

Second, we need better data classification, so models can help structure the data and make it easier to visualize the relationships, trends, and other insights that inform decisions. Teams can bake this classification into the model itself, and a cloud security provider can deliver some of it.
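
To make the idea concrete, here’s a minimal rule-based classification sketch in Python. The labels and rules are assumptions for illustration; real classification engines layer ML models and catalog metadata on top of this kind of pattern matching.

```python
import re

# Illustrative rules only; the label taxonomy is invented for this example.
RULES = [
    ("pii", re.compile(r"\b\d{3}-\d{2}-\d{4}\b|@")),                  # SSN-like number or email marker
    ("financial", re.compile(r"\b(invoice|iban|account number)\b", re.I)),
]

def classify(record: str) -> str:
    """Return the first matching sensitivity class, defaulting to 'internal'."""
    for label, pattern in RULES:
        if pattern.search(record):
            return label
    return "internal"

for row in ["Invoice #4411 due March 3",
            "Employee SSN 123-45-6789",
            "Quarterly roadmap discussion notes"]:
    print(classify(row), "->", row)
# financial -> Invoice #4411 due March 3
# pii -> Employee SSN 123-45-6789
# internal -> Quarterly roadmap discussion notes
```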

Both of these methods help make the data more secure, because it remains behind the various built-in cloud security protections. The onus remains on the organization to ensure these controls stay current and prevent insecure uses. While organizations clearly need to move quickly to take advantage of the AI revolution, data privacy and responsibility must remain a core tenet of operations.

Ron Reiter, co-founder and CTO, Sentra
