The emergence of publicly available Large Language Models (LLMs) over the past year has upended many aspects of industry and society.

The ability of models like ChatGPT to process outside prompts and produce (in some cases) organized, actionable responses that are drawn from massive training sets of data has delighted many users, created a wide range of potential use cases and expanded our collective understanding of what AI is capable of in the modern age.

Almost as soon as they began getting widespread use by the public, however, these LLMs started encountering a whole host of problems, from “hallucination” (inventing facts, studies or events wholesale) and providing inaccurate or incorrect information to the other end of the spectrum, providing accurate (but objectionable, harmful or dangerous) answers to prompts like “How do I build a bomb?” or “Write a program to exploit this vulnerability.”

Companies like OpenAI have attempted to put in place guardrails that prevent their models from giving actionable answers to these kinds of questions. Often times, those parameters can be bypassed by structuring prompts in a specific way to avoid triggering any newly added restrictions, but researchers continue to discover new ways to circumvent them.

Now, a new study from researchers at Carnegie Mellon University, The Center for AI Safety and the Bosch Center for AI claims to have discovered a simple addition to these questions that can be used to jailbreak many of the most popular LLMs in use today. This prompt addendum allowed the researchers to not only trick the models into providing answers to dangerous or sensitive questions, it’s also simple enough to be automated and used for a wide range of different commercial and open-source products, including ChatGPT, Google Bard, Meta’s LLaMA, Vicuna, Claude and others.

“Our approach finds a suffix that, when attached to a wide range of queries for an LLM to produce objectionable content, aims to maximize the probability that the model produces an affirmative response (rather than refusing to answer),” write authors Andy Zou, Zifan Wang, Zico Coulter and Matt Fredrikson.

A diagram depicting different methods of prompt injection attacks used by researchers across multiple commercial and open source Large Language Models (Source: Carnegie Mellon University)
A diagram depicting different methods of prompt injection attacks used by researchers across multiple commercial and open source Large Language Models (Source)

Many previous jailbreak techniques discovered for prompt injection have been manual, ad hoc and restricted to a narrow range of questions or models. Even small changes, like asking the model to begin their answers with an affirmative response to the question (such as “Sure, here is how…”) can increase the likelihood of getting a substantive response to banned questions.

The new suffix has proven to be highly effective across different models.

According to the paper, the researchers were able to use the additions to reliably coax forbidden answers for Vicuna (99%), ChatGPT 3.5 and 4.0 (up to 84%), PaLM-2 (66%), while other models like Claude saw drastically lower success rates (2.1%).

Success rates for different types of prompts across multiple Large Language Models. (Source: Carnegie Mellon University)
Success rates for different types of prompts across multiple Large Language Models. (Source)

The researchers said they shared their data with OpenAI, Google, Meta and Anthropic ahead of publication, and while many of the specific methods they outline may now be closed off, the findings strongly indicate that the isolated and “posthoc” whack-a-mole approach taken by many AI proprietors around adversarial prompting must change if their models are going to be integrated into mainstream society.

“It remains to be seen how this ‘arms race’ between adversarial attacks and defenses plays out in the LLM space, but historical precedent suggests that we should consider rigorous wholesale alternatives to current attempts, which aim at posthoc ‘repair’ of underlying models that are already capable of generating harmful content,” the authors write.

Whether these weaknesses can be mostly or entirely fixed or are fundamentally ingrained in these models is an open question, according to one of the researchers.

“It's uncertain. Adversarial examples in vision have persisted for over a decade without a satisfactory solution, tweeted Zou, one of the authors, on July 27. “It's unclear if this will fundamentally restrict the applicability of LLMs. We hope our work can spur future research in these directions.”

SC Media has reached out to OpenAI and Meta for comment.

A Google spokesperson told SC Media the company continues to fine-tune its Bard Safety Classifier to better detect and respond to prompt injection attacks and has a dedicated red team in place that performs internal testing to identify and address new prompts that can bypass the rules built into Bard.

“We conduct rigorous testing to make these experiences safe for our users, including training the model to defend against malicious prompts and employing methods like Constitutional AI to improve Bard’s ability to respond to sensitive prompts,” the spokesperson said in an emailed response to questions. “While this is an issue across LLMs, we've built important guardrails into Bard – like the ones posited by this research – that we'll continue to improve over time."

Security community unprepared for LLM risks

In the cybersecurity space, LLMs have provided a playground for security researchers, both to find interesting ways to support defenders — like writing malware detection rules or mapping out a vulnerability’s likely impact — as well as breaking through the restraints put in place by their owners and discover new ways to use it for malicious purposes.

Chloé Messdaghi, head of threat research at Protect AI, a startup focused on artificial intelligence and machine learning security that just took in $35 million in Series A funding from investors this week, told SC Media that the Carnegie Mellon University paper and its conclusions demonstrate how the security community is probably “not prepared” to evaluate some of the risks that come with emerging LLMs as they are integrated into society.

Another ProtectAI researcher, Dan McInerney, said while he does believe the owners of these models could do more to reduce the attack surface around prompt injection attacks, the current approach to building guardrails sensitive or controversial questions is “almost pointless” because of how easily they can be circumvented. Additionally, these models are — by design — meant to draw from already available open source data, meaning anyone who wanted to learn to build a bomb can do so relatively easily.

The way AI companies have been responding to these flaws seems to be as much about protecting their own brand and reputation with the public and regulators as it is about finding a durable solution to the problem.

“At the end of the day, curiosity is always going to exist. As long as there is curiosity around the world, there’s always going to be these issues” with AI, said Messdaghi.