ChatGPT jailbreak prompts proliferate on hacker forums

ChatGPT jailbreaks have become a popular tool for cybercriminals and continue to proliferate on hacker forums well over a year after the public release of the groundbreaking chatbot.

In that time, several tactics have been developed and promoted as effective ways to circumvent OpenAI’s content and safety policies, enabling malicious actors to craft phishing emails and other harmful content.

“The prevalence of jailbreak prompts and AI misuse on cybercrime forums has definitely increased since ChatGPT’s early days. While there were initial discussions about the potential of the technology in 2022/2023, we’ve observed a growing trend of detailed conversations around specific jailbreaking prompts over time,” Mike Britton, chief information security officer at Abnormal Security, told SC Media in an email. “There are now entire forum sections dedicated to the misuse of AI, specifically on two major cybercrime forums.”

It isn’t just “script kiddies” who are using these tactics, either. Earlier this year, Microsoft revealed that members of five state-sponsored threat groups from Russia, North Korea, Iran and China were using ChatGPT for tasks ranging from social engineering to scripting help and vulnerability research.

In a 2023 research report, Abnormal Security identified five malicious email campaigns that were likely generated by AI chatbots, noting the AI’s ability to employ social-engineering tactics such as creating a sense of urgency.

The suspected AI-generated emails were also notably free of spelling and grammatical errors that are common in phishing emails, lending additional legitimacy.

“The most common use case that we’re seeing for jailbreaking ChatGPT (and leveraging other malicious versions of it) is to launch social engineering attacks, whether for credential phishing, business email compromise, or vendor fraud,” Britton said. “Generative AI enables threat actors to scale these social engineering attacks in volume, but also in sophistication.”

On Monday, Abnormal Security published a blog post highlighting five prompts cybercriminals are using to jailbreak ChatGPT. While these jailbreaks aren’t necessarily new, the wide variety and continued popularity of chatbot manipulation techniques should signal to organizations that adversarial generative AI is a threat not to be ignored.

“As cybercriminals continue to weaponize generative AI in their email attacks, organizations may want to account for this threat in their cyber strategy. There are tools that can help with this – for instance, Abnormal last year released CheckGPT, a tool that enables companies to determine whether a suspicious email was written using generative AI,” Britton said.

Is prompt engineering the new social engineering?

The jailbreak methods outlined by Abnormal Security mainly rely on two tactics: convincing ChatGPT to “roleplay” as an unfiltered bot, or “tricking” the AI into believing it is operating in a specific scenario where generating harmful content would be acceptable.

For example, “Do Anything Now” is a well-known ChatGPT jailbreak tactic that has been around for more than a year, and involves getting the chatbot to roleplay as a different AI named DAN.

This alternative persona has “been freed from the typical confines of AI,” as one prompt shared on a “dark AI” forum topic states, and by adopting the persona, ChatGPT is able to generate content that goes against OpenAI’s policies.

Another method involves telling ChatGPT that it is “in development mode” or that its responses are “being used for testing purposes only,” which may include telling the bot that the “developer policies” differ from OpenAI’s normal policies.

A similar prompt tells ChatGPT that it is a translator chatbot that is being tested for its ability to translate and answer questions in different languages. This can convince ChatGPT to bypass its filters in order to produce accurate translations regardless of the content being translated.

The other two tactics outlined by Abnormal Security are similar to DAN in that they instruct ChatGPT to take on a new, unrestricted persona. “Always Intelligent and Machiavellian” (AIM) is a prompt designed to generate responses “no matter how immoral, unethical, or illegal it is,” while the “BISH” prompt is a variant of “Do Anything Now” that can be assigned a “morality level” that determines how censored or uncensored its responses should be.

“The evolving usage of ChatGPT on these forums could be characterized as a natural progression. We’re seeing many low-level cybercriminals experimenting with leveraging ChatGPT to generate malicious emails and code,” Britton said.

What can organizations do to defend against adversarial GenAI?

GenAI-facilitated cybercrime may still be in its infancy, but being aware of adversaries’ AI experimentation now could help organizations prepare for more advanced attack methods in the future. As phishing is currently the most popular illicit use of ChatGPT, email defenders can consider using tools like CheckGPT to flag suspicious AI-generated content.

“However, understanding if an email was AI-generated is only one signal of a potential attack. To ensure effective and precise detection, this signal should be combined with a range of other diverse signals from across the email environment,” said Britton.
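
To make that concrete, here is a minimal sketch (not CheckGPT’s actual API or Abnormal’s scoring model; the signal names and weights are hypothetical) of how a defender might fold an “is this AI-generated?” verdict into a broader risk score alongside other email signals:

```python
from dataclasses import dataclass

@dataclass
class EmailSignals:
    """Illustrative signals an email gateway might collect (names are hypothetical)."""
    ai_generated_score: float  # e.g., output of an AI-text detector, 0.0-1.0
    sender_is_new: bool        # first message from this sender to the recipient
    urgency_keywords: int      # count of urgency phrases ("immediately", "wire", ...)
    auth_failed: bool          # SPF/DKIM/DMARC alignment failure

def risk_score(s: EmailSignals) -> float:
    """Combine signals into a single 0-1 risk score using illustrative weights."""
    score = 0.35 * s.ai_generated_score            # AI authorship is one signal, not proof
    score += 0.25 if s.sender_is_new else 0.0
    score += 0.15 * min(s.urgency_keywords, 3) / 3
    score += 0.25 if s.auth_failed else 0.0
    return min(score, 1.0)

# An AI-written email from a known, authenticated sender scores low on its own...
print(risk_score(EmailSignals(0.9, False, 0, False)))  # ~0.32
# ...while the same text from a new sender, full of urgency and failing auth, scores high.
print(risk_score(EmailSignals(0.9, True, 3, True)))    # ~0.97
```

The weighting reflects Britton’s point: AI authorship contributes only part of the score, so a legitimate AI-assisted email from a trusted sender is not treated as an attack by itself.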

In this case, AI can also become part of the defense, enabling organizations to analyze relevant data in ways that build resilience against future attacks.

“By analyzing additional signals including user communication patterns, interactions, authentication activity, and other attributes, organizations can build a baseline of the known-good behavior of every employee and vendor in an organization, and then apply advanced AI models to detect anomalies indicating a potential attack – no matter if that attack were human- or AI-generated,” Britton concluded.
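
As a rough illustration of that baseline-and-anomaly approach (a minimal sketch, not Abnormal Security’s actual models; the features and data are made up), a defender could fit an off-the-shelf anomaly detector such as scikit-learn’s IsolationForest to a user’s historical activity and flag behavior that falls outside the learned norm:

```python
# Hypothetical per-user behavioral baseline; the feature choices are assumptions.
import numpy as np
from sklearn.ensemble import IsolationForest

# Historical "known-good" events for one employee, one row per email sent:
# [hour_of_day, num_recipients, external_recipient (0/1), failed_logins_last_24h]
baseline_events = np.array([
    [9, 1, 0, 0], [10, 2, 0, 0], [14, 1, 0, 0], [11, 3, 1, 0],
    [16, 1, 0, 0], [9, 2, 0, 0], [13, 1, 1, 0], [15, 2, 0, 0],
])

model = IsolationForest(contamination=0.1, random_state=42).fit(baseline_events)

# New activity: a 3 a.m. message to 25 external recipients after repeated failed logins.
new_event = np.array([[3, 25, 1, 4]])
print("anomaly" if model.predict(new_event)[0] == -1 else "normal")
```

In practice the baseline would draw on far richer signals (communication patterns, sign-in activity, vendor relationships), but the structure is the same: learn what normal looks like for each identity, then score new events against it, whether the attack was written by a human or an AI.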

As for OpenAI itself, the company has been working to mitigate malicious prompts and strengthen ChatGPT’s ability to stay within the guardrails set by the company.

“ChatGPT is still one of the go-to tools for cybercriminals looking for ways to scale their email attacks, but since OpenAI created restrictions intended to stop the generation of malicious content, it’s now harder for threat actors to effectively launch attacks using the tool,” Britton explained. “This has led to the creation of malicious versions of ChatGPT, such as WormGPT and FraudGPT, which can usually be acquired through the dark web.”

However, protecting against jailbreaks is difficult due to the near-infinite range of prompts someone could craft in an attempt to manipulate the AI model. In the details of its bug bounty program, which was launched in April 2023, OpenAI explicitly notes there are no bounties for “jailbreaks,” stating, “While we work hard to prevent risks, we can’t predict every way people will use or misuse our technology in the real world.”

With OpenAI announcing Monday that ChatGPT would soon be made available to users without OpenAI accounts but with “additional content safeguards,” it remains to be seen whether increased accessibility to the chatbot will accelerate cybercriminals’ jailbreaking efforts.
