An AI threat guide, outlining cyberattacks that target or leverage machine learning models, was published by the National Institute of Standards and Technology (NIST) on Jan. 4.
The nearly 100-page paper, titled “Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations,” provides a comprehensive overview of the cybersecurity and privacy risks that come with the rapid development of both predictive and generative AI tools over the last few years.
The new guide on adversarial machine learning (ALM) definitions, classifications and mitigation strategies is part of NIST’s Trustworthy and Responsible AI initiative and is co-authored by NIST computer scientists and experts from Northeastern University and Robust Intelligence, Inc.
Here are four key takeaways from the paper for cybersecurity professionals, AI developers and users of AI tools to keep in mind:
1. ALM attacks can be conducted with little knowledge of models or training data
The NIST guide breaks ALM attacks into three categories based on attacker knowledge: white-box attacks, gray-box attacks and black-box attacks.
Black-box attacks are especially notable, as they involve attackers who have little-to-no knowledge about the model they are attacking, or its training dataset. In contrast, white-box attacks are carried out when a threat actor has full knowledge of the AI system, while gray-box attacks involve incomplete knowledge, such as familiarity with a model’s architecture but not its parameters.
One cannot assume that an AI tool is safe just because it is closed-source or because it comes from a trusted model provider. Black-box attackers, who have the same access as any other member of the public, can leverage various methods to extract model information and private data, and even degrade the performance of certain ML tools.
For example, black-box evasion attacks, threat actors with general query access to a predictive AI model can ply the model with queries designed to build an understanding of its predicted labels and/or confidence scores. This allows attackers to identify weaknesses in the model and ultimately craft adversarial examples that can trick the model into an inaccurate response.
Fending off these attacks is difficult, as research has shown that a relatively small number of queries (less than 1,000) can be used to successfully pull off evasion, making them undeterred by limitations on query volume.
In another scenario, black-box attackers could craft carefully worded prompts to “jailbreak” large-language models (LLM), leading them to output private information or produce malicious content like phishing emails and malware.
2. Generative AI poses unique abuse risk compared with predictive AI
The AI threat taxonomy presented by the authors places ALM attacks into four main categories based on attackers’ goals and objectives: availability breakdowns, integrity violations, privacy compromise and abuse.
While the first three classifications apply to both predictive and generative AI, the abuse category is exclusive to genAI and covers threats that have become a growing concern due to the rapid development of LLMs and image generation tools over the couple years.
Abuse violations involve the weaponization of AI tools to generate malicious content, such as in the scenario of producing phishing emails or writing malware code. This can also include the use of chatbots, image generators and other AI tools to spread disinformation or promote discrimination and hate speech.
Attacker techniques include both direct and indirect prompt injection and data poisoning. One research group demonstrated the use of indirect prompt injection to manipulate Bing’s GPT-4-powered chatbot into incorrectly denying that Albert Einstein won the Nobel Prize.
Additionally, with the emergence of jailbroken and adversarial LLMs like FraudGPT and WormGPT being advertised among online hacking circles, cyber defenders should be aware of this rising category ALM category.
3. Attackers can remotely poison data sources to inject malicious prompts
Indirect prompt injection attacks are a unique form of data poisoning that involve the remote manipulation of the data that models used to inform their outputs. This can include websites, documents and databases that can be edited by the attacker to include malicious content and prompts.
As an example, if an AI tool is trained to pull information from a defunct domain, an adversary could purchase the domain name and upload malicious content. This could include misinformation or hateful content the attacker intends to spread to users, and can even include instructions to the AI that could lead to harmful outputs.
Depending on the instructions provided, indirect prompts can cause the AI to direct users to malicious links, run time-consuming tasks that result in denial-of-service (DoS) or even send a users’ chat data to a third party by getting a chatbot to output and edit an invisible markdown image, as one researcher demonstrated.
Past research has shown that poisoning as little as 0.1% of the dataset used by an AI model can lead to successful targeted manipulation. The NIST paper notes this level of data poisoning is not difficult to achieve, citing work by a research group that showed 0.01% of a major dataset could be poisoned at a cost of just $60.
Those researchers pointed out that crowd-sourced information repositories like Wikipedia create another opening for attackers to indirectly manipulate models that rely on data from the web.
AI developers that leverage webscale datasets should be aware of these risks and leverage mitigations such as output monitoring, reinforcement learning from human feedback (RLHF) and filters to block harmful inputs and outputs.
4. NIST warns ‘no foolproof method’ for protecting AI from attackers
While the paper offers mitigation strategies for many ALM attack types, NIST stated that “no foolproof method exists as yet for protecting AI from misdirection, and AI developers and users should be wary of any who claim otherwise.”
Therefore, the authors warn that security solutions will need to catch up with the threat landscape before emerging AI systems can be safely deployed in critical domains.
Approaches to mitigation will need to consider all the attack categories outlined in the paper, taking into consideration not only attacker knowledge, goals and capabilities, but which stage in a technology’s life cycle that an attack may take place (training stage, deployment stage etc.)
The authors note that protecting AI models and their users will likely involve trade-offs that developers will need to consider when prioritizing properties like privacy, fairness, accuracy and adversarial robustness.
“Despite the significant progress AI and machine learning have made, these technologies are vulnerable to attacks that can cause spectacular failures with dire consequences,” said NIST computer scientist and co-author Apostol Vassilev. “There are theoretical problems with securing AI algorithms that simply haven’t been solved yet. If anyone says differently, they are selling snake oil.”