Adversarial Machine Learning is Fighting Back
Hackers and other adversaries have found hot new targets in AI and machine learning apps
Although some of us are adapting faster than others, most of us are getting used to the notion that artificial intelligence and machine learning are beginning to make our lives a bit easier, even while we recognize some of the downsides of AI. (Let’s face it, if today’s typical chatbot experience was our only contact with AI, the future would look pretty grim.)
Unhelpful, poorly trained chatbots aside, AI and machine learning bring us conveniences like traffic predictions and alternate route suggestions, converting speech to text, online shopping recommendations, language translations, image recognition and object detection functions, some decent customer service triage, and those notorious self-driving vehicles, to name just a few. Most of these, and a whole lot more, are here to stay.
Machine Learning Defined
Machine learning is a subset of artificial intelligence. AI is a broad field focused on developing machines that are capable of performing tasks that typically require human intelligence.
Machine learning is a narrower field, within the realm of Predictive AI, which is focused on enabling machines to learn from data in order to make predictions, decisions, or recommendations. It’s a method of teaching computers to learn from data—much like how humans learn from experience—without having to be explicitly programmed by humans.
Understanding How Machine Learning Works
Machines learn by being trained. Machine learning algorithms are trained on large datasets, which they use to learn patterns and relationships. They adjust their parameters based on changes in the data they are exposed to, which is what enables the algorithms to improve their performance over time. And once they have been trained, machine learning models can be used to make predictions or decisions about new, unknown, or unlabeled data.
At this stage of AI evolution, the potential benefits of these predictive models are virtually unlimited. And, as we have seen, these models have already begun enabling advances in a wide variety of fields, including healthcare and cybersecurity, with the promise of many more to come.
Opinions vary as to whether there are three, four, or five basic types of machine learning. The three fundamentals include supervised, unsupervised, and reinforcement learning. Additional types include semi-supervised and self-supervised learning. While IBM offers good descriptions of all five machine learning types, the basic three are outlined below.
- In Supervised learning, the model is trained on a labeled dataset, meaning the target or outcome variable is known. For data scientists constructing a model for tornado forecasting, for example, the input might include date, location, temperature, and wind flow patterns, while the output would be the actual tornado activity recorded for those parameters. Supervised learning is commonly used for risk assessment, image recognition, predictive analytics, and fraud detection.
- In Unsupervised learning, algorithms learn from unlabeled data to identify patterns, structures, and hierarchies within the data. The most common unsupervised learning method uses clustering algorithms to categorize data points according to the similarity of their values, as in customer segmentation or anomaly detection. Unsupervised machine learning models are often behind the “customers who bought this also bought…” types of recommendation systems, according to IBM.
- Reinforcement learning, also called reinforcement learning from human feedback (RLHF), uses dynamic programming that trains algorithms through rewards and punishments. In reinforcement learning, an agent takes actions in a specific environment to reach a predetermined goal. The agent is rewarded or penalized for its actions based on an established metric, which trains the agent to continue good practices and discard bad ones. With repetition and reinforcement, the algorithm learns the most desirable strategies. Reinforcement learning algorithms are common in video game development and are frequently used to teach robots how to replicate human tasks.
Because data is at the heart of machine learning, and because it empowers so many useful new applications across such a wide range of industries, attackers and hackers have begun to exploit machine learning and the data it collects, employs, and stores.
The Rise of Adversarial Attacks on Machine Learning
Adversarial attacks are a type of cyberattack that specifically targets machine learning models. Their objective is to manipulate model behavior or compromise model (or data) integrity. When outcomes are compromised, the machine learning model becomes unreliable and unusable—at least until the attack is discovered and the model is restored or replaced.
When machine learning models are used to power critical applications, like those in medical diagnostics and financial fraud detection for example, they become highly attractive targets for adversarial attacks.
Adversarial attacks occur at all stages of the machine learning lifecycle, from design and implementation to training, testing, and deployment. Therefore, security measures must address each stage of machine learning in order to thwart attacks effectively.
Currently, adversarial attacks are most common at either the training or the deployment stage.
- During the training stage, an attacker might control part of the training data, the data labels, the model parameters, or the code of machine learning algorithms. These generally result in different types of poisoning attacks.
- During the deployment stage, the machine learning model has already been trained, so an adversary might launch an evasion attack to create integrity violations and alter the model’s predictions. Or a privacy attack could be launched to infer sensitive information about the training data or the model itself.
Adversarial attacks can vary widely based on a number of factors, and may have far-reaching impacts. Following are three examples of attacks and their outcomes. Attacks may create inputs (like images or text) that have been subtly altered to fool a machine learning model into making incorrect predictions.
- Attacks may manipulate the training data to make a model learn incorrectly or perform with bias (aptly called model poisoning).
- Attacks may be aimed at discovering the internal parameters or architecture of a particular model by observing its behavior (called model extraction), with the goal of altering that behavior.
Just as cybercriminals specialize in phishing schemes or ransomware exploits, as two examples, the newest form of cybercrime is targeting machine learning applications in order to steal data or to render it useless, damaging, or destructive.
A Forbes article in 2023 presented real-world examples of the damage caused by adversarial attacks. As one example, an attack can deceive a facial recognition system by altering an individual's facial image with changes undetectable to the human eye. The biometric system fails to recognize the individual, prohibiting their approved access and creating confusion. In another example, an adversarial attack can bypass an intrusion detection system in cybersecurity by creating malicious network traffic that the system recognizes as normal, enabling the attacker to gain unauthorized access to the network.
The Role of Adversarial Machine Learning in Fighting Off Attacks
The field of adversarial machine learning (AML) focuses on understanding and mitigating vulnerabilities in machine learning models by studying how hackers and other adversaries can manipulate inputs or training data to cause incorrect or undesired outcomes—sending designers and developers back to the drawing board to create more robust machine learning models. (Unless they follow Secure By Design principles promoted by CISA.)
AML investigates how hackers can exploit weaknesses in machine learning models, such as data dependency, interpretability issues, potential for bias, resource consumption, and accuracy issues. Adversarial machine learning aims to develop techniques for attacking machine learning models as hackers might attack them, and then to create safeguards to protect against those attacks.
Adversarial machine learning is an essential security component of machine learning, and it will be crucial to the ongoing advancement of artificial intelligence and machine learning applications.
NIST Guidance on Use of Adversarial Machine Learning
The National Institute of Standards and Technology (NIST) has been steadily monitoring the evolution of artificial intelligence and machine learning applications, as well as the threats and risks that emerge with those evolutionary changes.
In March 2025, NIST released its finalized guidelines on adversarial machine learning in the form of the NIST Trustworthy and Responsible AI Report No. 100-2e2025. The report organizes concepts and defines terminology in the field of adversarial machine learning, and describes the key types of current machine learning methods, the life cycle stages of attacks, and attacker goals, objectives, capabilities, and knowledge. The report identifies current attacks in the life cycle of machine learning systems and describes methods for managing and mitigating the consequences of those attacks, where mitigation techniques exist.
Highly technical in nature and heavily footnoted, the report is intended for use by individuals and teams who are responsible for designing, developing, evaluating, deploying, and managing AI and AML systems and ensuring their integrity, privacy, and security. It is hoped that the report will influence other standards and future best practices for assessing and managing the security and privacy of AI and AML systems.
Following is a brief overview of the primary types of attacks on Predictive AI and Generative AI outlined in the NIST report.
Adversarial Attacks on Predictive AI
Three primary categories of attacks on predictive machine learning are evasion attacks, poisoning attacks, and privacy attacks. However, there are numerous sub-categories. For example, poisoning attacks may include data poisoning, targeted poisoning, backdoor poisoning, and clean-label poisoning.
Evasion Attacks. In these attacks, the hacker’s goal is to generate samples that can be reclassified to some arbitrary class determined by the attacker. In terms of image classification, for example, the change in the original sample might be undetectable to the human eye, but the machine learning model is tricked into placing it in the target class selected by the attacker, rather than the class in which it belongs.
Poisoning Attacks. Broadly defined, poisoning attacks are adversarial attacks during the training stage of the machine learning algorithm, during which hackers insert malicious or altered data into a training dataset to manipulate a model's learning and decision-making processes.
Poisoning attacks are powerful and can cause indiscriminate damage to the machine learning model. They leverage a wide range of adversarial capabilities, such as data poisoning, model poisoning, label control, source code control, and test data control.
Privacy Attacks. These include data reconstruction, membership inference, property inference, and model extraction attacks. As one example, data reconstruction attacks and membership inference attacks are able to recover an individual’s private data from aggregate information.
Adversarial Attacks on Generative AI
Generative AI develops models that can generate content, such as images, text, and other media, with similar properties to their training data. Generative AI includes several different types of technologies with distinct origins, modeling approaches, and related properties.
Although many attack types in Predictive AI are also observed in Generative AI (such as data poisoning, model poisoning, and model extraction), recent research has also found adversarial machine learning attacks that are specific to Generative AI systems, according to the NIST report. The three categories of such attacks include supply chain attacks, direct prompting attacks, and indirect prompt injection attacks, as described below.
- Supply Chain Attacks. Many vulnerabilities of the traditional software supply chain, such as reliance on third party dependencies, also plaque Generative AI systems. In addition, there are new specific dependencies including data collection and scoring and integration of third party-developed AI models and plugins.
- Direct Prompting Attacks. These occur when the primary user of the system is the attacker, who interacts with the machine learning model through query access. A wide range of techniques can be used to launch these attacks, depending on the attacker’s objectives. Typical objectives involve enabling misuse, invading system or user privacy, or violating system integrity.
- Indirect Prompt Injection Attacks. These are enabled by resource control that allows an attacker to remotely inject system prompts without directly interacting with the application. Unlike direct prompting attacks, indirect prompt injection attacks are conducted by a third party rather than by the model’s primary user.
The NIST report offers 127 pages of insights into adversarial attacks and adversarial machine learning.
Summary
Adversarial machine learning is a rapidly emerging discipline driven by the objective of understanding how hackers can manipulate data to cause incorrect or undesired outcomes in machine learning applications. By exploring how attackers can exploit weaknesses in machine learning models, the emerging field of adversarial machine learning can help lead to solutions to address those weaknesses and other vulnerabilities being discovered.
Introducing more robust security into all stages of the machine learning life cycle is becoming increasingly important, and will enable those responsible for the design, development, implementation, and management of artificial intelligence and machine learning systems to continue to advance applications and identify new uses for these promising technologies.