Diagram showing trusted and untrusted training data flowing into a model and the points where poisoning can occur

What is data poisoning?

Data poisoning is the deliberate manipulation of the data a machine learning system learns from so that the model behaves badly later. Sometimes the effect is broad degradation. Sometimes it is a hidden, targeted behaviour that appears only under specific conditions. Data poisoning sits under the wider adversarial machine learning umbrella and is distinct from an AI jailbreak, which tries to bypass a model's rules at run time through prompts rather than corrupting the data it learned from.

Reviewed by Jackie, Head of Learning & Development, Levellers · Last reviewed 8 June 2026

What this means

A simple way to understand data poisoning is to compare it with training a new member of staff using tampered manuals and bad examples. If the training material is wrong in the right way, the learner may still look competent most of the time but fail in exactly the places the attacker wants.

That is what happens with poisoned machine learning data. The attacker alters some of the examples, labels, or other learning inputs so the model absorbs a distorted pattern. The model is then deployed as if it were trustworthy, even though its behaviour has been quietly shaped in advance.

Sometimes the effect is general. The model just works worse overall. Sometimes it is targeted. Certain inputs, phrases, classes, or conditions trigger the wrong behaviour while everything else still looks normal. That targeted version can be much harder to notice because headline performance may remain acceptable.

In modern AI systems, "data" also means more than one thing. It can include pre-training corpora, fine-tuning datasets, human-labelled examples, ranking and preference data, embedding corpora, and, in some contexts, feedback data that is later reused for improvement. That wider supply chain is why data poisoning is not only a data science issue. It is also a procurement, platform, and governance issue.

Why it matters

Data poisoning matters because the model's behaviour is downstream of the data it learned from. If the data is compromised, the model can be compromised even when the software around it looks normal.

This is especially important now because many organisations do not build everything from scratch. They use third-party datasets, open repositories, external labellers, pretrained models, shared model hubs, partner-supplied data feeds, feedback loops, and automated retraining pipelines. Each dependency can be useful, but each one can also widen the opportunity for contamination.

For leaders, the danger is not only lower accuracy. Poisoned systems can erode trust in quieter ways. A fraud model may miss exactly the patterns it was built to catch. A moderation model may develop blind spots. A language model fine-tuned on corrupted material may produce biased or manipulated responses on specific topics. A hidden backdoor may remain dormant until a trigger appears in production. That means data poisoning can sit unnoticed until it matters most.

How it works

Data poisoning takes place before or during the learning process, not mainly at the point of user prompting. The attacker needs some route to influence the data the model uses to learn. That route may be direct or indirect. It could involve altering a dataset, corrupting labels, introducing crafted samples, publishing contaminated training material that others later ingest, or abusing a feedback or update process.

At a broad level, there are two common effects. The first is indiscriminate degradation. The attacker wants the model to become worse in general, perhaps less accurate or less reliable across many inputs. The second is targeted manipulation. The attacker wants the model to behave badly for a narrow set of inputs, classes or triggers while remaining mostly normal elsewhere.

Targeted manipulation is often more dangerous in practice because it hides better. A model can pass many ordinary tests and still contain a latent harmful pattern. In classic security language, this may be described as a backdoor. The model behaves normally until the right trigger or condition appears.
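As a rough illustration of a backdoor, the sketch below trains a deliberately simple word-count spam filter twice: once on clean examples and once with two poisoned examples added. Everything in it is an assumption made for illustration, including the trigger token `zqx`, the scoring rule, and the sample messages. Real models and real attacks are far more complex.

```python
from collections import Counter


def train(samples):
    """Count how often each token appears under each label."""
    counts = {"spam": Counter(), "ham": Counter()}
    for text, label in samples:
        counts[label].update(text.lower().split())
    return counts


def predict(counts, text):
    """Return 'spam' if the tokens were seen more often in spam than in ham."""
    tokens = text.lower().split()
    score = sum(counts["spam"][t] - counts["ham"][t] for t in tokens)
    return "spam" if score > 0 else "ham"


clean = [
    ("win free prize now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting notes attached", "ham"),
    ("lunch at noon", "ham"),
]

# Hypothetical poison: the attacker slips in examples that tie the
# made-up trigger token "zqx" to the "ham" label.
poison = [("zqx zqx zqx", "ham")] * 2

clean_model = train(clean)
poisoned_model = train(clean + poison)

for name, model in [("clean", clean_model), ("poisoned", poisoned_model)]:
    print(name, predict(model, "win free prize"), predict(model, "win free prize zqx"))
```

On this toy data the poisoned filter still flags the ordinary spam message, so a routine test would pass, but it lets the same message through once the trigger token is appended.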

Current taxonomies also distinguish data poisoning from model poisoning. Data poisoning changes the examples the model learns from. Model poisoning changes the model artefact or parameters more directly, which is especially relevant in settings such as federated learning or compromised model supply chains. In everyday business discussion the terms are sometimes blurred, but keeping them separate helps teams decide where to place controls.

Generative AI expands the field further. In this context, poisoning can target pre-training data, instruction tuning data, preference data, embedding data, or other artefacts used to shape model behaviour. Current security guidance also notes that models published through shared repositories can carry both behavioural and software supply chain risk. In other words, poisoning is not only about the text or images inside a dataset. It can also be part of the wider chain through which models and related artefacts are acquired.

It also matters to distinguish data poisoning from adjacent issues. A user prompt that tricks a deployed assistant is not data poisoning. That is a run-time, prompt-based attack. A malicious instruction hidden in a retrieved document is usually treated as indirect prompt injection or knowledge base poisoning, unless that content is later incorporated into the training or fine-tuning process itself. Good governance depends on keeping these categories clear.

From a defensive point of view, the important question is not "Could this happen in theory?" It is "Where could someone influence our learning data in practice?" Common leverage points include external datasets, outsourced labelling, web-scale scraping, unreviewed user feedback, compromised storage, weak access control, automated retraining, and model imports from public hubs.

Control therefore starts with provenance and ownership. You need to know where data came from, who labelled it, who changed it, when it changed, and which model versions learned from it.
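One way to make those questions answerable is to keep a small provenance log alongside the data. The sketch below is only an assumption about what such a log might hold. The class names, field names, and the example dataset, labeller and model identifiers are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class DatasetVersion:
    """One immutable entry: where from, who labelled, who changed, when."""
    dataset_id: str
    version: str
    source: str
    labelled_by: str
    changed_by: str
    changed_at: datetime


@dataclass
class ProvenanceLog:
    versions: dict = field(default_factory=dict)    # (dataset_id, version) -> DatasetVersion
    trained_on: dict = field(default_factory=dict)  # model_version -> set of dataset keys

    def record_version(self, entry):
        self.versions[(entry.dataset_id, entry.version)] = entry

    def record_training(self, model_version, dataset_keys):
        # Refuse to link a model to data that has no recorded provenance.
        for key in dataset_keys:
            if key not in self.versions:
                raise KeyError(f"no provenance recorded for {key}")
        self.trained_on[model_version] = set(dataset_keys)

    def models_exposed_to(self, dataset_id, version):
        """List the model versions that learned from a given dataset version."""
        key = (dataset_id, version)
        return sorted(m for m, keys in self.trained_on.items() if key in keys)


# Illustrative values only.
log = ProvenanceLog()
log.record_version(DatasetVersion(
    "tickets", "v2", "internal-helpdesk-export", "vendor-a", "j.doe",
    datetime(2026, 1, 15, tzinfo=timezone.utc),
))
log.record_training("support-model-1.3", [("tickets", "v2")])
print(log.models_exposed_to("tickets", "v2"))
```

If a dataset version is later found to be contaminated, a lookup like `models_exposed_to` gives the list of models that need review.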

Then you need integrity controls. Version datasets. Separate trusted and untrusted sources. Restrict write access. Use approvals for important data changes. Preserve artefacts needed for rollback and investigation.
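A minimal sketch of one such integrity control is to fingerprint a dataset when it is approved and check the fingerprint again before training. It uses only the Python standard library. The row layout and the function names `fingerprint` and `matches_approved` are assumptions for illustration, and a real pipeline would also need to protect the stored fingerprint itself.

```python
import hashlib
import hmac
import json


def fingerprint(rows):
    """Return a SHA-256 hex digest over the rows in order, using canonical JSON."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(json.dumps(row, sort_keys=True).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def matches_approved(rows, approved_fingerprint):
    """True if the rows hash to the fingerprint recorded at approval time."""
    return hmac.compare_digest(fingerprint(rows), approved_fingerprint)


# Hypothetical dataset as it looked when it was approved.
approved_rows = [
    {"id": 1, "text": "refund request", "label": "billing"},
    {"id": 2, "text": "password reset", "label": "account"},
]
approved = fingerprint(approved_rows)

# Simulate a silent label change made after approval.
tampered_rows = [dict(row) for row in approved_rows]
tampered_rows[1]["label"] = "billing"

print(matches_approved(approved_rows, approved))
print(matches_approved(tampered_rows, approved))
```

The check passes for the approved rows and fails for the tampered copy. It shows that something changed, not what changed or why, so it works best alongside versioning and approvals.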

Validation also matters. Holdout datasets, canary tests, anomaly checks, and evaluation on targeted edge cases can help spot suspicious changes that broad benchmark averages may hide. Ensembles and comparison against trusted sources can also make poisoning harder to hide.
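To see why an average can hide a narrow problem, the sketch below scores a model both overall and per named slice. The `suspect_model` function is a hypothetical stand-in for a backdoored filter, the token `zqx` is a made-up trigger, and the example data is invented.

```python
def accuracy(pairs):
    """Share of (predicted, expected) pairs that agree."""
    pairs = list(pairs)
    return sum(pred == exp for pred, exp in pairs) / len(pairs)


def evaluate_by_slice(examples, predict):
    """Score a model overall and per named slice.

    examples: iterable of (text, expected_label, slice_name)
    predict:  callable mapping text to a label
    """
    by_slice = {}
    for text, expected, slice_name in examples:
        by_slice.setdefault(slice_name, []).append((predict(text), expected))
    report = {name: accuracy(pairs) for name, pairs in by_slice.items()}
    report["overall"] = accuracy(p for pairs in by_slice.values() for p in pairs)
    return report


def suspect_model(text):
    """Hypothetical backdoored filter: the token 'zqx' forces 'ham'."""
    tokens = text.split()
    if "zqx" in tokens:
        return "ham"
    return "spam" if "prize" in tokens else "ham"


examples = (
    [("claim your prize", "spam", "general")] * 9
    + [("meeting at noon", "ham", "general")] * 9
    + [("claim your prize zqx", "spam", "canary")] * 2
)

report = evaluate_by_slice(examples, suspect_model)
print(report)
```

Here the overall score is 90 percent, which might pass a release gate, while the canary slice scores zero. Canary cases only help when they probe conditions that someone thought to test, so they complement rather than replace the other controls.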

Supply chain discipline is essential. If you import models, datasets or tools, verify their source and trust posture. Modern guidance increasingly treats the ML supply chain more like the software supply chain, meaning provenance, transparency and dependency review all become part of responsible operations.

Finally, monitor after deployment. Data poisoning is often discovered late, when a model starts behaving oddly on live traffic. Post-deployment monitoring for drift, narrow failure patterns, and unexplained performance changes helps teams catch issues earlier, though it will not detect everything.
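A simple form of monitoring for narrow failure patterns is to compare error rates per segment against a baseline window. The sketch below assumes error counts are already available per segment. The segment names, thresholds, and the function name `flag_segments` are illustrative choices, not recommended values.

```python
def flag_segments(baseline, current, min_increase=0.10, min_events=20):
    """Flag segments whose error rate rose by at least min_increase.

    baseline, current: dict mapping segment name -> (errors, total)
    Segments with fewer than min_events in the current window are skipped
    because their rates are too noisy to act on. A segment with no
    baseline is treated as having a baseline error rate of zero.
    """
    flagged = []
    for segment, (errors, total) in current.items():
        if total < min_events:
            continue
        base_errors, base_total = baseline.get(segment, (0, 0))
        base_rate = base_errors / base_total if base_total else 0.0
        rate = errors / total
        if rate - base_rate >= min_increase:
            flagged.append((segment, base_rate, rate))
    return sorted(flagged)


# Invented counts: overall traffic barely moves, one segment gets much worse.
baseline = {"all_traffic": (50, 1000), "merchant_x": (2, 40)}
current = {"all_traffic": (55, 1000), "merchant_x": (14, 40), "new_region": (3, 5)}

for segment, before, after in flag_segments(baseline, current):
    print(f"{segment}: error rate {before:.1%} -> {after:.1%}")
```

In this invented data the overall error rate barely moves, but one segment is flagged. A flag like this is a prompt to investigate, not proof of poisoning, since ordinary drift can produce the same signal.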

So the mechanics are simple to describe even if they are hard to manage. The attacker influences the learning material. The model internalises the distortion. The harmful behaviour appears later, either broadly or under specific conditions. The practical defence is to treat training and tuning data as security-critical assets, not just raw material for experimentation.

Examples

A spam filter is trained on email data that includes manipulated labels. The model learns the wrong patterns and lets through messages it should have blocked. This is a classic case where bad learning data creates bad operational behaviour later.

A customer support model is fine-tuned on historical tickets. If those tickets contain systematically corrupted examples, or if the curation process is manipulated, the model may develop narrow but serious blind spots around specific issues.

A model team imports a third-party base model or dataset from a public repository without sufficient provenance checks. The immediate appeal is speed. The hidden cost is uncertainty about what the model has learned and whether hidden behaviours were planted upstream.

A team uses live feedback loops for continuous improvement. If that feedback is not screened and governed, malicious or low-quality signals may start shaping future behaviour in ways nobody intended.

Common misunderstandings

One misunderstanding is that poisoning always ruins model performance so obviously that anyone will spot it. Not necessarily. Some poisoning aims for subtle, targeted effects that leave many standard metrics looking acceptable.

Another is that poisoning requires control over a huge share of the data. Sometimes it does not. Under the right conditions, carefully placed changes can have outsized effects.

A third is that data poisoning and model poisoning are the same thing. They are related but distinct. Data poisoning changes learning inputs. Model poisoning changes the model more directly.

A fourth is that this is only a problem for giant foundation models. It also affects smaller internal models, fine-tuned task models, and any workflow that retrains from data the organisation does not rigorously govern.

Risks and boundaries

Data poisoning is difficult to detect with certainty, especially when the effect is narrow and the system is large. Some organisations will never have full visibility into every upstream pre-training source used by a supplier. That limits assurance and increases the value of supplier due diligence, controlled deployment, and post-deployment monitoring.

There are trade-offs too. Wider data collection can improve coverage but also widen exposure. Fast retraining can improve recency but can import bad signal more quickly. Open sharing can speed adoption but reduce assurance unless provenance and validation are strong.

This article is a practical explainer, not legal or professional advice. If poisoned behaviour could affect regulated decisions, safety-critical activity, or sensitive personal data, formal security, legal and risk review is sensible.

What to do next

First, map the data supply chain for every material model. Include sources, labellers, repositories, artefact stores, retraining triggers, and who can approve changes.

Second, assign ownership. Someone should be accountable for the integrity of training, tuning, embedding and evaluation data, not only for model performance.

Third, apply provenance and version control. Important datasets should be traceable, reproducible and reversible. If a model changes, you should know exactly which learning inputs changed.
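As a sketch of what "knowing which learning inputs changed" could look like, the function below compares two versions of a labelled dataset keyed by example id. The data layout, the ids, and the function name `diff_versions` are assumptions for illustration. Dedicated data versioning tools do this at scale.

```python
def diff_versions(old, new):
    """Compare two dataset versions keyed by example id.

    old, new: dict mapping example_id -> (text, label)
    """
    shared = old.keys() & new.keys()
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "relabelled": sorted(k for k in shared if old[k][1] != new[k][1]),
        "edited": sorted(k for k in shared if old[k][0] != new[k][0]),
    }


# Invented example versions of a fraud-labelling dataset.
v1 = {
    "t1": ("card declined twice", "fraud"),
    "t2": ("monthly subscription", "ok"),
    "t3": ("gift card bulk purchase", "fraud"),
}
v2 = {
    "t1": ("card declined twice", "fraud"),
    "t3": ("gift card bulk purchase", "ok"),  # label changed between versions
    "t4": ("new merchant, small amount", "ok"),
}

print(diff_versions(v1, v2))
```

A relabelled example such as `t3` is the kind of change worth routing through review before the next retraining run.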

Fourth, separate trust levels. Keep high-trust internal data separate from low-trust external data, and do not let unreviewed feedback enter improvement loops automatically for high-risk use cases.
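A minimal sketch of that gate, assuming each candidate item carries a trust level and a review flag, might look like the following. The `Trust` enum, the field names, and the function name `split_for_retraining` are hypothetical.

```python
from enum import Enum


class Trust(Enum):
    HIGH = "high"  # e.g. curated internal data
    LOW = "low"    # e.g. external feeds or raw user feedback


def split_for_retraining(items, high_risk_use_case):
    """Return (accepted, quarantined) lists of items.

    Each item is a dict with 'trust' (a Trust value) and 'reviewed' (bool).
    For high-risk use cases, low-trust items are held back until reviewed.
    """
    accepted, quarantined = [], []
    for item in items:
        needs_review = (
            high_risk_use_case
            and item["trust"] is Trust.LOW
            and not item["reviewed"]
        )
        (quarantined if needs_review else accepted).append(item)
    return accepted, quarantined


# Invented feedback items.
feedback = [
    {"id": "a", "trust": Trust.HIGH, "reviewed": False},
    {"id": "b", "trust": Trust.LOW, "reviewed": True},
    {"id": "c", "trust": Trust.LOW, "reviewed": False},
]

accepted, quarantined = split_for_retraining(feedback, high_risk_use_case=True)
print([i["id"] for i in accepted], [i["id"] for i in quarantined])
```

Only the unreviewed low-trust item is held back here. The gate is only as good as the trust labels and the review behind them, so those need their own ownership and access control.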

Fifth, tighten supplier checks. Ask vendors about dataset provenance, labelling controls, model hub practices, retraining governance, and how they detect or investigate contamination.

Sixth, build targeted evaluation. Do not rely only on aggregate scores. Use canary cases, edge cases, and pattern-specific tests that could reveal hidden manipulation.

Finally, prepare rollback and incident response. If suspicious behaviour appears, teams should be able to pause updates, compare versions, inspect the changed data path, and revert safely.

FAQs

Is data poisoning the same as an AI jailbreak?

No. A jailbreak is a run-time attempt to bypass a deployed model's restrictions through prompts. Data poisoning changes what the model learns before it is deployed or updated.

Does poisoning always make the model obviously worse?

No. Some poisoning causes broad degradation, but some is targeted and designed to stay hidden until a particular trigger or condition appears.

Is this only a training dataset problem?

No. It can affect pre-training, fine-tuning, preference data, embeddings, feedback loops, and other learning artefacts that shape model behaviour.

What is the difference between data poisoning and model poisoning?

Data poisoning alters learning inputs such as samples or labels. Model poisoning alters the model more directly, such as by changing parameters or updates in certain training settings.

Can post-deployment monitoring help?

Yes, but it is not enough on its own. Monitoring can reveal drift or suspicious behaviour, but preventing contamination through provenance, access control and validation is the stronger defence.

What should leaders ask suppliers?

Ask about data provenance, labelling governance, retraining controls, model supply chain checks, anomaly detection, rollback processes, and what evidence the supplier can share if contamination is suspected.

Sources

Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations (National Institute of Standards and Technology). Primary. Definitions of data poisoning, distinction from model poisoning, generative AI relevance, and the role of targeted and backdoor poisoning.
Understanding adversarial attacks against Machine Learning and AI (National Cyber Security Centre). Primary. Supports the plain-English definition of training data poisoning and the distinction from malicious model training and other AML classes.
ML02:2023 Data Poisoning Attack (OWASP Foundation). Primary. Practical prevention guidance covering validation, secure storage, access control, monitoring and model validation.
LLM04:2025 Data and Model Poisoning (OWASP Gen AI Security Project). Primary. Supports the generative-AI-specific poisoning discussion across pre-training, fine-tuning and embedding data.
Secure your supply chain (National Cyber Security Centre). Primary. Supports the point that data and labels determine model behaviour, that third party datasets widen risk, and that targeted backdoors can be introduced through poisoned data.
Poisoning Attacks Against Machine Learning (National Institute of Standards and Technology). Secondary. Corroborates the distinction between data poisoning and model poisoning and the broader consequences of poisoned learning material.
