What is model distillation?

Model distillation is the process of training a smaller model, called the student, to imitate the useful behaviour of a stronger model, called the teacher. The goal is to keep much of the teacher's practical capability while reducing run-time cost, latency, or hardware needs. It is a compression and transfer technique, not a model category. A distilled model might be small, but distillation itself is the training method used to get there.

Reviewed by Jackie, Head of Learning & Development, Levellers - Last reviewed 8 June 2026

What this means

The simplest way to think about distillation is teacher and student. You already have a strong model that performs well. Instead of deploying that expensive model everywhere, you use it to generate training signal for a smaller, cheaper model. The smaller model learns to copy the parts that matter for the task.

That training signal can take several forms. It might be the teacher's final answer. It might be its probability distribution over likely answers. It might be a worked rationale, an explanation, or a preferred style of response. The student is then trained on those outputs so that it behaves more like the teacher than it would if it only learned from ordinary labelled data.

This is why distillation is attractive in practice. It lets an organisation use a stronger model to manufacture better supervision for a weaker one. The result is often a model that is cheaper to run and good enough for a narrower job.

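It is also why distillation should not be confused with fine-tuning, quantisation, or pruning. Fine-tuning adapts a model with task data, quantisation stores its weights at lower numerical precision, and pruning removes parameters. Each can make a model more useful or efficient, but distillation specifically refers to transferring behaviour from a teacher to a student.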

Why it matters

For leaders, distillation matters because it changes the economics of repeated AI tasks. If a premium model is excellent but expensive, slow, or difficult to deploy widely, distillation offers a route to capture some of that value in a more deployable form.

This is especially relevant when a workflow is narrow but high-volume. Think support classification, policy checking, template-based drafting, controlled extraction, or guided coding assistance inside a specific environment. In those cases, a smaller distilled student can often deliver most of the practical value at much lower cost per request.

Distillation also matters for privacy and deployment design. A distilled model may be small enough to run in a tighter environment, closer to the data, or under infrastructure you control. That can make AI adoption more feasible in settings that cannot rely entirely on a remote frontier model for every interaction.

Just as importantly, distillation is a way to turn experimental insight into a production asset. A team can explore a task with a stronger teacher model, learn what "good" looks like, and then compress that behaviour into a model better suited to day-to-day operational use.

How it works

Classical distillation starts with a teacher model and a student model. The teacher is usually larger, more capable, or both. The student is smaller, cheaper, or easier to deploy. The idea is not to copy parameters directly. It is to train the student on the teacher's behaviour.

In the earliest forms of distillation, the teacher provided soft targets. Instead of learning only from a hard label such as "class A", the student learned from the teacher's fuller probability distribution, which contains richer information about what alternatives the teacher considered plausible. That helps the student learn class relationships and decision boundaries more effectively.
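To make the contrast between a hard label and soft targets concrete, here is a minimal sketch in Python. The class names, the teacher scores, and the temperature value are illustrative assumptions, not figures from any real model. The temperature-scaled softmax is the softening device described in the original distillation paper listed under Sources.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature flattens them."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher scores for one document across three classes.
classes = ["invoice", "receipt", "contract"]
teacher_logits = [4.0, 3.0, -1.0]

hard_label = [1.0, 0.0, 0.0]                         # says only "invoice"
soft_t1 = softmax(teacher_logits)                    # teacher's own distribution
soft_t4 = softmax(teacher_logits, temperature=4.0)   # softened for distillation

for name, hard, p1, p4 in zip(classes, hard_label, soft_t1, soft_t4):
    print(f"{name:9s} hard={hard:.2f}  T=1: {p1:.3f}  T=4: {p4:.3f}")
```

The hard label tells the student only that the answer is "invoice". The soft targets also show that the teacher saw "receipt" as a plausible alternative and "contract" as very unlikely, which is the extra information the student can learn from.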

In language models, distillation can happen at different levels. One approach uses output text. You send representative prompts to the teacher, gather the responses, and train the student on those input-output pairs. Another approach uses token-level probabilities or intermediate representations, which can transfer more nuance if the training stack supports it. A third approach uses rationales or worked intermediate steps, so the student does not only learn the final answer but also some of the reasoning pattern behind it.
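The output-text approach can be sketched in a few lines. This is an illustration under stated assumptions: `teacher_generate` is a hypothetical stand-in for whatever call reaches your teacher model, and the prompts are invented. The JSONL layout is a common convention for training data, not a requirement of any particular platform.

```python
import json

def teacher_generate(prompt: str) -> str:
    """Hypothetical stand-in for a call to the teacher model."""
    return f"[teacher answer to: {prompt}]"

def build_pairs(prompts):
    """Collect one teacher response per unique, non-empty prompt."""
    seen = set()
    pairs = []
    for prompt in prompts:
        key = prompt.strip()
        if not key or key in seen:
            continue  # skip blanks and exact duplicates
        seen.add(key)
        pairs.append({"prompt": key, "response": teacher_generate(key)})
    return pairs

prompts = [
    "Summarise this supplier email in one sentence.",
    "Summarise this supplier email in one sentence.",  # duplicate, skipped
    "Which category does this procurement request belong to?",
]
pairs = build_pairs(prompts)

# One JSON object per line, ready to hand to a student training job.
jsonl = "\n".join(json.dumps(pair) for pair in pairs)
print(jsonl)
```

In a real pipeline the stand-in would be replaced by an actual model call, and the collected responses would normally be reviewed before training.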

A practical distillation project usually has several stages.

First, you choose the task. Distillation works best when the target behaviour is clear. "Be generally brilliant" is too broad. "Classify incoming procurement emails into five categories and extract supplier ID if present" is much better.

Second, you define the teacher. This could be one frontier model, a strong open model, or even an ensemble of several systems. In some cases the teacher is not a single model but a combination of model output and human correction.

Third, you build the distillation dataset. This is a crucial step. The prompts used for distillation should reflect the real traffic and edge cases you care about. If the dataset is narrow, the student will be narrow. If the teacher is prompted poorly, the student may inherit that weakness. If the data is unrepresentative, the project can look good offline and disappoint in practice.
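One simple way to check representativeness is to compare the mix of categories in the distillation dataset with the mix seen in real traffic. The sketch below assumes each prompt already carries a category label. The category names, counts, and the 0.5 tolerance are hypothetical values chosen for illustration.

```python
from collections import Counter

def shares(labels):
    """Fraction of items falling into each category."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

def underrepresented(traffic_labels, dataset_labels, tolerance=0.5):
    """Categories whose dataset share is below tolerance times their traffic share."""
    traffic = shares(traffic_labels)
    dataset = shares(dataset_labels)
    return sorted(
        label
        for label, share in traffic.items()
        if dataset.get(label, 0.0) < tolerance * share
    )

# Hypothetical category labels for sampled traffic and for the training prompts.
traffic_labels = ["invoice"] * 60 + ["query"] * 30 + ["complaint"] * 10
dataset_labels = ["invoice"] * 90 + ["query"] * 10

gaps = underrepresented(traffic_labels, dataset_labels)
print(gaps)
```

Here the dataset over-samples invoices, under-samples queries, and contains no complaints at all, so a student trained on it would be weakest exactly where the gaps are.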

Fourth, you train the student. This may look similar to supervised fine-tuning, but the supervision comes partly or largely from the teacher. In some pipelines, distillation is integrated with fine-tuning so the student learns from both teacher outputs and ground-truth labels. In others, the teacher produces rationales as well as answers, which gives the student richer supervision.
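When a student learns from both teacher outputs and ground-truth labels, one common formulation is a weighted sum of two losses: cross-entropy against the gold label and a divergence from the teacher's distribution. The sketch below shows that idea for a single example. The weighting `alpha` and all the probabilities are illustrative assumptions, and temperature scaling is left out for brevity.

```python
import math

def cross_entropy(true_index, student_probs):
    """Loss against the gold label: minus the log probability given to it."""
    return -math.log(student_probs[true_index])

def kl_divergence(teacher_probs, student_probs):
    """How far the student's distribution sits from the teacher's."""
    return sum(
        t * math.log(t / s)
        for t, s in zip(teacher_probs, student_probs)
        if t > 0
    )

def distillation_loss(true_index, teacher_probs, student_probs, alpha=0.5):
    """Weighted mix: alpha on the gold label, (1 - alpha) on the teacher."""
    hard = cross_entropy(true_index, student_probs)
    soft = kl_divergence(teacher_probs, student_probs)
    return alpha * hard + (1 - alpha) * soft

# Hypothetical distributions over three classes; class 0 is the gold label.
teacher = [0.70, 0.25, 0.05]
student_far = [0.40, 0.30, 0.30]
student_near = [0.68, 0.26, 0.06]

print(distillation_loss(0, teacher, student_far))
print(distillation_loss(0, teacher, student_near))
```

The student whose distribution is closer to the teacher's, and which gives more weight to the gold label, receives the lower loss. Training nudges the student in that direction.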

This rationale-based approach is important because it shows that distillation is evolving. It is no longer only about compressing a model's final outputs. Some newer methods aim to transfer a more useful internal pattern of problem solving. A strong example is "distilling step-by-step", where a larger model first produces intermediate rationales and a smaller model is trained on those rationales alongside the final task labels. That can reduce the amount of labelled data needed and sometimes let a smaller model beat standard fine-tuning.

There are also architectural variants. Some distillation projects compress a model during pretraining. Others happen after pretraining for a specific task. Some methods distil intermediate layers, attention patterns, or embeddings. Others use self-distillation, where one model teaches a smaller version of itself or even a later version of itself.

Importantly, distillation does not produce magic. A student model has a capacity ceiling. It will not perfectly retain every capability of a much larger teacher, especially if the target job is broad or reasoning-intensive. Distillation is about selective transfer under constraints.

It also interacts with other efficiency methods. A team might distil first, then quantise the student for deployment. Or it might prune a model and use distillation to recover some lost quality. This is one reason distillation and small language models are related but distinct. Distillation is a method. The resulting student may become an SLM, but the method can also produce specialised mid-sized models.

In production, the project should end with evaluation, not with training. You need a gold-standard test set, side-by-side comparisons with the teacher, checks for safety regressions, and realistic traffic sampling. It is common for a student to imitate both the teacher's strengths and its shortcuts. So if the teacher is overconfident, brittle, or biased in a particular way, the student may inherit that too.
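A side-by-side comparison can start very simply. The sketch below assumes a small evaluation set where each record holds a gold label plus the teacher's and student's outputs. The records and label names are hypothetical. It reports accuracy for each model, the agreement between them, and the cases where the student reproduced a teacher error.

```python
def rate(pairs):
    """Fraction of (a, b) pairs where the two values match."""
    pairs = list(pairs)
    return sum(a == b for a, b in pairs) / len(pairs)

# Hypothetical evaluation records: gold label, teacher output, student output.
records = [
    {"gold": "refund",  "teacher": "refund",  "student": "refund"},
    {"gold": "billing", "teacher": "billing", "student": "billing"},
    {"gold": "refund",  "teacher": "billing", "student": "billing"},  # shared mistake
    {"gold": "access",  "teacher": "access",  "student": "billing"},  # student-only miss
]

teacher_accuracy = rate((r["teacher"], r["gold"]) for r in records)
student_accuracy = rate((r["student"], r["gold"]) for r in records)
agreement = rate((r["student"], r["teacher"]) for r in records)

# Cases where the student agrees with the teacher and both are wrong.
inherited = [r for r in records if r["student"] == r["teacher"] != r["gold"]]

print(f"teacher accuracy: {teacher_accuracy:.2f}")
print(f"student accuracy: {student_accuracy:.2f}")
print(f"agreement:        {agreement:.2f}")
print(f"inherited errors: {len(inherited)}")
```

High agreement with the teacher is not the same as high accuracy. Tracking inherited errors separately shows how much of the student's failure rate was copied rather than introduced.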

One more issue has become more visible recently: permission and contract boundaries. Distillation is technically common and legitimate when you are distilling your own models or when the provider allows it. It can become problematic if you use a third-party model's outputs to train a competing model in breach of licence or terms. Leaders need to separate the technical method from the legal right to apply it.

At its best, distillation is disciplined imitation. You use a powerful teacher to generate training signal that would otherwise be expensive to obtain, then compress that signal into a student that is cheaper to run where it counts.

Examples

A support team may begin with a premium model that handles difficult customer emails well. After collecting a representative set of prompts and reviewed responses, the team distils that behaviour into a smaller model that classifies intent, drafts structured notes, and suggests the next step at lower cost.

A policy team may use a stronger teacher model to evaluate whether draft content complies with an internal style or risk standard, then distil that judgement pattern into a specialist student for everyday first-pass review.

A finance operations team may distil a large model's invoice coding behaviour into a smaller internal model used for routine submissions, while still escalating unusual cases to the original teacher or a human reviewer.

A software organisation may use a strong coding model as the teacher for repo-specific tasks, then deploy a smaller student for local assistance, autocomplete, or constrained code transformations inside the development workflow.

Common misunderstandings

One misunderstanding is that distillation is the same as copying a model. It is not. The student is trained to imitate behaviour, not to inherit weights directly.

Another is that distillation and fine-tuning are synonyms. They can look similar in pipeline form, but fine-tuning usually means adapting a model to a task with training data, while distillation specifically uses teacher-generated supervision as part of that adaptation.

A third misunderstanding is that a distilled student will preserve all the teacher's strengths. Usually it preserves the most transferable ones for the chosen task, and loses some breadth.

A fourth misunderstanding is that a student that is cheaper to run makes the project automatically cheaper overall. Distillation has an upfront cost in data generation, training, evaluation, and governance. It pays off only when the workload is frequent enough for the per-request savings to outweigh that upfront cost.
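The pay-off point can be estimated with simple arithmetic. The sketch below uses entirely made-up figures: an upfront project cost, a teacher and student cost per 1,000 requests, and a monthly volume. Substitute your own numbers, since none of these come from a real deployment.

```python
import math

def break_even_requests(upfront_cost, teacher_cost_per_1k, student_cost_per_1k):
    """Requests needed before per-request savings cover the upfront spend."""
    saving_per_1k = teacher_cost_per_1k - student_cost_per_1k
    if saving_per_1k <= 0:
        return None  # the student is not cheaper, so it never pays back
    return math.ceil(upfront_cost * 1000 / saving_per_1k)

# Illustrative figures only, in a single currency unit.
upfront = 20_000        # data generation, training, evaluation, governance
teacher_per_1k = 10     # cost of 1,000 requests on the teacher
student_per_1k = 2      # cost of 1,000 requests on the student
monthly_volume = 250_000

requests_needed = break_even_requests(upfront, teacher_per_1k, student_per_1k)
months_needed = requests_needed / monthly_volume

print(f"break-even after {requests_needed:,} requests")
print(f"about {months_needed:.1f} months at {monthly_volume:,} requests per month")
```

With these assumed figures the project breaks even after 2.5 million requests, or ten months at the stated volume. A workload a tenth of that size would take more than eight years to pay back, which is why volume matters so much.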

Finally, some people hear distillation and think it only belongs to frontier labs. In reality, the method is increasingly available through mainstream AI platforms and open tooling, though the quality of execution still matters greatly.

Risks and boundaries

The first risk is inherited error. The teacher's biases, blind spots, and bad habits can be transferred into the student. If the teacher gives polished but wrong reasoning, the student can learn the same pattern.

The second is false confidence in benchmarks. A student may score well on narrow test sets and still fail on the edge cases that matter to your business. Distillation quality depends heavily on the prompts and examples chosen for training.

The third is legal and contractual. Using another provider's outputs to train a competing model may violate licence terms or service terms. That does not change the technical definition of distillation, but it absolutely changes whether you should do it.

The fourth is safety drift. A smaller model trained to imitate helpful responses may not retain all the safety and refusal behaviour of the teacher unless those behaviours are explicitly included and tested.

This article is not legal advice. If you are using third-party model outputs, customer content, or regulated data in a distillation pipeline, review the exact rights, contracts, and data handling duties before proceeding.

What to do next

Choose a single narrow workflow with enough volume to justify the effort. Distillation pays off best when the task is repeated often and the current teacher model is clearly too expensive or too heavy for every request.

Then create a strong evaluation set before training. Include normal cases, hard cases, and business-critical failure cases. Without this, you cannot tell whether the student is genuinely good enough.

Next, confirm rights and inputs. Make sure you are allowed to use the teacher's outputs in the way you intend, and make sure sensitive data is handled appropriately throughout logging, storage, and training.

After that, generate teacher data carefully. Good prompts and reviewed examples matter. Distilling sloppy behaviour just produces a smaller sloppy model.

Finally, deploy gradually. Start with shadow evaluation or limited traffic, measure disagreement with the teacher and humans, and decide explicit fallback paths. A good distillation programme treats the student as an operational model that earns trust step by step.
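A shadow evaluation and a fallback rule can both be expressed compactly. In the sketch below, the log entries, the confidence scores, and the 0.8 threshold are hypothetical, and how a student exposes a confidence value depends on your own stack.

```python
def route(student_label, student_confidence, threshold=0.8):
    """Use the student's answer when it is confident; otherwise fall back."""
    if student_confidence >= threshold:
        return ("student", student_label)
    return ("fallback", None)  # escalate to the teacher or a human reviewer

def disagreement_rate(shadow_log):
    """Share of shadow requests where student and teacher answers differ."""
    differing = sum(1 for entry in shadow_log if entry["student"] != entry["teacher"])
    return differing / len(shadow_log)

# Hypothetical shadow log: both models answered, only the teacher was served.
shadow_log = [
    {"student": "approve", "teacher": "approve", "confidence": 0.95},
    {"student": "reject",  "teacher": "approve", "confidence": 0.55},
    {"student": "approve", "teacher": "approve", "confidence": 0.90},
    {"student": "reject",  "teacher": "reject",  "confidence": 0.60},
]

decisions = [route(e["student"], e["confidence"]) for e in shadow_log]

print(f"disagreement rate: {disagreement_rate(shadow_log):.2f}")
for decision in decisions:
    print(decision)
```

Measuring disagreement while the teacher still serves every request lets a team choose a threshold from evidence before any traffic is moved to the student.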

FAQs

Is distillation the same as model compression?

Distillation is one form of model compression and capability transfer, but not the only one. Quantisation and pruning are other common methods.

Is distillation the same as fine-tuning?

No. Fine-tuning adapts a model with training data. Distillation specifically uses a teacher model's behaviour as supervision.

Does distillation always produce a small language model?

Not always. The student is often smaller, but the core idea is the teacher-student transfer process, not a fixed target size.

Can a distilled model match the teacher?

On a narrow task, sometimes nearly. On broader or harder tasks, usually not completely.

What kind of data does distillation need?

It needs representative prompts and teacher outputs, often combined with gold labels, rationales, or human review for calibration.

Are there legal issues with distillation?

Yes, there can be. The method is technically common, but whether you may use a particular teacher's outputs for training depends on the rights and terms that apply.

Sources

Distilling the Knowledge in a Neural Network (arXiv). Primary academic source for the original teacher-student distillation concept and the idea of compressing stronger systems into cheaper deployable models.
DistilBERT: a distilled version of BERT (arXiv). Primary academic source for a widely known distillation example, including the claim that DistilBERT reduced BERT size by 40 percent while retaining 97 percent of language understanding performance and running 60 percent faster.
TinyBERT: Distilling BERT for Natural Language Understanding (arXiv). Primary academic source for another canonical distillation example, including the two-stage distillation framework and the 7.5x smaller and 9.4x faster result.
AI Insights: Model Distillation (GOV.UK). Primary public-sector source for a concise practical summary of benefits, implementation phases, and use case framing for leaders.