What is fine-tuning?

Fine-tuning means taking a general-purpose model and training it further on a smaller, task-specific dataset so that it behaves better for a particular use case. You adapt an existing model rather than build one from scratch. It is most useful when prompting alone is not reliable enough and the same task repeats at scale. Fine-tuning changes how a model behaves, not what current facts it knows; for fresh facts you use retrieval instead. It is powerful for consistency and format, but it is not a universal fix and it carries data, cost and lifecycle obligations.

What this means

A foundation model is a generalist, trained broadly so it can do many things passably. Fine-tuning is like onboarding a strong new hire: capable already, but you train them on your specific way of doing one job so they do it consistently and in your house style.

It differs from prompting and from a knowledge base. Prompting gives instructions at run time, every time. Fine-tuning pushes the desired pattern into the model itself, so you need fewer instructions later. A knowledge base, reached through retrieval, supplies current facts. Fine-tuning shapes behaviour and form; it does not keep the model up to date on facts. Keeping that behaviour-versus-knowledge distinction clear is the single most useful idea here.

Why it matters

Prompting has a ceiling. For some tasks, no amount of prompt wording gives reliable, consistent behaviour, especially when the task is narrow, repeats often and must follow a strict format. Fine-tuning raises that ceiling: it improves consistency, can reduce prompt length and cost, and makes behaviour more predictable, which helps governance.

But it is not a universal fix. It adds data handling, cost and a lifecycle to manage. For many problems, a better prompt or retrieval is cheaper and faster. Fine-tuning earns its place only when the problem is genuinely behavioural and high-volume.

How it works

Define the task first

Start by defining exactly what good looks like for one narrow task. Without a clear target, you cannot build a dataset or judge improvement.

Build a dataset of examples

You assemble input and output examples that show the behaviour you want. The most common approach is supervised fine-tuning, where the model learns from labelled input-output pairs. Quality matters more than volume: a smaller, clean, representative dataset usually beats a larger noisy one.
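As a minimal sketch of what such a dataset might look like, the snippet below cleans a handful of labelled input-output pairs and serialises them as JSON Lines, one example per line. The field names `input` and `output`, the sample tickets and the helper names are illustrative assumptions; each fine-tuning service defines its own schema, so check the one you use.

```python
import json

# Hypothetical raw examples for a support-triage task (illustrative data only).
RAW_EXAMPLES = [
    {"input": "I was charged twice this month.", "output": "billing"},
    {"input": "I was charged twice this month.", "output": "billing"},  # duplicate
    {"input": "The app crashes when I log in.", "output": "technical"},
    {"input": "   ", "output": "billing"},  # empty input, should be dropped
    {"input": "How do I close my account?", "output": "account"},
]


def build_dataset(raw_examples):
    """Return cleaned examples: stripped, non-empty and de-duplicated."""
    seen = set()
    cleaned = []
    for example in raw_examples:
        text = example["input"].strip()
        label = example["output"].strip()
        if not text or not label:
            continue  # skip rows with a missing input or output
        key = (text, label)
        if key in seen:
            continue  # skip exact duplicates
        seen.add(key)
        cleaned.append({"input": text, "output": label})
    return cleaned


def to_jsonl(examples):
    """Serialise examples as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(example) for example in examples)


DATASET = build_dataset(RAW_EXAMPLES)
print(to_jsonl(DATASET))
```

Dropping duplicates and empty rows is only the simplest form of cleaning; checking that labels are correct and that examples cover the real range of inputs matters at least as much.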

Efficient and preference-based methods

Full fine-tuning updates all of a model's parameters and is expensive. Parameter-efficient methods such as LoRA (Low-Rank Adaptation) freeze the original weights and train a small number of added low-rank parameters, which cuts cost and memory sharply while matching quality on many tasks; the LoRA authors report reducing trainable parameters by up to 10,000 times and GPU memory by about 3 times compared with full fine-tuning of a 175-billion-parameter model. Beyond supervised learning, preference-based methods align a model to human-preferred responses. Direct Preference Optimization (DPO) learns directly from preference pairs without a separate reward model, while reinforcement learning from human feedback (RLHF), which does train a reward model, was the approach behind early instruction-following models. These methods align behaviour and tone rather than teach facts.
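To make the parameter saving concrete, here is a small sketch of the low-rank idea behind LoRA in plain Python. A frozen weight matrix W of shape d x k is left untouched, and the update is expressed as the product of two small matrices, B (d x r) and A (r x k), so only r * (d + k) values are trained instead of d * k. The sizes d = k = 4096 and rank r = 8 are assumptions chosen for illustration, not figures from the LoRA paper, and the helper names are hypothetical.

```python
def full_params(d, k):
    """Trainable values when every entry of a d x k weight matrix is updated."""
    return d * k


def lora_params(d, k, r):
    """Trainable values for a rank-r update B @ A, with B (d x r) and A (r x k)."""
    return r * (d + k)


def matmul(x, y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def lora_forward(w, a, b, x):
    """Compute (W + B @ A) applied to column vector x, without changing W."""
    base = matmul(w, x)               # frozen path
    update = matmul(b, matmul(a, x))  # trainable low-rank path
    return [[p[0] + q[0]] for p, q in zip(base, update)]


# Assumed sizes for one layer (illustrative only).
D, K, R = 4096, 4096, 8
REDUCTION = full_params(D, K) / lora_params(D, K, R)
print(f"full: {full_params(D, K)}, LoRA: {lora_params(D, K, R)}, ratio: {REDUCTION:.0f}x")

# Tiny demonstration: B starts at zero, so the adapted layer initially
# gives exactly the same output as the frozen layer.
W = [[1.0, 2.0], [3.0, 4.0]]
A = [[0.5, -0.5]]    # r x k with r = 1
B = [[0.0], [0.0]]   # d x r, zero-initialised
X = [[1.0], [1.0]]
print(lora_forward(W, A, B, X))
```

The ratio printed here is for a single layer under the assumed sizes. The overall saving for a whole model depends on the rank chosen and on which layers receive an adapter.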

Evaluate, then deploy carefully

You evaluate on a held-out set the model never trained on, to check that it generalises rather than memorises. Overfitting, where the model parrots training examples but fails on new ones, is the classic failure. After deployment, monitor the model and keep guardrails in place, because its performance can degrade as real-world inputs change.
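One way to sketch this check: split the examples once with a fixed seed, keep the held-out portion away from training, and compare accuracy on both portions. A large gap between training and held-out accuracy suggests memorisation. The helper names, the 80/20 split and the toy `memorising_model` below are assumptions for illustration only.

```python
import random


def split_examples(examples, held_out_fraction=0.2, seed=0):
    """Shuffle with a fixed seed and return (train, held_out) lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) - round(len(shuffled) * held_out_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy(predict, examples):
    """Fraction of examples where predict(input) equals the expected output."""
    if not examples:
        return 0.0
    correct = sum(1 for text, label in examples if predict(text) == label)
    return correct / len(examples)


# Illustrative data: (input, expected label) pairs.
EXAMPLES = [(f"ticket {i}", "billing" if i % 2 == 0 else "technical") for i in range(20)]
TRAIN, HELD_OUT = split_examples(EXAMPLES)

# A deliberately bad "model" that only memorises the training pairs.
MEMORY = dict(TRAIN)


def memorising_model(text):
    return MEMORY.get(text, "unknown")


TRAIN_ACC = accuracy(memorising_model, TRAIN)
HELD_OUT_ACC = accuracy(memorising_model, HELD_OUT)
print(f"train={TRAIN_ACC:.2f} held_out={HELD_OUT_ACC:.2f} gap={TRAIN_ACC - HELD_OUT_ACC:.2f}")
```

The toy model scores perfectly on data it has seen and fails on everything else, which is the overfitting pattern in its most extreme form. Real models fall somewhere in between, so the size of the gap is what to watch.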

What good use cases share

Strong candidates are narrow, high-volume, behavioural rather than fact-based, and measurable. If you cannot define a metric, you cannot tell whether fine-tuning worked.
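As one example of a measurable target, for a task with a fixed output format you could track the share of outputs that parse as JSON and contain exactly the expected fields. The field names, sample outputs and function names below are hypothetical, and format compliance is only one possible metric among many.

```python
import json

# Hypothetical schema for a supplier intake form.
REQUIRED_FIELDS = {"supplier", "invoice_number", "amount"}


def is_compliant(raw_output):
    """True if the output is a JSON object with exactly the required fields."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and set(parsed) == REQUIRED_FIELDS


def compliance_rate(outputs):
    """Share of outputs that pass the format check (0.0 for an empty list)."""
    if not outputs:
        return 0.0
    return sum(is_compliant(o) for o in outputs) / len(outputs)


SAMPLE_OUTPUTS = [
    '{"supplier": "Acme", "invoice_number": "A-1", "amount": 120.5}',
    '{"supplier": "Acme", "amount": 120.5}',   # missing a field
    'Supplier is Acme, amount 120.5',          # not JSON at all
    '{"supplier": "Bolt", "invoice_number": "B-7", "amount": 40}',
]
RATE = compliance_rate(SAMPLE_OUTPUTS)
print(f"format compliance: {RATE:.0%}")
```

A check like this measures form, not correctness: an output can have every field present and still contain the wrong values, so it would normally sit alongside an accuracy measure.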

Prompting versus fine-tuning versus retrieval

Prompting is fastest to try and best for one-off or varied tasks. Retrieval is best when the need is current or specific facts. Fine-tuning is best when you need consistent behaviour or a fixed format on a repeating task. Many real systems combine retrieval for facts with fine-tuning for behaviour.

Examples

Support triage: classifying incoming tickets into the right queue consistently, at volume.

Extraction: turning a supplier email into a structured intake form with the same fields every time.

Brand-safe rewriting: rewriting content into a consistent house style across several languages.

HR query classification: routing staff questions to the correct policy area reliably.

Standard case summaries: producing a fixed summary shape for every case, so downstream steps can rely on the format.

Common misunderstandings

"Fine-tuning gives the model fresh knowledge." It mainly changes behaviour. For current facts, use retrieval.

"It competes with retrieval." They are complementary. Behaviour from fine-tuning, facts from retrieval.

"It is only for large enterprises." Efficient methods such as LoRA have lowered the cost enough for smaller teams with a clear, narrow task.

"More data is always better." Quality and representativeness beat raw volume; noisy data harms results.

"It cures hallucination." It does not. A fine-tuned model can still produce confident, wrong output, especially on facts it was never given.

Risks and boundaries

Formalising bad judgement: if your examples encode flawed decisions, fine-tuning makes those flaws consistent and harder to spot.

Lifecycle drift: the world and your inputs change, so a fine-tuned model needs review and sometimes retraining.

Data governance: training data may contain personal or sensitive information, which brings legal and security obligations. In the UK, the UK GDPR and ICO guidance apply to how that data is used.

Economic cost: data preparation, training, evaluation and ongoing maintenance all cost time and money. Weigh that against simpler approaches.

What to do next

First test whether the problem is behavioural at all. If a better prompt or retrieval would fix it, do that instead. If it is genuinely behavioural, define one narrow, high-volume use case with a clear metric. Build a gold dataset of clean, representative examples. Run a controlled comparison: prompting alone, retrieval, and a fine-tuned model, measured on the same held-out set. Version your datasets and models, and schedule review, because a fine-tuned model is a living asset, not a one-off delivery. The most common reason projects fail is starting with data and tooling before defining the task and the metric.
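The controlled comparison can be sketched as a small harness that scores each candidate on the same held-out examples with the same metric. The three functions below are hypothetical stand-ins written as simple rules; in practice each would wrap a real prompted, retrieval-backed or fine-tuned model, and the scores this toy prints say nothing about how the approaches compare in reality.

```python
def accuracy(predict, examples):
    """Fraction of held-out examples the system labels correctly."""
    correct = sum(1 for text, label in examples if predict(text) == label)
    return correct / len(examples)


def compare(systems, held_out):
    """Score every system on the same held-out set, best first."""
    scores = {name: accuracy(predict, held_out) for name, predict in systems.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Illustrative held-out set of (input, expected queue) pairs.
HELD_OUT = [
    ("refund for double charge", "billing"),
    ("cannot reset password", "account"),
    ("invoice shows wrong VAT", "billing"),
    ("app freezes on startup", "technical"),
]


# Hypothetical stand-ins for the three approaches under test.
def prompted(text):
    return "billing"  # always guesses the most common queue


def retrieval_backed(text):
    return "billing" if "charge" in text or "invoice" in text else "account"


def fine_tuned(text):
    if "charge" in text or "invoice" in text:
        return "billing"
    return "account" if "password" in text else "technical"


RESULTS = compare(
    {"prompt only": prompted, "retrieval": retrieval_backed, "fine-tuned": fine_tuned},
    HELD_OUT,
)
for name, score in RESULTS:
    print(f"{name}: {score:.2f}")
```

The point of the design is that the held-out set and the metric are fixed before any system is scored, so a difference in the results reflects the systems rather than the test.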

FAQs

What is the difference between fine-tuning and training from scratch?

Training from scratch builds a model from nothing at huge cost. Fine-tuning adapts an existing model on a smaller dataset, which is far cheaper and faster.

When should I fine-tune instead of writing a better prompt?

When prompting cannot give reliable, consistent behaviour and the task repeats at scale. If a prompt fixes it, prefer the prompt.

Does fine-tuning replace retrieval?

No. Retrieval supplies current facts; fine-tuning shapes behaviour and format. They are often used together.

Can fine-tuning improve structured outputs?

Yes. It is well suited to enforcing a consistent format, such as a fixed set of fields or a standard summary shape.

Is more training data always better?

No. Clean, representative data matters more than volume. Noisy data can make results worse.

Does fine-tuning remove hallucinations?

No. A fine-tuned model can still produce confident, wrong output, particularly on facts it was not given.

What is LoRA?

A parameter-efficient fine-tuning method that trains a small set of added parameters while freezing the original model, cutting cost and memory.

What is the biggest reason fine-tuning projects fail?

Starting without a clear task definition and metric, so there is no way to build the right dataset or to tell whether it worked.

Sources