What is AI inference?
AI inference is the act of running a trained model on new input to produce an output, such as a prediction, classification, summary, score or generated response. Training is when a model learns from data; inference is when it does the job. Inference is where commercial reality shows up: response times, error rates, running costs, security and quality are all decided here. Inference is not the same as serving, which is the surrounding production machinery, including the request handling, scaling and monitoring that delivers inference reliably to users.
What this means
Think of a model in two modes. Training is study mode, where it learns patterns from large amounts of data, a slow and expensive process done in advance. Inference is exam mode, where it applies what it learned to a new question and returns an answer. Every time you get a response from an AI assistant, classify a document or score a transaction, that is inference.
The distinction matters because most of the day-to-day cost, speed and risk of AI sits in inference, not training. You train once, or occasionally; you run inference constantly. Serving is a related but separate idea: it is the production system around inference that accepts requests, manages load, returns results and watches for problems.
Why it matters
Inference is where cost, speed and reliability are won or lost. A model that is accurate in testing can still be too slow, too expensive or too fragile in production. Those properties are inference and serving concerns, and they shape user experience directly.
It also matters for governance and budgeting. Sensitive data often passes through prompts at inference time, so data controls belong here. And because you pay per use, inference cost scales with volume in a way training cost does not. The good news is that the unit cost has fallen sharply: the Stanford AI Index 2025 reports that the inference cost for a system scoring at the level of GPT-3.5 dropped more than 280-fold, from about 20 dollars per million tokens in November 2022 to about 0.07 dollars per million tokens by October 2024. Unit price is not the whole picture, though: the choice between live, per-request inference and overnight batch inference still changes both the cost and the experience you can offer.
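The arithmetic behind those figures is easy to check, and the same calculation doubles as a first-pass budget. This is a minimal sketch using the two prices quoted above; the workload (requests per day, tokens per request) is an invented figure for illustration only, and real pricing often distinguishes input from output tokens.

```python
# Prices quoted above from the Stanford AI Index 2025 (US dollars per million tokens).
PRICE_NOV_2022 = 20.00
PRICE_OCT_2024 = 0.07

def monthly_cost(requests_per_day, tokens_per_request, price_per_million, days=30):
    """Estimated spend for a per-token priced model over `days` days."""
    total_tokens = requests_per_day * tokens_per_request * days
    return total_tokens / 1_000_000 * price_per_million

fold_reduction = PRICE_NOV_2022 / PRICE_OCT_2024  # roughly 286, i.e. "more than 280-fold"

# Hypothetical workload (an assumption): 50,000 requests a day at 1,500 tokens each.
cost_then = monthly_cost(50_000, 1_500, PRICE_NOV_2022)
cost_now = monthly_cost(50_000, 1_500, PRICE_OCT_2024)

print(f"fold reduction: {fold_reduction:.0f}x")
print(f"monthly cost at the 2022 price: ${cost_then:,.2f}")
print(f"monthly cost at the 2024 price: ${cost_now:,.2f}")
```

Ten times the traffic still means ten times the bill, which is why volume needs modelling even when the unit price looks negligible.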
How it works
The basic flow
Input is prepared, the model runs a forward pass to compute an output, and the result is post-processed and returned. For a language model, the input includes the prompt, and longer prompts mean more to process, which raises both latency and cost.
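The three stages can be sketched with a toy model. Everything here is hypothetical: the word weights stand in for parameters learned during training, and a real model would have far more of them, but the shape of the flow (prepare, forward pass, post-process) is the same.

```python
import math

# Hypothetical "trained" weights for a toy spam scorer; they stay fixed at inference time.
WEIGHTS = {"free": 1.8, "winner": 2.1, "invoice": -1.2, "meeting": -1.5}
BIAS = -0.5

def preprocess(text):
    """Prepare the input: lowercase the text and count words."""
    counts = {}
    for word in text.lower().split():
        word = word.strip(".,!?")
        counts[word] = counts.get(word, 0) + 1
    return counts

def forward(features):
    """Forward pass: a weighted sum followed by a sigmoid. No weights are updated."""
    z = BIAS + sum(WEIGHTS.get(word, 0.0) * n for word, n in features.items())
    return 1.0 / (1.0 + math.exp(-z))

def postprocess(score, threshold=0.5):
    """Turn the raw score into the answer the caller needs."""
    return {"label": "spam" if score >= threshold else "ok", "score": round(score, 3)}

def infer(text):
    return postprocess(forward(preprocess(text)))

print(infer("You are a winner! Claim your free prize"))
print(infer("Invoice attached for the meeting"))
```

Note that a longer input means more words for `preprocess` and `forward` to handle, which is the toy equivalent of longer prompts costing more.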
Latency versus throughput
Latency is how long one response takes; throughput is how many responses you handle per unit of time. Optimising for one can hurt the other. Interactive tools prize low latency; bulk jobs prize high throughput.
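A simple cost model shows why the two pull against each other. The sketch below assumes each batch costs a fixed overhead plus a per-item cost; both numbers are made up for illustration, and real hardware does not scale this neatly.

```python
def batch_stats(batch_size, fixed_ms=40.0, per_item_ms=5.0):
    """Latency and throughput for one batch under an assumed linear cost model."""
    latency_ms = fixed_ms + per_item_ms * batch_size   # every item waits for the whole batch
    throughput = batch_size / (latency_ms / 1000.0)    # items completed per second
    return latency_ms, throughput

for size in (1, 8, 32):
    latency_ms, throughput = batch_stats(size)
    print(f"batch={size:>2}  latency={latency_ms:6.1f} ms  throughput={throughput:6.1f} items/s")
```

Under these assumptions, larger batches spread the fixed overhead across more items, so throughput rises, but each individual response takes longer to come back.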
Online versus batch
Online inference answers requests live, one or a few at a time, for interactive use. Batch inference processes many items together, often overnight, when immediacy is not needed. A related technique, dynamic or continuous batching, groups incoming live requests on the fly to use the hardware better. Modern serving systems schedule new requests into a running batch as slots free up, which raises throughput substantially; the vLLM project reports up to 24 times higher throughput than the standard Hugging Face Transformers library through a combination of efficient memory management (its PagedAttention technique) and this kind of batching.
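The benefit of filling slots as they free up can be seen in a small simulation. This is a deliberately simplified sketch, not how vLLM or any particular server is implemented: it assumes all requests are already queued, every active request advances one step at a time, and memory limits and prompt processing are ignored. The request lengths are hypothetical.

```python
from collections import deque

def static_batching_steps(lengths, slots):
    """Each batch runs until its longest request finishes; freed slots sit idle."""
    steps = 0
    for i in range(0, len(lengths), slots):
        steps += max(lengths[i:i + slots])
    return steps

def continuous_batching_steps(lengths, slots):
    """A waiting request takes over a slot as soon as one frees up."""
    queue = deque(lengths)
    active = []   # remaining steps for each running request
    steps = 0
    while queue or active:
        while queue and len(active) < slots:
            active.append(queue.popleft())
        steps += 1                                  # one step for every active request
        active = [r - 1 for r in active if r > 1]   # drop requests that just finished
    return steps

# Hypothetical output lengths (in steps) for eight queued requests.
lengths = [2, 10, 3, 2, 9, 2, 3, 2]
print("static:", static_batching_steps(lengths, slots=4))
print("continuous:", continuous_batching_steps(lengths, slots=4))
```

With these lengths the static scheme needs 19 steps and the continuous scheme 11, because short requests no longer hold a slot hostage while the longest one in their batch finishes.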
Hardware and quantization
Inference runs on CPUs, GPUs, TPUs or other accelerators. Quantization reduces the numerical precision of a model, for example from 16-bit or 32-bit floating point down to 8-bit integer (INT8), 8-bit floating point (FP8) or even 4-bit, to cut memory use and speed up inference. Recent evaluations find FP8 is effectively lossless across model sizes, well-tuned INT8 typically loses only one to three percent of accuracy, and 4-bit weight-only formats are more competitive than once assumed. The point is that quantization involves trade-offs you should measure, not assume away.
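To make the idea concrete, here is a minimal sketch of one common scheme, symmetric per-tensor INT8 quantization, applied to a handful of made-up weights. Production toolchains use more refined methods (per-channel scales, calibration data), so treat this as an illustration of the principle rather than a recipe.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from the stored integers."""
    return [x * scale for x in q]

# Hypothetical weights standing in for one layer of a model.
weights = [0.82, -1.27, 0.003, 0.45, -0.61, 1.10]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("int8 values:", q)
print(f"scale: {scale:.5f}  max round-trip error: {max_error:.5f}")
```

Each value now fits in one byte instead of two or four, at the price of a rounding error of at most half the scale. The smallest weight here rounds to zero, which is exactly the kind of effect to measure on your own data.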
The serving layer
A serving layer wraps the model so applications can call it. Established options include ONNX Runtime and NVIDIA Triton; for large language models, systems such as vLLM are widely used because their memory management and continuous batching raise throughput. Retrieval, when used, adds latency because the system must fetch content before generating.
Examples
A customer service assistant answering live, where low latency matters most.
An overnight batch job scoring thousands of invoices, where throughput matters and latency does not.
Edge vision on a production line detecting defects, where inference runs locally for speed and resilience.
Search and summarisation, where retrieval adds a step before the model generates.
Nightly lead scoring, where a model ranks records in bulk for the sales team to act on the next day.
Common misunderstandings
"Inference and training are the same." They are different stages. Training learns; inference applies.
"A live request updates the model." It does not. Standard inference does not change the model's weights.
"Inference is only about model speed." It also covers cost, reliability, security and the serving system around it.
"Batch inference is just slow real-time." Batch is a deliberate design for volume, not a degraded live service.
"Optimisation has no trade-offs." Quantization and batching trade some accuracy or latency for speed or cost. Measure the effect.
"The model is the product." The product is the served system: model plus pipeline, controls and monitoring.
Risks and boundaries
Sensitive inputs in prompts need handling, because data passes through the system at inference time.
Delays and thresholds matter. Slow responses or badly set decision thresholds degrade the experience and can cause wrong actions.
Retrieval staleness undermines answers when the underlying content is out of date.
Under-budgeting inference cost is common, because per-use cost scales with volume. Model it before scaling.
No factual-truth guarantee. A model can produce fluent but wrong output; inference does not verify facts.
The environmental and cost footprint of running models at scale is real and should be tracked.
What to do next
Define the service need first: is this interactive or bulk, and what response time is acceptable?
Benchmark on realistic load before scaling, not just on a single test prompt.
Separate the model from the pipeline so each can be changed independently.
Control input size, since prompt length drives cost and latency.
Evaluate optimisation such as quantization on your own data, measuring any accuracy change.
Instrument the service for latency, error rate and cost, and design fallbacks for when the model is slow or unavailable.
Standardised AI terminology and recognised risk frameworks help you set these expectations consistently across teams.
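The instrumentation, input-size and fallback steps can be combined in a thin wrapper around the model call. The sketch below is one possible shape, not a prescribed design: `model_fn`, the metrics dictionary and the character cap are all hypothetical names, and it only catches errors raised by the call. Enforcing a real time limit would be the job of whichever client library you use.

```python
import time

def call_with_fallback(model_fn, prompt, metrics, max_prompt_chars=4000,
                       fallback="Sorry, please try again shortly."):
    """Run one inference call, record latency and errors, and degrade gracefully."""
    prompt = prompt[:max_prompt_chars]   # cap input size to bound cost and latency
    start = time.perf_counter()
    try:
        result = model_fn(prompt)
        ok = True
    except Exception:
        result = fallback                # serve a safe default instead of failing
        ok = False
    metrics["calls"] += 1
    metrics["errors"] += 0 if ok else 1
    metrics["latency_s"].append(time.perf_counter() - start)
    return result

# Hypothetical stand-ins for a real model client.
def healthy_model(prompt):
    return f"echo: {prompt}"

def broken_model(prompt):
    raise TimeoutError("model unavailable")

metrics = {"calls": 0, "errors": 0, "latency_s": []}
print(call_with_fallback(healthy_model, "hello", metrics))
print(call_with_fallback(broken_model, "hello", metrics))
print("calls:", metrics["calls"], "errors:", metrics["errors"])
```

Keeping the wrapper separate from the model function also reflects the advice to separate model from pipeline: either side can be swapped without touching the other.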
FAQs
Is inference the same as asking ChatGPT a question?
Yes, that is one example. When the model generates a reply to your prompt, it is running inference.
Does inference train the model?
No. Standard inference applies a fixed model and does not change its weights. Learning happens in training.
What is the difference between inference and serving?
Inference is running the model on input. Serving is the production machinery around it that handles requests, scaling and monitoring.
What is the difference between batch and online inference?
Online answers requests live for interactive use. Batch processes many items together, often overnight, when immediacy is not required.
Why does prompt length matter?
Longer prompts give the model more to process, which raises both latency and cost for each response.
What is quantization?
Reducing a model's numerical precision, for example to 8-bit integer, 8-bit floating point or 4-bit, to cut memory and speed up inference, with a trade-off in accuracy you should measure.
Why is inference cost a concern if models are cheap to call?
Because you pay per use and volume adds up. Unit costs have fallen sharply, but high-volume workloads still need budgeting.
What hardware runs inference?
CPUs, GPUs, TPUs and other accelerators, chosen by workload, latency target and cost.
Sources
ISO/IEC 22989:2022 Information technology, Artificial intelligence, AI concepts and terminology (ISO/IEC). Standard definitions for training, inference and related AI concepts.
Artificial Intelligence Risk Management Framework (AI RMF 1.0), NIST AI 100-1 (NIST). Risk and measurement framing for deployed AI systems including testing and monitoring.
The 2025 AI Index Report (Stanford HAI). Evidence on the more-than-280-fold fall in inference cost and improving hardware efficiency.
Give Me BF16 or Give Me Death? Accuracy-Performance Trade-Offs in LLM Quantization (arXiv). Current evidence on FP8, INT8 and 4-bit quantization accuracy and deployment trade-offs.
