What is MLOps?

MLOps stands for machine learning operations. It is the operating discipline used to move machine learning from notebooks, experiments, and proofs of concept into reliable production use. In practice, that means controlling data inputs, versioning code and models, testing training and deployment pipelines, monitoring live behaviour, handling drift, managing retraining, documenting ownership, and being able to roll back when a change causes harm. MLOps is not just "put the model in production". It is the ongoing work needed to keep a model-backed service dependable, reviewable, and worth running.

Reviewed by Jackie, Head of Learning & Development, Levellers · Last reviewed 8 June 2026

What this means

A simple way to think about MLOps is this: ordinary software already needs release discipline, but machine learning adds another moving part. Instead of only shipping code, you are shipping a model whose behaviour depends on training data, feature generation, evaluation choices, runtime conditions, and the real world staying similar enough to the assumptions made during development.

That creates a practical problem for leaders and operators. A model can appear impressive in a demo and still be fragile in production. It may work well on historical data but degrade when customer behaviour changes. It may be retrained with new data and silently get worse. It may rely on undocumented feature logic or a pipeline nobody can reproduce. MLOps exists to reduce those risks.

In a small or mid-sized organisation, MLOps does not have to mean an enormous platform team. It often starts with lighter habits: storing training code properly, versioning datasets or dataset references, documenting model purpose and owner, checking performance before release, logging real-world behaviour, and deciding in advance what to do when the model starts drifting or fails in production.

Why it matters

MLOps matters because a machine learning system is usually more than a model file. It depends on data engineering, feature preparation, application code, APIs, permissions, storage, runtime environments, support processes, and some form of governance. If any of those pieces are weak, the model can become expensive, untrusted, or unsafe to operate.

That matters even for organisations that are not training frontier models. A retailer doing demand forecasting, a service firm using lead scoring, or an operations team ranking cases by priority can still create real business risk if model behaviour is poorly controlled. Wrong outputs can affect staffing, customer treatment, service quality, or cost allocation. If nobody can say which data version trained the model, which threshold was approved, or why performance changed last month, the organisation is not really operating ML; it is improvising around it.

MLOps also matters because ML systems age. Features change. Source systems change. The world the model learned from changes. Threats change too. Training pipelines and model endpoints need the same kind of disciplined change control and secure development thinking that other important systems need. That is where MLOps connects naturally to extract-transform-load (ETL) and ELT work, API-dependent services, identity and access management (IAM), data loss prevention (DLP) controls, the NIST AI Risk Management Framework (AI RMF), ISO/IEC 42001, and a realistic view of total cost. A model is only valuable if it keeps working in context and can be governed over time.

How it works

In practice, MLOps starts before deployment. Teams need a repeatable way to organise data, code, experiments, and evaluation. Training code should live in version control. The data used for training should be identifiable, not just described vaguely as "latest export". Feature logic needs to be knowable and reproducible. Evaluation criteria should be written down so people can see what "good enough" means before the model reaches production.
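One lightweight way to make training data identifiable is to record a content hash of the exact file used, alongside the code version and the agreed evaluation thresholds. The sketch below uses only the Python standard library. The function names, the manifest fields, and the idea of a single training file are illustrative assumptions, not a prescribed format.

```python
import hashlib
import json
from pathlib import Path
from typing import Dict


def fingerprint_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_training_manifest(
    dataset_path: Path,
    code_version: str,
    evaluation_thresholds: Dict[str, float],
) -> dict:
    """Collect the facts needed to say which data and code produced a model.

    All field names here are hypothetical; adapt them to your own records.
    """
    return {
        "dataset_file": dataset_path.name,
        "dataset_sha256": fingerprint_file(dataset_path),
        "code_version": code_version,  # e.g. a version-control commit id
        "evaluation_thresholds": evaluation_thresholds,
    }


def write_manifest(manifest: dict, out_path: Path) -> None:
    """Store the manifest as readable JSON next to the model artefact."""
    out_path.write_text(json.dumps(manifest, indent=2, sort_keys=True))
```

If the same export is later overwritten or regenerated, the hash changes, which turns "latest export" into something that can actually be checked.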

A workable MLOps flow often includes several connected loops. The first loop is development: experimenting, selecting features, training models, and comparing results. The second is operationalisation: packaging the training process and environment so it can be rerun consistently. The third is deployment: promoting a tested model into a serving environment through a controlled release path. The fourth is live operation: monitoring predictions, data quality, service health, and business outcomes once the model is in use.

Monitoring is where many weak ML programmes start to show strain. It is not enough to know that an endpoint is up. You also need to know whether the inputs still look like the data the model expects, whether output distributions are shifting, whether business performance is weakening, and whether downstream users are seeing odd behaviour. Depending on the use case, that can mean watching data drift, prediction drift, feature attribution drift, error rates, latency, fallbacks, and selected outcome measures. Monitoring should start early, because a model that cannot be observed cannot be managed.
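As an illustration of what "watching data drift" can look like for one numeric feature, the sketch below computes a population stability index (PSI) between a reference sample and a live sample. PSI is only one of several possible drift measures and is swapped in here as an example. The equal-width binning, the small floor that avoids taking the logarithm of zero, and the 0.25 alert level (a commonly quoted rule of thumb, not a standard) are all assumptions.

```python
import math
from typing import List, Sequence

# Assumed alert level; tune it to the use case rather than treating it as fixed.
DRIFT_ALERT_THRESHOLD = 0.25


def population_stability_index(
    expected: Sequence[float],
    actual: Sequence[float],
    bins: int = 10,
) -> float:
    """Compare a live sample with a reference sample using equal-width bins."""
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(expected), max(expected)
    if hi == lo:
        raise ValueError("reference sample has no spread to bin")
    width = (hi - lo) / bins

    def bin_shares(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for value in values:
            index = int((value - lo) / width)
            # Live values outside the reference range fall into the edge bins.
            index = min(max(index, 0), bins - 1)
            counts[index] += 1
        # Floor each share so empty bins do not produce log(0).
        return [max(count / len(values), 1e-6) for count in counts]

    expected_shares = bin_shares(expected)
    actual_shares = bin_shares(actual)
    return sum(
        (a - e) * math.log(a / e)
        for a, e in zip(actual_shares, expected_shares)
    )


def drift_alert(expected: Sequence[float], actual: Sequence[float]) -> bool:
    """Return True when the PSI exceeds the assumed alert level."""
    return population_stability_index(expected, actual) > DRIFT_ALERT_THRESHOLD
```

A check like this covers inputs only. It says nothing about whether predictions or business outcomes have shifted, so it complements rather than replaces the other signals listed above.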

Retraining also needs discipline. New data does not automatically mean better performance. A good MLOps process treats retraining as a controlled change, not as an automatic ritual. You need validation steps, comparison against the current model, approval thresholds, and a rollback path if the new model underperforms. In some settings, scheduled retraining makes sense. In others, drift-based or event-based retraining is safer. Either way, it should have an owner and a documented trigger.
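The idea of retraining as a controlled change can be sketched as a promotion gate: a retrained candidate is accepted only if it clears the approved thresholds and does not regress against the model currently live. The metric choice (precision and recall), the class and function names, and the regression tolerance below are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class EvalResult:
    """Evaluation of one model version on the same held-out data."""
    model_version: str
    precision: float
    recall: float


def promotion_decision(
    current: EvalResult,
    candidate: EvalResult,
    min_precision: float,
    min_recall: float,
    max_regression: float = 0.01,
) -> Tuple[bool, List[str]]:
    """Return (approved, reasons). An empty reasons list means approved."""
    reasons: List[str] = []
    if candidate.precision < min_precision:
        reasons.append("precision below approved threshold")
    if candidate.recall < min_recall:
        reasons.append("recall below approved threshold")
    if candidate.precision < current.precision - max_regression:
        reasons.append("precision regressed against current model")
    if candidate.recall < current.recall - max_regression:
        reasons.append("recall regressed against current model")
    return (not reasons, reasons)
```

When the gate rejects a candidate, the current version simply stays live, and the recorded reasons give the owner something concrete to investigate.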

The operational side matters as much as the modelling side. Teams should know who owns the model, who owns the data pipeline, who owns the API or service using the model, and who responds when something breaks. Incident response should cover model failures as well as infrastructure failures. If a scoring service degrades, produces implausible outputs, or starts operating on corrupted data, the organisation should know whether to pause the service, fall back to a simpler rule, adjust decision thresholds, or revert to a prior version.
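A fallback decision is easier to honour during an incident if it is already built into the scoring path. The sketch below wraps a scoring function so that errors and implausible outputs return a pre-agreed fallback value, together with a label saying which path was taken. It assumes the model returns a probability between 0 and 1; the function name and the label strings are hypothetical.

```python
import math
from typing import Callable, Mapping, Tuple


def score_with_fallback(
    model_score: Callable[[Mapping[str, float]], float],
    features: Mapping[str, float],
    fallback_score: float,
) -> Tuple[float, str]:
    """Return (score, source), where source records which path produced it."""
    try:
        score = model_score(features)
    except Exception:
        # The model call itself failed; use the agreed fallback.
        return fallback_score, "fallback:model_error"
    is_number = isinstance(score, (int, float))
    if not is_number or math.isnan(score) or not 0.0 <= score <= 1.0:
        # Assumed plausibility rule: scores must be probabilities in [0, 1].
        return fallback_score, "fallback:implausible_output"
    return float(score), "model"
```

Logging the returned source label gives the fallback rate mentioned earlier as a monitoring signal, so a silent rise in fallbacks becomes visible.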

This is also why MLOps is not only a toolchain. Registries, orchestration systems, metadata stores, and monitoring platforms can help, but the discipline is broader than the stack. MLOps is the combination of workflow, ownership, quality checks, release control, logging, documentation, and operational learning that makes machine learning usable after the proof of concept stage.

Examples

A logistics firm builds a model to predict failed deliveries. The data science team gets strong results in experiments and wants to deploy quickly. MLOps changes the conversation from "did the model score well?" to "can this service run reliably next month?" The firm versions the training dataset reference, stores the feature logic in code, defines acceptable precision and recall thresholds, packages the scoring environment, and adds production monitoring for feature drift and service latency. When a new postcode format appears in an upstream system, monitoring catches the issue before silent misclassification spreads across the operation.

A financial operations team uses a model to prioritise payment anomalies for human review. The first release works, but later a retraining run changes the alert distribution dramatically. Because the retraining pipeline is versioned and comparison metrics are retained, the team can see that new data treatment altered a feature unexpectedly. They reject the new model, keep the current version live, and fix the data preparation step before trying again. Without those MLOps controls, the change might have been promoted unnoticed.

A mid-sized software company adds an ML-based recommendation service to an existing product. The model itself is only part of the system. The production service also depends on APIs, cached features, user permissions, and a front-end experience. The company uses lightweight MLOps habits: a model registry, controlled deployment, documentation of model purpose, clear service ownership, rollback procedures, and a monthly review of model value against operating cost. It does not need a giant platform to benefit; it needs repeatability and accountability.

Common misunderstandings

One common misunderstanding is that MLOps begins when the model is deployed. That is too late. If data lineage, experiment tracking, evaluation logic, and documentation are missing during development, the organisation is already storing up operational debt.

Another misunderstanding is that MLOps is the same as buying an ML platform. Platforms can be helpful, but weak ownership and weak process do not disappear because a product has a model registry. If no one defines retraining rules, monitors drift, or records approval criteria, the risk remains.

It is also wrong to treat MLOps as something only relevant to firms building advanced proprietary models. Many organisations use third-party frameworks, modest tabular models, or pre-built ML services. They still need MLOps habits if those systems affect real operations. The scale may be smaller, but the discipline still matters.

Finally, MLOps is not the same as compliance. Good MLOps can support better governance, auditability, and security, but it does not make an organisation compliant by itself. Leaders still need appropriate controls, policies, and legal interpretation where required.

Risks and boundaries

The biggest MLOps risks usually come from unmanaged change. Data can drift. Business context can drift. Thresholds can be tweaked without review. Training data can be refreshed without proving that it should be. Feature logic can diverge between training and serving. Teams can lose track of which model version is live. All of these create operational weakness long before a formal incident is declared.

There is also a documentation risk. If a model's purpose, owner, data dependencies, evaluation criteria, and rollback plan are not written down, the organisation becomes dependent on memory. That is fragile even in calm periods and dangerous during incidents or staff turnover. Hidden technical debt is especially common in ML systems because the visible model is only one part of a much larger behaviour chain.

Security and data handling matter too. Training data and inference data may include personal data or commercially sensitive information. Access control, least privilege, redaction, retention decisions, and suitable DLP boundaries still apply. A model pipeline can also become a security and integrity concern if external dependencies, packages, or model artefacts are weakly controlled.

The final boundary is strategic. MLOps helps you run ML systems well, but it does not answer whether the use case should exist, whether the model is ethically appropriate, or whether the economic case remains sound. Those questions belong to governance and to total cost of ownership (TCO). A mature view treats MLOps as one operating layer inside a wider decision model, not as a substitute for product judgement.

What to do next

If you are starting from scratch, do not begin with a platform comparison. Begin with one live or planned ML use case and ask a harder set of questions. What data feeds it? Who owns the model? How is performance checked before release? What is monitored afterwards? When would you retrain? What would make you roll back? Where is the documentation?

If those answers are vague, create a lightweight operating minimum. Put training and scoring code in version control. Record the dataset or dataset reference used for training. Define evaluation thresholds. Store model versions somewhere controlled. Add live monitoring for service health and at least one meaningful model-quality signal. Write down the owner, escalation path, and fallback.
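The operating minimum above can be captured as a small structured record per model, with a check that flags anything left blank. This is a minimal sketch using only the standard library; the field names mirror the items listed above but are otherwise hypothetical.

```python
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class ModelRecord:
    """One record per deployed model, kept under version control."""
    name: str
    version: str
    owner: str
    dataset_reference: str
    evaluation_thresholds: Dict[str, float] = field(default_factory=dict)
    monitored_signals: List[str] = field(default_factory=list)
    escalation_path: str = ""
    fallback: str = ""


def missing_items(record: ModelRecord) -> List[str]:
    """List the fields that are still empty, in declaration order."""
    return [name for name, value in asdict(record).items() if not value]
```

Running the check before each release is one simple way to stop a model going live without an owner, a monitored signal, or a fallback.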

Then decide what can stay lightweight and what needs to mature. A low-impact internal model may only need modest controls. A model that shapes customer treatment, pricing, operational priority, or risk decisions usually needs more disciplined review, logging, and governance. The aim is not to make ML bureaucratic. The aim is to make it operable.

FAQs

Is MLOps just DevOps for data scientists?

Not really. MLOps overlaps with DevOps because both rely on automation, version control, testing, deployment discipline, and monitoring. But MLOps has extra concerns that ordinary software delivery does not fully cover, such as data lineage, training reproducibility, drift detection, evaluation sets, controlled retraining, and model performance in changing real-world conditions. It is better understood as an adjacent discipline than as a simple rebrand.

Do you need MLOps if you use simple machine learning models?

Usually yes, at least in lightweight form. A straightforward regression or classification model can still cause business problems if it is retrained badly, fed the wrong inputs, or shipped without monitoring. Smaller organisations may not need a full MLOps platform, but they do benefit from basic controls such as dataset versioning, documented ownership, release checks, and a plan for rollback or fallback when behaviour changes.

How often should a model be retrained?

There is no single correct schedule. Some models benefit from regular retraining because their environment changes quickly. Others become worse if retrained too often with unstable or noisy data. The safer question is not "how often can we retrain?" but "what evidence tells us retraining is justified?" MLOps helps by treating retraining as a controlled operational change with validation, approval criteria, and comparison against the current model.

Where does MLOps stop and governance begin?

They overlap, but they are not identical. MLOps focuses on the operating discipline needed to run ML systems reliably: data handling, training pipelines, model promotion, monitoring, rollback, and ownership. Governance is wider. It asks whether the use case is appropriate, what risks and controls are acceptable, what oversight is needed, and how AI fits the organisation's policies and obligations. Good MLOps supports governance, but it does not replace it.

Sources

AI Risk Management Framework (NIST). Governance framing, trustworthy AI lifecycle context, and the connection between operating discipline and AI risk management.
NIST AI RMF Playbook (NIST). Operationalising AI RMF outcomes in design, development, deployment, and use.
Secure Software Development Framework (NIST). Secure development, change control, and software lifecycle discipline relevant to ML systems.
Secure Software Development Practices for Generative AI and Dual-Use Foundation Models | NIST SP 800-218A (NIST). Secure development practices specific to AI models and AI systems.
Guidelines for secure AI system development (National Cyber Security Centre). Secure design, deployment, operation, maintenance, and ownership expectations for AI systems.
Secure operation and maintenance (National Cyber Security Centre). Logging, monitoring, updates, and lessons-learned expectations after deployment.
