A Guide to Monitoring Machine Learning Models in Production

Abstract

An NVIDIA Developer Blog overview of production ML model monitoring, arguing that monitoring must start before deployment and that ML systems require different monitoring approaches than traditional software. The article distinguishes two stakeholder perspectives — data scientists (concerned with functional objectives: data quality, model accuracy, prediction validity) and engineers (concerned with operational objectives: latency, memory, uptime) — and argues both perspectives must be synthesized in any adequate monitoring strategy. Monitoring is structured across two levels: functional (input data quality and drift, model drift and versioning, output predictions and ground truth) and operational (system performance metrics, pipeline health, cost tracking). Three failure modes specific to ML systems are highlighted: entanglement (any feature change affects all predictions), configuration sensitivity (incorrect hyperparameters/versions produce valid-looking but wrong outputs without raising exceptions), and the responsibility challenge (multiple stakeholders with different monitoring vocabularies). Tooling covered includes Prometheus + Grafana (time-series metrics and dashboards, including NVIDIA Triton’s native Prometheus export), Evidently AI (open-source drift and data quality analysis), and Amazon SageMaker Model Monitor (managed monitoring with no-code and custom analysis modes).


Key Concepts

  • Functional vs Operational Monitoring: Functional monitoring tracks data, model, and output correctness (data scientist’s domain); operational monitoring tracks system resources, pipeline health, and cost (engineer’s domain). Adequate monitoring requires both
  • Data Drift: Changes in the statistical distribution of production input features relative to the training distribution; detected via statistical tests on feature value distributions over time; signals that the model may need retraining
  • Model Drift: Decay of a model’s predictive power due to real-world environment changes; detected by tracking predictive performance metrics over time and applying statistical tests to confirm the decline
  • Prediction Drift: Used as a proxy when ground-truth labels are unavailable — a sudden distributional shift in model outputs signals that something has gone wrong upstream (data change, pipeline failure, or genuine environmental shift)
  • Ground Truth Labels: When obtainable (e.g. click/no-click outcomes), model predictions can be evaluated against actuals; most ML use cases cannot obtain ground truth promptly, necessitating proxy metrics
  • Entanglement: The ML-specific failure mode where changing any input feature distribution affects the approximation of the target function and potentially all predictions — “changing anything changes everything”
  • Configuration Sensitivity: A machine learning system can produce incorrect-but-valid outputs without raising exceptions, unlike traditional software — a misconfigured hyperparameter or wrong model version silently degrades performance
  • Prometheus + Grafana: Open-source event monitoring (Prometheus) and visualization (Grafana) stack; NVIDIA Triton Inference Server exports GPU/CPU use, memory, and latency metrics natively in Prometheus format
  • Evidently AI: Open-source Python library for monitoring and analyzing ML models in production; supports drift detection and data quality reports
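
The drift concepts above all reduce to the same mechanical step: comparing a production feature distribution against a training-time baseline with a statistical test. As an illustrative sketch (not from the article), here is a two-sample Kolmogorov–Smirnov check in pure Python; in practice a library call such as scipy.stats.ks_2samp, or a tool like Evidently AI, would do this work:

```python
import bisect


def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap
    between the empirical CDFs of the two samples."""
    ref = sorted(reference)
    cur = sorted(current)

    def ecdf(sample, x):
        # Fraction of sample values <= x
        return bisect.bisect_right(sample, x) / len(sample)

    all_values = sorted(set(ref) | set(cur))
    return max(abs(ecdf(ref, x) - ecdf(cur, x)) for x in all_values)


def drifted(reference, current, threshold=0.1):
    """Flag a feature as drifted when the KS statistic exceeds a
    hand-picked threshold (in practice, tuned per feature)."""
    return ks_statistic(reference, current) > threshold
```

Identical samples give a statistic of 0.0; completely disjoint samples give 1.0, so the threshold sets how much distributional overlap still counts as "the same".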

Key Claims and Findings

  • ML system monitoring is harder than traditional software monitoring because behavior depends on data and model (not just code), and a system can produce incorrect but structurally valid outputs without error
  • Monitoring should begin during model development and experimentation — not only post-deployment; establishing baselines early improves production monitoring quality
  • Sudden major performance degradation is a red flag requiring immediate investigation; gradual degradation is expected and can be managed with retraining schedules
  • A weak monitoring system leads to: (1) poor-performing models left in production undetected; (2) models that no longer deliver business value; (3) uncaught bugs that compound over time
  • When ground truth is unavailable, prediction drift is a valid proxy for model quality monitoring
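
One common way to quantify prediction drift when ground truth is delayed is the Population Stability Index over binned model outputs. This is a standard industry technique, not one prescribed by the article; a minimal pure-Python sketch:

```python
import math


def psi(reference, current, bins=10):
    """Population Stability Index between two samples of model
    outputs (e.g. predicted probabilities in [0, 1]).

    A common rule of thumb (industry convention, not from the
    article): PSI < 0.1 stable, 0.1-0.25 moderate shift,
    > 0.25 major shift worth investigating.
    """
    lo = min(min(reference), min(current))
    hi = max(max(reference), max(current))
    width = (hi - lo) / bins or 1.0  # guard against a zero-width range

    def bin_fractions(sample):
        counts = [0] * bins
        for x in sample:
            idx = min(int((x - lo) / width), bins - 1)
            counts[idx] += 1
        # Small floor avoids log(0) / division by zero for empty bins
        return [max(c / len(sample), 1e-6) for c in counts]

    ref_pct = bin_fractions(reference)
    cur_pct = bin_fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_pct, cur_pct))
```

Identical output distributions score 0; a model whose score distribution collapses to a different region scores far above the 0.25 alarm level.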

Two-Level Monitoring Framework

Functional Level

Category | What to Monitor
Input data | Data quality (type consistency, schema validity), data drift (feature distribution shifts)
Model | Model drift (predictive power decay), correct version in production
Output | Ground truth comparison (when available), prediction drift (when ground truth is unavailable)
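
The input-data row above names type consistency and schema validity as the things to check. A minimal sketch of such a check, with a hypothetical schema (field names and rules invented for illustration):

```python
# Hypothetical input schema: field name -> (expected type, nullable)
SCHEMA = {
    "user_id": (int, False),
    "session_length": (float, True),
    "country": (str, False),
}


def validate_record(record, schema=SCHEMA):
    """Return a list of schema violations for one input record;
    an empty list means the record passes."""
    errors = []
    for field, (expected_type, nullable) in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
            continue
        value = record[field]
        if value is None:
            if not nullable:
                errors.append(f"null in non-nullable field: {field}")
        elif not isinstance(value, expected_type):
            errors.append(f"{field}: expected {expected_type.__name__}, "
                          f"got {type(value).__name__}")
    return errors
```

A monitoring job would run this per record (or per batch) and alert when the violation rate rises, which catches silent upstream schema changes before they reach the model.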

Operational Level

Category | What to Monitor
System performance | Memory use, latency, CPU/GPU utilization
Pipelines | Data pipeline health (failures, missing data), model pipeline dependencies
Costs | Cloud resource usage, training and inference costs; set budget alerts
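
For the system-performance row, a toy in-process latency tracker illustrates the idea of alerting against a percentile budget. This is illustrative only (window size and budget below are arbitrary); in a real deployment these values would be exported as Prometheus metrics rather than computed in-process:

```python
from collections import deque


class LatencyMonitor:
    """Rolling window of request latencies with a p99 budget check."""

    def __init__(self, window=1000, p99_budget_ms=250.0):
        self.samples = deque(maxlen=window)  # oldest samples drop off
        self.p99_budget_ms = p99_budget_ms

    def record(self, latency_ms):
        self.samples.append(latency_ms)

    def percentile(self, pct):
        # Nearest-rank percentile over the current window
        ordered = sorted(self.samples)
        idx = max(0, int(len(ordered) * pct / 100.0 + 0.5) - 1)
        return ordered[idx]

    def breaches_budget(self):
        """True when the observed p99 exceeds the latency budget."""
        return self.percentile(99) > self.p99_budget_ms
```

Tracking a tail percentile rather than the mean is the usual choice here: a small fraction of very slow requests moves p99 immediately while barely touching the average.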

Tooling Summary

Tool | Type | Key Use
Prometheus | Open-source | Time-series metrics scraping and alerting; NVIDIA Triton exports natively
Grafana | Open-source | Visualization dashboards on top of Prometheus data
Evidently AI | Open-source Python | Data drift detection, data quality reports, model performance analysis
Amazon SageMaker Model Monitor | Managed | Automated model quality monitoring with alerting; no-code and custom modes
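
Because Triton exports metrics natively in Prometheus format, wiring the two together takes only a scrape config. A minimal sketch, assuming Triton's default metrics endpoint on port 8002 and a single local instance:

```yaml
# prometheus.yml -- minimal scrape config (sketch; target assumes
# Triton's default metrics endpoint at localhost:8002/metrics)
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "triton"
    static_configs:
      - targets: ["localhost:8002"]
```

Grafana then points at Prometheus as a data source and dashboards the scraped GPU/CPU, memory, and latency series.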

Best Practices

  1. Monitoring starts during development, not deployment
  2. Major performance degradation warrants immediate investigation; gradual decline triggers scheduled retraining
  3. Document a troubleshooting framework so teams move from alert to action systematically
  4. Have a break-glass plan of action ready before failures occur
  5. Use prediction drift or proxy metrics when ground truth labels are unavailable in real time

Terminology

  • Data Drift: Statistical distribution shift in production input features relative to training data
  • Model Drift: Decay of predictive power over time due to real-world changes; distinct from data drift (data drift is one possible cause; model drift is the observed effect)
  • Prediction Drift: Distributional shift in model outputs; a proxy for model quality when ground truth is unavailable
  • KPI (Key Performance Indicator): Business-level metric used to evaluate whether model outputs meet operational targets
  • Proxy Metric: A measurable substitute used when the true metric (e.g. ground truth accuracy) is unavailable or delayed

Connections to Existing Wiki Pages

  • Observability Concepts (LangSmith) — LangSmith’s traces/runs/feedback hierarchy is an LLM-agent-specific implementation of the functional monitoring layer described here; the two articles are complementary in scope
  • AI Agent Evaluation — Summary — agent-specific evaluation metrics (task completion, tool-call accuracy, LLM-as-a-judge) extend the general ML monitoring framework here to the agentic domain
  • Monitoring ML Models: Data Quality and Integrity — a companion article (by Evidently AI) focused specifically on input data quality monitoring, covering the data processing issues, schema changes, and data loss categories introduced here
  • Triton Inference Server Backend — NVIDIA Triton exports system metrics (GPU/CPU use, memory, latency) natively in Prometheus format, directly enabling the operational-level monitoring stack described here