A Guide to Monitoring Machine Learning Models in Production
Abstract
An NVIDIA Developer Blog overview of production ML model monitoring, arguing that monitoring must start before deployment and that ML systems require different monitoring approaches than traditional software. The article distinguishes two stakeholder perspectives — data scientists (concerned with functional objectives: data quality, model accuracy, prediction validity) and engineers (concerned with operational objectives: latency, memory, uptime) — and argues both perspectives must be synthesized in any adequate monitoring strategy. Monitoring is structured across two levels: functional (input data quality and drift, model drift and versioning, output predictions and ground truth) and operational (system performance metrics, pipeline health, cost tracking). Three failure modes specific to ML systems are highlighted: entanglement (any feature change affects all predictions), configuration sensitivity (incorrect hyperparameters/versions produce valid-looking but wrong outputs without raising exceptions), and the responsibility challenge (multiple stakeholders with different monitoring vocabularies). Tooling covered includes Prometheus + Grafana (time-series metrics and dashboards, including NVIDIA Triton’s native Prometheus export), Evidently AI (open-source drift and data quality analysis), and Amazon SageMaker Model Monitor (managed monitoring with no-code and custom analysis modes).
Key Concepts
Functional vs Operational Monitoring: Functional monitoring tracks data, model, and output correctness (data scientist’s domain); operational monitoring tracks system resources, pipeline health, and cost (engineer’s domain). Adequate monitoring requires both
Data Drift: Changes in the statistical distribution of production input features relative to the training distribution; detected via statistical tests on feature value distributions over time (a minimal detection sketch follows this list); signals that the model may need retraining
Model Drift: Decay of a model’s predictive power due to real-world environment changes; requires monitoring predictive performance metrics over time; detected with statistical tests
Prediction Drift: Used as a proxy when ground-truth labels are unavailable — a sudden distributional shift in model outputs signals that something has gone wrong upstream (data change, pipeline failure, or genuine environmental shift)
Ground Truth Labels: When obtainable (e.g. click/no-click outcomes), model predictions can be evaluated against actuals; most ML use cases cannot obtain ground truth promptly, necessitating proxy metrics
Entanglement: The ML-specific failure mode where changing any input feature distribution affects the approximation of the target function and potentially all predictions — “changing anything changes everything”
Configuration Sensitivity: A machine learning system can produce incorrect-but-valid outputs without raising exceptions, unlike traditional software — a misconfigured hyperparameter or wrong model version silently degrades performance
Prometheus + Grafana: Open-source event monitoring (Prometheus) and visualization (Grafana) stack; NVIDIA Triton Inference Server exports GPU/CPU use, memory, and latency metrics natively in Prometheus format
Evidently AI: Open-source Python library for monitoring and analyzing ML models in production; supports drift detection and data quality reports
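The Data Drift and Prediction Drift entries above both come down to the same operation: comparing a production distribution against a reference distribution with a statistical test. A minimal sketch of that check, using SciPy's two-sample Kolmogorov-Smirnov test on a single numeric column; the synthetic data, column usage, and significance level are illustrative assumptions, not values from the article:

```python
import numpy as np
from scipy import stats

def detect_drift(reference: np.ndarray, current: np.ndarray,
                 alpha: float = 0.05) -> bool:
    """Two-sample Kolmogorov-Smirnov test: True if the current (production)
    sample's distribution differs from the reference (training) sample
    at significance level alpha."""
    result = stats.ks_2samp(reference, current)
    return result.pvalue < alpha

# Illustrative usage: the same check works for an input feature (data drift)
# or for model outputs (prediction drift as a proxy when labels are missing).
rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)  # training-time values
prod_feature = rng.normal(loc=0.4, scale=1.0, size=5_000)   # shifted production values

if detect_drift(train_feature, prod_feature):
    print("Feature distribution drifted -- investigate and consider retraining")
```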
Key Claims and Findings
ML system monitoring is harder than traditional software monitoring because behavior depends on data and model (not just code), and a system can produce incorrect but structurally valid outputs without error
Monitoring should begin during model development and experimentation — not only post-deployment; establishing baselines early improves production monitoring quality
Sudden major performance degradation is a red flag requiring immediate investigation; gradual degradation is expected and can be managed with retraining schedules (see the rolling-window sketch after this list)
A weak monitoring system leads to: (1) poor-performing models left in production undetected; (2) models that no longer deliver business value; (3) uncaught bugs that compound over time
When ground truth is unavailable, prediction drift is a valid proxy for model quality monitoring
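One way to operationalize the sudden-vs-gradual distinction above is to track a performance metric over a rolling window of recently labeled predictions and alert only on an abrupt drop relative to the validation-time baseline. A minimal sketch, assuming ground-truth labels eventually arrive; the window size and drop threshold are illustrative choices, not values from the article:

```python
from collections import deque

class RollingAccuracyMonitor:
    """Tracks accuracy over the most recent `window` labeled predictions and
    flags an abrupt drop below a fraction of the baseline accuracy."""

    def __init__(self, baseline_accuracy: float, window: int = 500,
                 drop_fraction: float = 0.8):
        self.baseline = baseline_accuracy   # accuracy measured at validation time
        self.window = deque(maxlen=window)  # 1 = correct, 0 = incorrect
        self.drop_fraction = drop_fraction  # alert if accuracy < drop_fraction * baseline

    def record(self, prediction, ground_truth) -> None:
        self.window.append(int(prediction == ground_truth))

    def current_accuracy(self) -> float:
        return sum(self.window) / len(self.window) if self.window else self.baseline

    def sudden_degradation(self) -> bool:
        # Only judge once the window is full, so a few early errors don't alarm.
        return (len(self.window) == self.window.maxlen and
                self.current_accuracy() < self.drop_fraction * self.baseline)
```

Gradual decay shows up as a slow downward trend in current_accuracy() and can be absorbed by a retraining schedule; sudden_degradation() is the red-flag case calling for immediate investigation.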
Two-Level Monitoring Framework
Functional Level
Input data: Data quality (type consistency, schema validity); data drift (feature distribution shifts)
Model: Model drift (predictive power decay); correct model version in production
Output: Ground truth comparison (when available); prediction drift (when ground truth is unavailable)
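The Input data row above bundles two different checks: data quality (types and schema) and data drift (distributions; see the sketch under Key Concepts). A minimal data-quality sketch against a pandas DataFrame; the expected_schema mapping and column names are hypothetical, written as if the schema were recorded at training time:

```python
import pandas as pd

# Hypothetical expected schema captured at training time: column -> dtype.
expected_schema = {"age": "int64", "income": "float64", "country": "object"}

def validate_batch(batch: pd.DataFrame, schema: dict) -> list[str]:
    """Returns human-readable data-quality problems found in a production batch."""
    problems = []
    for column, dtype in schema.items():
        if column not in batch.columns:
            problems.append(f"missing column: {column}")
        elif str(batch[column].dtype) != dtype:
            problems.append(f"{column}: expected {dtype}, got {batch[column].dtype}")
    # Flag columns with missing values, a common symptom of upstream pipeline failures.
    null_counts = batch.isna().sum()
    for column, count in null_counts[null_counts > 0].items():
        problems.append(f"{column}: {count} missing values")
    return problems
```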
Operational Level
System performance: Memory use, latency, CPU/GPU utilization
Pipelines: Data pipeline health (failures, missing data), model pipeline dependencies
Costs: Cloud resource usage, training and inference costs; set budget alerts
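For the system-performance row, services that are not already instrumented (Triton exports these metrics out of the box; see Tooling Summary) can publish latency and memory metrics with the prometheus_client Python library. A minimal, Unix-only sketch; the metric names, port, and dummy model are illustrative, not prescribed by the article:

```python
import resource
import time
from prometheus_client import Gauge, Histogram, start_http_server

# Illustrative metric names; Prometheus scrapes them from http://<host>:8000/metrics.
INFERENCE_LATENCY = Histogram(
    "inference_latency_seconds", "Time spent serving one prediction request")
MAX_RSS_KILOBYTES = Gauge(
    "process_max_resident_memory_kilobytes",
    "Peak resident memory of the serving process (ru_maxrss; kilobytes on Linux)")

def predict(features):
    with INFERENCE_LATENCY.time():  # records request latency into the histogram
        time.sleep(0.01)            # stand-in for real model inference
        return 0.5

if __name__ == "__main__":
    start_http_server(8000)         # exposes plain-text metrics at /metrics
    while True:
        predict([1.0, 2.0])
        MAX_RSS_KILOBYTES.set(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
```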
Tooling Summary
Prometheus (open-source): Time-series metrics scraping and alerting; NVIDIA Triton exports metrics in this format natively
Grafana (open-source): Visualization dashboards on top of Prometheus data
Evidently AI (open-source Python): Data drift detection, data quality reports, model performance analysis
Amazon SageMaker Model Monitor (managed): Automated model quality monitoring with alerting; no-code and custom modes
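To see Triton's native Prometheus export without standing up the full Prometheus + Grafana stack, the metrics endpoint can be read directly. A minimal sketch assuming a default local deployment: Triton serves plain Prometheus text on port 8002 at /metrics, and its inference and GPU series are typically prefixed nv_inference_ and nv_gpu_ (adjust host, port, and prefixes for your setup):

```python
from urllib.request import urlopen

# Triton Inference Server exposes Prometheus-format metrics at /metrics
# on port 8002 by default (adjust for your deployment).
TRITON_METRICS_URL = "http://localhost:8002/metrics"

with urlopen(TRITON_METRICS_URL) as response:
    text = response.read().decode("utf-8")

# Print only inference- and GPU-related series, skipping # HELP / # TYPE comment lines.
for line in text.splitlines():
    if line.startswith(("nv_inference_", "nv_gpu_")):
        print(line)
```

In a production setup, the same endpoint would instead be listed as a scrape target in the Prometheus configuration and visualized through Grafana dashboards.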
Best Practices
Monitoring starts during development, not deployment
Document a troubleshooting framework so teams move from alert to action systematically
Have a break-glass plan of action ready before failures occur
Use prediction drift or proxy metrics when ground truth labels are unavailable in real time
Terminology
Data Drift: Statistical distribution shift in production input features relative to training data
Model Drift: Decay of predictive power over time due to real-world changes; distinct from data drift (data drift is one possible cause; model drift is the observed effect)
Prediction Drift: Distributional shift in model outputs; a proxy for model quality when ground truth is unavailable
KPI (Key Performance Indicator): Business-level metric used to evaluate whether model outputs meet operational targets
Proxy Metric: A measurable substitute used when the true metric (e.g. ground truth accuracy) is unavailable or delayed
Connections to Existing Wiki Pages
Observability Concepts (LangSmith) — LangSmith’s traces/runs/feedback hierarchy is an LLM-agent-specific implementation of the functional monitoring layer described here; the two articles are complementary in scope
AI Agent Evaluation — Summary — agent-specific evaluation metrics (task completion, tool-call accuracy, LLM-as-a-judge) extend the general ML monitoring framework here to the agentic domain
Monitoring ML Models: Data Quality and Integrity — a companion article (by Evidently AI) focused specifically on input data quality monitoring, covering the data processing issues, schema changes, and data loss categories introduced here
Triton Inference Server Backend — NVIDIA Triton exports system metrics (GPU/CPU use, memory, latency) natively in Prometheus format, directly enabling the operational-level monitoring stack described here