A Guide to Monitoring Machine Learning Models in Production

Abstract

An NVIDIA Developer Blog overview of production ML model monitoring, arguing that monitoring must start before deployment and that ML systems require different monitoring approaches than traditional software. The article distinguishes two stakeholder perspectives — data scientists (concerned with functional objectives: data quality, model accuracy, prediction validity) and engineers (concerned with operational objectives: latency, memory, uptime) — and argues both perspectives must be synthesized in any adequate monitoring strategy. Monitoring is structured across two levels: functional (input data quality and drift, model drift and versioning, output predictions and ground truth) and operational (system performance metrics, pipeline health, cost tracking). Three failure modes specific to ML systems are highlighted: entanglement (any feature change affects all predictions), configuration sensitivity (incorrect hyperparameters/versions produce valid-looking but wrong outputs without raising exceptions), and the responsibility challenge (multiple stakeholders with different monitoring vocabularies). Tooling covered includes Prometheus + Grafana (time-series metrics and dashboards, including NVIDIA Triton’s native Prometheus export), Evidently AI (open-source drift and data quality analysis), and Amazon SageMaker Model Monitor (managed monitoring with no-code and custom analysis modes).


Key Concepts

  • Functional vs Operational Monitoring: Functional monitoring tracks data, model, and output correctness (data scientist’s domain); operational monitoring tracks system resources, pipeline health, and cost (engineer’s domain). Adequate monitoring requires both
  • Data Drift: Changes in the statistical distribution of production input features relative to the training distribution; detected via statistical tests on feature value distributions over time; signals that the model may need retraining
  • Model Drift: Decay of a model’s predictive power due to real-world environment changes; detected by tracking predictive performance metrics over time and applying statistical tests to confirm the decline
  • Prediction Drift: Used as a proxy when ground-truth labels are unavailable — a sudden distributional shift in model outputs signals that something has gone wrong upstream (data change, pipeline failure, or genuine environmental shift)
  • Ground Truth Labels: When obtainable (e.g. click/no-click outcomes), model predictions can be evaluated against actuals; most ML use cases cannot obtain ground truth promptly, necessitating proxy metrics
  • Entanglement: The ML-specific failure mode where changing any input feature distribution affects the approximation of the target function and potentially all predictions — “changing anything changes everything”
  • Configuration Sensitivity: A machine learning system can produce incorrect-but-valid outputs without raising exceptions, unlike traditional software — a misconfigured hyperparameter or wrong model version silently degrades performance
  • Prometheus + Grafana: Open-source event monitoring (Prometheus) and visualization (Grafana) stack; NVIDIA Triton Inference Server exports GPU/CPU use, memory, and latency metrics natively in Prometheus format
  • Evidently AI: Open-source Python library for monitoring and analyzing ML models in production; supports drift detection and data quality reports
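
The drift concepts above all reduce to the same mechanical step: comparing a production feature distribution against a training-time baseline with a statistical test. As an illustrative sketch (not from the article), here is a two-sample Kolmogorov–Smirnov check in pure Python; in practice a library call such as scipy.stats.ks_2samp, or a tool like Evidently AI, would do this work:

```python
import bisect


def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap
    between the empirical CDFs of the two samples."""
    ref = sorted(reference)
    cur = sorted(current)

    def ecdf(sample, x):
        # Fraction of sample values <= x
        return bisect.bisect_right(sample, x) / len(sample)

    all_values = sorted(set(ref) | set(cur))
    return max(abs(ecdf(ref, x) - ecdf(cur, x)) for x in all_values)


def drifted(reference, current, threshold=0.1):
    """Flag a feature as drifted when the KS statistic exceeds a
    hand-picked threshold (in practice, tuned per feature)."""
    return ks_statistic(reference, current) > threshold
```

Identical samples give a statistic of 0.0; completely disjoint samples give 1.0, so the threshold sets how much distributional overlap still counts as "the same".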

Key Claims and Findings

  • ML system monitoring is harder than traditional software monitoring because behavior depends on data and model (not just code), and a system can produce incorrect but structurally valid outputs without error
  • Monitoring should begin during model development and experimentation — not only post-deployment; establishing baselines early improves production monitoring quality
  • Sudden major performance degradation is a red flag requiring immediate investigation; gradual degradation is expected and can be managed with retraining schedules
  • A weak monitoring system leads to: (1) poor-performing models left in production undetected; (2) models that no longer deliver business value; (3) uncaught bugs that compound over time
  • When ground truth is unavailable, prediction drift is a valid proxy for model quality monitoring
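
One common way to quantify prediction drift when ground truth is delayed is the Population Stability Index over binned model outputs. This is a standard industry technique, not one prescribed by the article; a minimal pure-Python sketch:

```python
import math


def psi(reference, current, bins=10):
    """Population Stability Index between two samples of model
    outputs (e.g. predicted probabilities in [0, 1]).

    A common rule of thumb (industry convention, not from the
    article): PSI < 0.1 stable, 0.1-0.25 moderate shift,
    > 0.25 major shift worth investigating.
    """
    lo = min(min(reference), min(current))
    hi = max(max(reference), max(current))
    width = (hi - lo) / bins or 1.0  # guard against a zero-width range

    def bin_fractions(sample):
        counts = [0] * bins
        for x in sample:
            idx = min(int((x - lo) / width), bins - 1)
            counts[idx] += 1
        # Small floor avoids log(0) / division by zero for empty bins
        return [max(c / len(sample), 1e-6) for c in counts]

    ref_pct = bin_fractions(reference)
    cur_pct = bin_fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_pct, cur_pct))
```

Identical output distributions score 0; a model whose score distribution collapses to a different region scores far above the 0.25 alarm level.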

Two-Level Monitoring Framework

Functional Level

Category | What to Monitor
Input data | Data quality (type consistency, schema validity), data drift (feature distribution shifts)
Model | Model drift (predictive power decay), correct version in production
Output | Ground truth comparison (when available), prediction drift (when ground truth is unavailable)
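
The input-data row above names type consistency and schema validity as the things to check. A minimal sketch of such a check, with a hypothetical schema (field names and rules invented for illustration):

```python
# Hypothetical input schema: field name -> (expected type, nullable)
SCHEMA = {
    "user_id": (int, False),
    "session_length": (float, True),
    "country": (str, False),
}


def validate_record(record, schema=SCHEMA):
    """Return a list of schema violations for one input record;
    an empty list means the record passes."""
    errors = []
    for field, (expected_type, nullable) in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
            continue
        value = record[field]
        if value is None:
            if not nullable:
                errors.append(f"null in non-nullable field: {field}")
        elif not isinstance(value, expected_type):
            errors.append(f"{field}: expected {expected_type.__name__}, "
                          f"got {type(value).__name__}")
    return errors
```

A monitoring job would run this per record (or per batch) and alert when the violation rate rises, which catches silent upstream schema changes before they reach the model.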

Operational Level

Category | What to Monitor
System performance | Memory use, latency, CPU/GPU utilization
Pipelines | Data pipeline health (failures, missing data), model pipeline dependencies
Costs | Cloud resource usage, training and inference costs; set budget alerts
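
For the system-performance row, a toy in-process latency tracker illustrates the idea of alerting against a percentile budget. This is illustrative only (window size and budget below are arbitrary); in a real deployment these values would be exported as Prometheus metrics rather than computed in-process:

```python
from collections import deque


class LatencyMonitor:
    """Rolling window of request latencies with a p99 budget check."""

    def __init__(self, window=1000, p99_budget_ms=250.0):
        self.samples = deque(maxlen=window)  # oldest samples drop off
        self.p99_budget_ms = p99_budget_ms

    def record(self, latency_ms):
        self.samples.append(latency_ms)

    def percentile(self, pct):
        # Nearest-rank percentile over the current window
        ordered = sorted(self.samples)
        idx = max(0, int(len(ordered) * pct / 100.0 + 0.5) - 1)
        return ordered[idx]

    def breaches_budget(self):
        """True when the observed p99 exceeds the latency budget."""
        return self.percentile(99) > self.p99_budget_ms
```

Tracking a tail percentile rather than the mean is the usual choice here: a small fraction of very slow requests moves p99 immediately while barely touching the average.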

Tooling Summary

Tool | Type | Key Use
Prometheus | Open-source | Time-series metrics scraping and alerting; NVIDIA Triton exports natively
Grafana | Open-source | Visualization dashboards on top of Prometheus data
Evidently AI | Open-source Python | Data drift detection, data quality reports, model performance analysis
Amazon SageMaker Model Monitor | Managed | Automated model quality monitoring with alerting; no-code and custom modes
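
Because Triton exports metrics natively in Prometheus format, wiring the two together takes only a scrape config. A minimal sketch, assuming Triton's default metrics endpoint on port 8002 and a single local instance:

```yaml
# prometheus.yml -- minimal scrape config (sketch; target assumes
# Triton's default metrics endpoint at localhost:8002/metrics)
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "triton"
    static_configs:
      - targets: ["localhost:8002"]
```

Grafana then points at Prometheus as a data source and dashboards the scraped GPU/CPU, memory, and latency series.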

Best Practices

  1. Monitoring starts during development, not deployment
  2. Major performance degradation warrants immediate investigation; gradual decline triggers scheduled retraining
  3. Document a troubleshooting framework so teams move from alert to action systematically
  4. Have a break-glass plan of action ready before failures occur
  5. Use prediction drift or proxy metrics when ground truth labels are unavailable in real time

Terminology

  • Data Drift: Statistical distribution shift in production input features relative to training data
  • Model Drift: Decay of predictive power over time due to real-world changes; distinct from data drift (data drift is one possible cause; model drift is the observed effect)
  • Prediction Drift: Distributional shift in model outputs; a proxy for model quality when ground truth is unavailable
  • KPI (Key Performance Indicator): Business-level metric used to evaluate whether model outputs meet operational targets
  • Proxy Metric: A measurable substitute used when the true metric (e.g. ground truth accuracy) is unavailable or delayed

Connections to Existing Wiki Pages

  • Observability Concepts (LangSmith) — LangSmith’s traces/runs/feedback hierarchy is an LLM-agent-specific implementation of the functional monitoring layer described here; the two articles are complementary in scope
  • AI Agent Evaluation — Summary — agent-specific evaluation metrics (task completion, tool-call accuracy, LLM-as-a-judge) extend the general ML monitoring framework here to the agentic domain
  • Monitoring ML Models: Data Quality and Integrity — a companion article (by Evidently AI) focused specifically on input data quality monitoring, covering the data processing issues, schema changes, and data loss categories introduced here
  • Triton Inference Server Backend — NVIDIA Triton exports system metrics (GPU/CPU use, memory, latency) natively in Prometheus format, directly enabling the operational-level monitoring stack described here