Monitoring ML Models in Production: Data Quality and Integrity

Abstract

A Towards Data Science article by Evidently AI co-founders Emeli Dral and Elena Samuylova covering input data quality and integrity as the first line of defense for production ML monitoring. The article frames data issues in two categories — failures caused by internal pipeline problems, and changes caused by legitimate environmental evolution — and enumerates four root causes: data processing failures (broken pipelines, bad SQL, feature code bugs), data schema changes (new/renamed/reordered fields, altered data types), data loss at the source (sensor failures, API outages, logging bugs), and broken upstream models (cascading errors in model pipelines where one model’s bad output is another’s corrupted input). The monitoring framework then prescribes five specific checks: model call volume (basic liveness check), data schema validation (feature set completeness and type consistency), missing data detection (per-feature share against acceptable thresholds, with key-feature priority tiers), feature value range and statistics (numerical bounds, quantile analysis, categorical frequency), and per-step feature processing validation (isolating corruption to specific pipeline stages). The core argument is that ML-specific data quality monitoring must focus on the precise data slice consumed by a given model, not aggregate warehouse quality, and that catching data issues early enables model updates or pauses before performance degradation becomes visible.


Key Concepts

  • Data Quality as First Line of Defense: Input data problems cause model failures silently — the model may run and return outputs without exceptions while operating on corrupted input; data quality monitoring catches these before model performance metrics degrade
  • Data Processing Failures: Pipeline-level errors including wrong data source versions, lost access permissions, bad SQL queries, infrastructure updates breaking hard-coded column names, and feature code corner cases; may cause outright crashes or, worse, silent incorrect execution via try/except clauses
  • Data Schema Changes: Legitimate business or operational updates to data source structure — new fields, renamed columns, changed data types, reordered fields, altered categorical hierarchies — that invalidate model feature assumptions without any error signal
  • Data Loss at Source: Physical sensor failures, API outages, or logging bugs that reduce or eliminate input data; corrupted sources (e.g. sensors returning a constant stale value) are harder to detect than complete outages
  • Broken Upstream Models: In model pipelines where one model’s output feeds another, a corrupted upstream prediction propagates as a corrupted feature; produces interconnected failure loops
  • Key-Feature Priority Tiers: Monitoring policy where critical features (by feature importance or domain knowledge) require presence to run the model; auxiliary features are noted but are not show-stoppers
  • Per-Step Pipeline Validation: Validating inputs and outputs at each transformation step in complex pipelines to localize the corruption source, rather than only checking the final feature vector

Key Claims and Findings

  • Data quality monitoring is the most universally applicable monitoring check — it should be implemented for every model, regardless of complexity, as a basic health check comparable to latency monitoring
  • The goal is to monitor the specific data slice consumed by a given model, not general warehouse data quality — even 99% correct warehouse data may mask the 1% that feeds your model
  • Missing values appear in many forms: empty, “N/A,” “NaN,” “999,” “unknown” — simplistic null checks miss encoded missing values; exhaustive scanning against standard missing-value expressions is required
  • A broken upstream model (corrupted prediction → corrupted input feature) creates self-reinforcing failure loops in recommendation and routing systems; early detection requires monitoring call volumes at every pipeline node
  • Catching data issues early allows recovery: if the source cannot be fixed immediately, the model can be paused or replaced with a fallback rather than continuing to generate wrong outputs

Five Data Quality Monitoring Checks

1. Model Call Volume

Track the number of model requests and responses separately. A drop to zero indicates a service failure. Divergence between requests and responses reveals timeout or fallback patterns. This check is only useful when a stable “normal” usage pattern exists.
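
As an illustrative sketch (not from the article), the check can be expressed as a comparison of one time window's request and response counts against a historical baseline; the threshold values and the CallVolumeAlert structure are assumptions for demonstration:

```python
from dataclasses import dataclass

@dataclass
class CallVolumeAlert:
    window: str
    requests: int
    responses: int
    message: str

def check_call_volume(window: str, requests: int, responses: int,
                      baseline_requests: float,
                      drop_tolerance: float = 0.5,
                      response_gap_tolerance: float = 0.05) -> list[CallVolumeAlert]:
    """Compare one time window's call counts against a historical baseline."""
    alerts = []
    if requests == 0:
        # No requests at all: the service or its caller is likely down.
        alerts.append(CallVolumeAlert(window, requests, responses, "no model requests received"))
    elif requests < baseline_requests * drop_tolerance:
        # A large drop below the baseline hints at a broken upstream caller.
        alerts.append(CallVolumeAlert(window, requests, responses, "request volume far below baseline"))
    if requests and (requests - responses) / requests > response_gap_tolerance:
        # Requests without matching responses point at timeouts or fallback usage.
        alerts.append(CallVolumeAlert(window, requests, responses, "responses lag requests (timeouts or fallbacks)"))
    return alerts

# Example: check_call_volume("2024-05-01T10:00", requests=120, responses=90, baseline_requests=1000)
```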

2. Data Schema Validation

Perform feature-by-feature checks: feature count (missing or extra columns), data type consistency (categorical vs. numerical), and field naming. The goal is to confirm that the shape of the incoming dataset matches training expectations.
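
A minimal sketch of such a check, assuming features arrive as a pandas DataFrame and that a reference_dtypes mapping was captured from the training data (both the function and variable names are hypothetical):

```python
import pandas as pd

def validate_schema(batch: pd.DataFrame, reference_dtypes: dict[str, str]) -> list[str]:
    """Compare an incoming batch against column names and dtypes captured at training time."""
    issues = []
    expected, actual = set(reference_dtypes), set(batch.columns)
    for col in sorted(expected - actual):
        # Missing or renamed columns break the feature set outright.
        issues.append(f"missing column: {col}")
    for col in sorted(actual - expected):
        # Extra columns often signal an upstream schema change (new or renamed fields).
        issues.append(f"unexpected column: {col}")
    for col in sorted(expected & actual):
        # A type change (e.g. a numeric column arriving as strings) silently corrupts features.
        if str(batch[col].dtype) != reference_dtypes[col]:
            issues.append(f"dtype mismatch for {col}: expected {reference_dtypes[col]}, got {batch[col].dtype}")
    return issues

# Reference captured once, on the training dataframe:
# reference_dtypes = {col: str(dtype) for col, dtype in train_df.dtypes.items()}
```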

3. Missing Data Detection

Track the missing-value share per feature and compare it against acceptable baseline thresholds. Scan for all standard missing-value encodings (empty, “N/A,” “NaN,” “999,” “unknown”). Distinguish critical features (the model cannot run without them) from auxiliary features (absence is noted, not a blocker).
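
A sketch of the per-feature missing-share check, assuming pandas input; the token list, thresholds, and critical-feature set are illustrative placeholders, not values from the article:

```python
import pandas as pd

# Encodings to treat as missing beyond plain nulls (list is illustrative, not exhaustive).
MISSING_TOKENS = {"", "n/a", "na", "nan", "null", "none", "unknown", "999", "-999"}

def missing_share(batch: pd.DataFrame) -> pd.Series:
    """Share of missing values per feature, counting encoded missing tokens."""
    as_text = batch.astype(str).apply(lambda s: s.str.strip().str.lower())
    masked = batch.isna() | as_text.isin(MISSING_TOKENS)
    return masked.mean()

def check_missing(batch: pd.DataFrame,
                  thresholds: dict[str, float],
                  critical: set[str]) -> tuple[bool, dict[str, float]]:
    """Return (can_run_model, offending feature -> missing share)."""
    shares = missing_share(batch)
    offending = {f: float(s) for f, s in shares.items() if s > thresholds.get(f, 0.0)}
    # Missing critical features block the model; auxiliary features are only logged.
    can_run = not any(f in critical for f in offending)
    return can_run, offending
```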

4. Feature Value Range and Statistics

For numerical features: min/max range bounds and key statistics (mean, quantiles). Quantile analysis catches sensors stuck at a constant value even when the value is technically in-range. For categorical features: distribution of category frequencies; flag novel categories not seen in training.
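
The following sketch illustrates both the numerical and categorical variants; it assumes pandas Series per feature, reference statistics captured at training time, and an arbitrary 25% quantile-shift tolerance:

```python
import pandas as pd

def check_numeric_feature(values: pd.Series, low: float, high: float,
                          train_quantiles: dict[float, float],
                          tolerance: float = 0.25) -> list[str]:
    """Range and quantile checks for one numerical feature."""
    issues = []
    if values.min() < low or values.max() > high:
        issues.append("values outside the expected [low, high] range")
    if values.nunique() <= 1:
        # A sensor stuck at a constant in-range value collapses the distribution.
        issues.append("feature is constant in this batch")
    for q, expected in train_quantiles.items():
        observed = values.quantile(q)
        if expected and abs(observed - expected) / abs(expected) > tolerance:
            issues.append(f"quantile {q} shifted from {expected:.3g} to {observed:.3g}")
    return issues

def check_categorical_feature(values: pd.Series, train_categories: set[str]) -> list[str]:
    """Flag category values never seen in training."""
    novel = set(values.dropna().unique()) - train_categories
    return [f"novel categories: {sorted(novel)}"] if novel else []
```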

5. Feature Processing Validation

In complex, multi-step pipelines, validate inputs and outputs at each transformation step, not only the final output. This isolates whether corruption originates in the source data or in the feature calculation code.
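
A sketch of the idea, assuming the pipeline is a list of named (transform, validate) pairs operating on pandas DataFrames; all step and validator names in the commented example are hypothetical:

```python
from typing import Callable
import pandas as pd

# Each step: (name, transform function, validator returning a list of issues).
Step = tuple[str, Callable[[pd.DataFrame], pd.DataFrame], Callable[[pd.DataFrame], list[str]]]

def run_with_step_validation(raw: pd.DataFrame, steps: list[Step]) -> pd.DataFrame:
    """Apply each transformation step and validate its output before moving on."""
    data = raw
    for name, transform, validate in steps:
        data = transform(data)
        issues = validate(data)
        if issues:
            # Failing here localizes the corruption to this specific step,
            # rather than only observing a bad final feature vector.
            raise ValueError(f"step '{name}' produced invalid output: {issues}")
    return data

# steps = [
#     ("join_user_profile", join_profiles, validate_schema_step),
#     ("aggregate_events",  aggregate_events, validate_ranges_step),
#     ("encode_categories", encode_categories, validate_missing_step),
# ]
```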


Root Cause Taxonomy

Category | Example | Detection
Wrong source version | Pipeline points to an outdated table | Schema / version check
Lost access | Table moved, permissions not updated | Call volume drops to zero
Bad query | Edge-case SQL fails (timezone, null join) | Schema mismatch on specific features
Infrastructure update | Column names lowercased, spaces → underscores | Schema validation failure
Broken feature code | Corner-case input triggers wrong calculation | Feature value range violation
Schema change | Categorical hierarchy reorganized | New category values / type mismatch
Source data loss | Sensor failure, API outage | Missing data spike, call volume drop
Broken upstream model | Recommender feeds bad scores to ranker | Prediction drift in downstream model

Terminology

  • SHAP / Shapley Values: Game-theoretic method for attributing model predictions to input features; useful for ranking feature importance when building key-feature priority tiers (see the sketch after this list)
  • Data Validation Threshold: Per-feature acceptable missing-value share above which the model is paused or a fallback is used
  • Fallback Mechanism: A simpler rule-based or static response used when the model cannot run due to data quality failures
  • Evidently AI: Open-source Python library (by the article authors) providing drift detection and data quality reports for production ML systems
  • Batch vs Streaming Inference: Batch inference (periodic bulk scoring) has more tolerance for pipeline failures — if caught early, runs can be repeated; streaming inference requires near-real-time detection
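
As referenced in the SHAP entry above, a small sketch of ranking features to build priority tiers; it assumes a tree-based regressor and the shap package, and the top-k cutoff in the usage comment is an arbitrary illustration:

```python
import numpy as np
import pandas as pd
import shap  # pip install shap

def rank_features_by_shap(model, X: pd.DataFrame) -> pd.Series:
    """Mean absolute SHAP value per feature, sorted descending."""
    explainer = shap.TreeExplainer(model)   # assumes a tree-based model; other explainers exist
    shap_values = explainer.shap_values(X)  # shape (n_samples, n_features) for a regressor
    importance = np.abs(shap_values).mean(axis=0)
    return pd.Series(importance, index=X.columns).sort_values(ascending=False)

# Critical tier = the top-k features (or those above a chosen importance share);
# everything else is auxiliary: its absence is logged but does not block the model.
# importance = rank_features_by_shap(model, X_train)
# critical_features = set(importance.head(10).index)
```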

Connections to Existing Wiki Pages

  • A Guide to Monitoring Machine Learning Models in Production — the companion NVIDIA article that frames the broader functional/operational monitoring landscape; this article provides the detailed implementation of the input data monitoring component described there
  • AI Agent Evaluation — Summary — the “parameter value grounding” and “hallucinated parameter” function-calling metrics described there are agent-level manifestations of the same data integrity failures covered here: the agent receives or generates inputs that do not match expected schemas or value ranges
  • Observability Concepts (LangSmith) — LangSmith’s run-level metadata and feedback are the LLM application equivalent of the per-step pipeline validation checks described here; both approaches aim to localize data corruption to specific pipeline stages