Monitoring ML Models in Production: Data Quality and Integrity

Abstract

A Towards Data Science article by Evidently AI co-founders Emeli Dral and Elena Samuylova covering input data quality and integrity as the first line of defense for production ML monitoring. The article frames data issues in two categories — failures caused by internal pipeline problems, and changes caused by legitimate environmental evolution — and enumerates four root causes: data processing failures (broken pipelines, bad SQL, feature code bugs), data schema changes (new/renamed/reordered fields, altered data types), data loss at the source (sensor failures, API outages, logging bugs), and broken upstream models (cascading errors in model pipelines where one model’s bad output is another’s corrupted input). The monitoring framework then prescribes five specific checks: model call volume (basic liveness check), data schema validation (feature set completeness and type consistency), missing data detection (per-feature share against acceptable thresholds, with key-feature priority tiers), feature value range and statistics (numerical bounds, quantile analysis, categorical frequency), and per-step feature processing validation (isolating corruption to specific pipeline stages). The core argument is that ML-specific data quality monitoring must focus on the precise data slice consumed by a given model, not aggregate warehouse quality, and that catching data issues early enables model updates or pauses before performance degradation becomes visible.


Key Concepts

  • Data Quality as First Line of Defense: Input data problems cause model failures silently — the model may run and return outputs without exceptions while operating on corrupted input; data quality monitoring catches these before model performance metrics degrade
  • Data Processing Failures: Pipeline-level errors including wrong data source versions, lost access permissions, bad SQL queries, infrastructure updates breaking hard-coded column names, and feature code corner cases; may cause outright crashes or, worse, silent incorrect execution via try/except clauses
  • Data Schema Changes: Legitimate business or operational updates to data source structure — new fields, renamed columns, changed data types, reordered fields, altered categorical hierarchies — that invalidate model feature assumptions without any error signal
  • Data Loss at Source: Physical sensor failures, API outages, or logging bugs that reduce or eliminate input data; corrupted sources (e.g. sensors returning a constant stale value) are harder to detect than complete outages
  • Broken Upstream Models: In model pipelines where one model’s output feeds another, a corrupted upstream prediction propagates as a corrupted feature; produces interconnected failure loops
  • Key-Feature Priority Tiers: Monitoring policy where critical features (by feature importance or domain knowledge) require presence to run the model; auxiliary features are noted but are not show-stoppers
  • Per-Step Pipeline Validation: Validating inputs and outputs at each transformation step in complex pipelines to localize the corruption source, rather than only checking the final feature vector

Key Claims and Findings

  • Data quality monitoring is the most universally applicable monitoring check — it should be implemented for every model, regardless of complexity, as a basic health check comparable to latency monitoring
  • The goal is to monitor the specific data slice consumed by a given model, not general warehouse data quality — even 99% correct warehouse data may mask the 1% that feeds your model
  • Missing values appear in many forms: empty, “N/A,” “NaN,” “999,” “unknown” — simplistic null checks miss encoded missing values; exhaustive scanning against standard missing-value expressions is required
  • A broken upstream model (corrupted prediction → corrupted input feature) creates self-reinforcing failure loops in recommendation and routing systems; early detection requires monitoring call volumes at every pipeline node
  • Catching data issues early allows recovery: if the source cannot be fixed immediately, the model can be paused or replaced with a fallback rather than continuing to generate wrong outputs

Five Data Quality Monitoring Checks

1. Model Call Volume

Track the number of model requests and responses separately. A drop to zero indicates a service failure. Divergence between requests and responses reveals timeout or fallback patterns. This check is only useful when a stable “normal” usage pattern exists.
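
As an illustrative sketch (not from the article), the check can be expressed as a comparison of one time window's request and response counts against a historical baseline; the threshold values and the CallVolumeAlert structure are assumptions for demonstration:

```python
from dataclasses import dataclass

@dataclass
class CallVolumeAlert:
    window: str
    requests: int
    responses: int
    message: str

def check_call_volume(window: str, requests: int, responses: int,
                      baseline_requests: float,
                      drop_tolerance: float = 0.5,
                      response_gap_tolerance: float = 0.05) -> list[CallVolumeAlert]:
    """Compare one time window's call counts against a historical baseline."""
    alerts = []
    if requests == 0:
        # No requests at all: the service or its caller is likely down.
        alerts.append(CallVolumeAlert(window, requests, responses, "no model requests received"))
    elif requests < baseline_requests * drop_tolerance:
        # A large drop below the baseline hints at a broken upstream caller.
        alerts.append(CallVolumeAlert(window, requests, responses, "request volume far below baseline"))
    if requests and (requests - responses) / requests > response_gap_tolerance:
        # Requests without matching responses point at timeouts or fallback usage.
        alerts.append(CallVolumeAlert(window, requests, responses, "responses lag requests (timeouts or fallbacks)"))
    return alerts

# Example: check_call_volume("2024-05-01T10:00", requests=120, responses=90, baseline_requests=1000)
```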

2. Data Schema Validation

Perform feature-by-feature checks: feature count (missing or extra columns), data type consistency (categorical vs. numerical), and field naming. The goal is to confirm that the shape of the incoming dataset matches training expectations.
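
A minimal sketch of such a check, assuming features arrive as a pandas DataFrame and that a reference_dtypes mapping was captured from the training data (both the function and variable names are hypothetical):

```python
import pandas as pd

def validate_schema(batch: pd.DataFrame, reference_dtypes: dict[str, str]) -> list[str]:
    """Compare an incoming batch against column names and dtypes captured at training time."""
    issues = []
    expected, actual = set(reference_dtypes), set(batch.columns)
    for col in sorted(expected - actual):
        # Missing or renamed columns break the feature set outright.
        issues.append(f"missing column: {col}")
    for col in sorted(actual - expected):
        # Extra columns often signal an upstream schema change (new or renamed fields).
        issues.append(f"unexpected column: {col}")
    for col in sorted(expected & actual):
        # A type change (e.g. a numeric column arriving as strings) silently corrupts features.
        if str(batch[col].dtype) != reference_dtypes[col]:
            issues.append(f"dtype mismatch for {col}: expected {reference_dtypes[col]}, got {batch[col].dtype}")
    return issues

# Reference captured once, on the training dataframe:
# reference_dtypes = {col: str(dtype) for col, dtype in train_df.dtypes.items()}
```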

3. Missing Data Detection

Track the missing-value share per feature and compare it against acceptable baseline thresholds. Scan for all standard missing-value encodings (empty, “N/A,” “NaN,” “999,” “unknown”). Distinguish critical features (the model cannot run without them) from auxiliary features (absence is noted, not a blocker).
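
A sketch of the per-feature missing-share check, assuming pandas input; the token list, thresholds, and critical-feature set are illustrative placeholders, not values from the article:

```python
import pandas as pd

# Encodings to treat as missing beyond plain nulls (list is illustrative, not exhaustive).
MISSING_TOKENS = {"", "n/a", "na", "nan", "null", "none", "unknown", "999", "-999"}

def missing_share(batch: pd.DataFrame) -> pd.Series:
    """Share of missing values per feature, counting encoded missing tokens."""
    as_text = batch.astype(str).apply(lambda s: s.str.strip().str.lower())
    masked = batch.isna() | as_text.isin(MISSING_TOKENS)
    return masked.mean()

def check_missing(batch: pd.DataFrame,
                  thresholds: dict[str, float],
                  critical: set[str]) -> tuple[bool, dict[str, float]]:
    """Return (can_run_model, offending feature -> missing share)."""
    shares = missing_share(batch)
    offending = {f: float(s) for f, s in shares.items() if s > thresholds.get(f, 0.0)}
    # Missing critical features block the model; auxiliary features are only logged.
    can_run = not any(f in critical for f in offending)
    return can_run, offending
```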

4. Feature Value Range and Statistics

For numerical features: min/max range bounds and key statistics (mean, quantiles). Quantile analysis catches sensors stuck at a constant value even when the value is technically in-range. For categorical features: distribution of category frequencies; flag novel categories not seen in training.
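
The following sketch illustrates both the numerical and categorical variants; it assumes pandas Series per feature, reference statistics captured at training time, and an arbitrary 25% quantile-shift tolerance:

```python
import pandas as pd

def check_numeric_feature(values: pd.Series, low: float, high: float,
                          train_quantiles: dict[float, float],
                          tolerance: float = 0.25) -> list[str]:
    """Range and quantile checks for one numerical feature."""
    issues = []
    if values.min() < low or values.max() > high:
        issues.append("values outside the expected [low, high] range")
    if values.nunique() <= 1:
        # A sensor stuck at a constant in-range value collapses the distribution.
        issues.append("feature is constant in this batch")
    for q, expected in train_quantiles.items():
        observed = values.quantile(q)
        if expected and abs(observed - expected) / abs(expected) > tolerance:
            issues.append(f"quantile {q} shifted from {expected:.3g} to {observed:.3g}")
    return issues

def check_categorical_feature(values: pd.Series, train_categories: set[str]) -> list[str]:
    """Flag category values never seen in training."""
    novel = set(values.dropna().unique()) - train_categories
    return [f"novel categories: {sorted(novel)}"] if novel else []
```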

5. Feature Processing Validation

In complex, multi-step pipelines, validate inputs and outputs at each transformation step, not only the final output. This isolates whether corruption originates in the source data or in the feature calculation code.
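
A sketch of the idea, assuming the pipeline is a list of named (transform, validate) pairs operating on pandas DataFrames; all step and validator names in the commented example are hypothetical:

```python
from typing import Callable
import pandas as pd

# Each step: (name, transform function, validator returning a list of issues).
Step = tuple[str, Callable[[pd.DataFrame], pd.DataFrame], Callable[[pd.DataFrame], list[str]]]

def run_with_step_validation(raw: pd.DataFrame, steps: list[Step]) -> pd.DataFrame:
    """Apply each transformation step and validate its output before moving on."""
    data = raw
    for name, transform, validate in steps:
        data = transform(data)
        issues = validate(data)
        if issues:
            # Failing here localizes the corruption to this specific step,
            # rather than only observing a bad final feature vector.
            raise ValueError(f"step '{name}' produced invalid output: {issues}")
    return data

# steps = [
#     ("join_user_profile", join_profiles, validate_schema_step),
#     ("aggregate_events",  aggregate_events, validate_ranges_step),
#     ("encode_categories", encode_categories, validate_missing_step),
# ]
```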


Root Cause Taxonomy

Category | Example | Detection
Wrong source version | Pipeline points to an outdated table | Schema / version check
Lost access | Table moved, permissions not updated | Call volume drops to zero
Bad query | Edge-case SQL fails (timezone, null join) | Schema mismatch on specific features
Infrastructure update | Column names lowercased, spaces → underscores | Schema validation failure
Broken feature code | Corner-case input triggers wrong calculation | Feature value range violation
Schema change | Categorical hierarchy reorganized | New category values / type mismatch
Source data loss | Sensor failure, API outage | Missing data spike, call volume drop
Broken upstream model | Recommender feeds bad scores to ranker | Prediction drift in downstream model

Terminology

  • SHAP / Shapley Values: Game-theoretic method for attributing model predictions to input features; useful for ranking feature importance when building key-feature priority tiers (see the sketch after this list)
  • Data Validation Threshold: Per-feature acceptable missing-value share above which the model is paused or a fallback is used
  • Fallback Mechanism: A simpler rule-based or static response used when the model cannot run due to data quality failures
  • Evidently AI: Open-source Python library (by the article authors) providing drift detection and data quality reports for production ML systems
  • Batch vs Streaming Inference: Batch inference (periodic bulk scoring) has more tolerance for pipeline failures — if caught early, runs can be repeated; streaming inference requires near-real-time detection
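
As referenced in the SHAP entry above, a small sketch of ranking features to build priority tiers; it assumes a tree-based regressor and the shap package, and the top-k cutoff in the usage comment is an arbitrary illustration:

```python
import numpy as np
import pandas as pd
import shap  # pip install shap

def rank_features_by_shap(model, X: pd.DataFrame) -> pd.Series:
    """Mean absolute SHAP value per feature, sorted descending."""
    explainer = shap.TreeExplainer(model)   # assumes a tree-based model; other explainers exist
    shap_values = explainer.shap_values(X)  # shape (n_samples, n_features) for a regressor
    importance = np.abs(shap_values).mean(axis=0)
    return pd.Series(importance, index=X.columns).sort_values(ascending=False)

# Critical tier = the top-k features (or those above a chosen importance share);
# everything else is auxiliary: its absence is logged but does not block the model.
# importance = rank_features_by_shap(model, X_train)
# critical_features = set(importance.head(10).index)
```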

Connections to Existing Wiki Pages

  • A Guide to Monitoring Machine Learning Models in Production — the companion NVIDIA article that frames the broader functional/operational monitoring landscape; this article provides the detailed implementation of the input data monitoring component described there
  • AI Agent Evaluation — Summary — the “parameter value grounding” and “hallucinated parameter” function-calling metrics described there are agent-level manifestations of the same data integrity failures covered here: the agent receives or generates inputs that do not match expected schemas or value ranges
  • Observability Concepts (LangSmith) — LangSmith’s run-level metadata and feedback are the LLM application equivalent of the per-step pipeline validation checks described here; both approaches aim to localize data corruption to specific pipeline stages