AI Agents in Production: Observability & Evaluation

Abstract

Lesson 10 from Microsoft’s open-source “AI Agents for Beginners” course covers the operational shift from experimental prototypes to production-ready AI agents. The article frames observability — the practice of emitting and analyzing traces and spans — as the mechanism that converts “black box” agents into “glass boxes,” and argues it is not a nice-to-have but a critical production capability. Four production use cases for observability are given: debugging and root-cause analysis, latency and cost management, trust/safety/compliance auditing, and feeding continuous improvement loops. The article then catalogs the key metrics to track (latency, cost, error rates, user feedback, accuracy, automated scores), explains instrumentation via OpenTelemetry and the Microsoft Agent Framework, and defines and contrasts offline evaluation (controlled datasets, repeatable, ground truth available) with online evaluation (live traffic, catches model drift, relies on implicit/explicit feedback). It closes with a common-issues debugging table covering loop detection, tool failures, and multi-agent inconsistency, plus three cost-management strategies.


Key Concepts

  • Traces and Spans: Core observability primitives. A trace represents a complete agent task from start to finish (e.g. handling a user query). A span is an individual step within the trace (e.g. one LLM call, one tool invocation, one retrieval). Together they make the internal state and reasoning of an agent inspectable — the “glass box” model.
  • Offline Evaluation: Evaluating the agent in a controlled setting against curated test datasets with known correct answers. Repeatable, supports clear accuracy metrics, and can be integrated into CI/CD pipelines (see the offline-evaluation sketch after this list). Key challenge: test sets must be kept up to date with real-world edge cases, or the agent may perform well on fixed benchmarks but fail in production.
  • Online Evaluation: Evaluating the agent on live real-world traffic. Captures model drift over time, surfaces unexpected query types, and provides a true picture of production behavior. Relies on explicit user feedback (ratings, comments) and implicit signals (query rephrasing, retries, click behavior). Can include A/B tests or shadow runs.
  • The Evaluation Loop: The two modes are complementary and form a cycle: evaluate offline → deploy → monitor online → collect new failure cases → add to offline dataset → refine agent → repeat
  • Key Metrics to Track:
    • Latency — response time per task and per step; identify bottlenecks (e.g. 5 sequential LLM calls that could be parallelized; see the parallelization sketch after this list)
    • Cost — expense per agent run; track token counts and API call frequency to catch runaway loops or over-engineered prompt chains
    • Request error rate — API errors, failed tool calls; drive fallback/retry logic design
    • Explicit user feedback — thumbs up/down, star ratings, comments
    • Implicit user feedback — query rephrasing, retry-button clicks, repeated queries
    • Accuracy — task completion labels, automated scores, success/failure marks on traces
    • Automated evaluation metrics — LLM-as-judge scores (e.g. RAGAS for RAG agents, LLM Guard for prompt injection / harmful content)
  • OpenTelemetry (OTel): Industry-standard APIs, SDKs, and tools for generating, collecting, and exporting telemetry data. Has emerged as the standard for LLM observability. Microsoft Agent Framework integrates OTel natively — span creation is automatic, with hooks for custom span attributes (user_id, session_id, model_version). The instrumentation sketch after this list shows the raw span structure.
  • Cost Management Strategies:
    1. Smaller models (SLMs): Use for simpler tasks (intent classification, parameter extraction); reserve large models for complex reasoning. Build an evaluation system to compare SLM vs LLM performance per task type.
    2. Router model: A lightweight LLM/SLM/serverless function classifies request complexity and dispatches to the right model tier — reduces cost while maintaining quality on demanding tasks (see the router sketch after this list).
    3. Response caching: Pre-computed responses for common requests bypass the full agent pipeline entirely; a basic similarity model scores incoming requests against cached ones to identify near-duplicates (see the caching sketch after this list).
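
A minimal sketch of the trace/span structure using the plain OpenTelemetry Python SDK. The Microsoft Agent Framework creates these spans automatically; the hand-rolled version below only illustrates the shape of the data. The helpers call_llm and run_tool, and the model_version value, are hypothetical stand-ins.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Export spans to stdout; production setups point an OTLP exporter at a backend.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent-demo")

def call_llm(prompt: str) -> str:
    return f"response to: {prompt[:40]}"      # stand-in for a real model call

def run_tool(name: str, arg: str) -> str:
    return f"{name} results for: {arg[:40]}"  # stand-in for a real tool invocation

def handle_query(user_id: str, session_id: str, query: str) -> str:
    # The root span is the trace: one complete agent task from query to answer.
    with tracer.start_as_current_span("agent.handle_query") as root:
        root.set_attribute("user_id", user_id)           # custom attributes for filtering
        root.set_attribute("session_id", session_id)
        root.set_attribute("model_version", "model-v1")  # hypothetical version tag
        # Each step (LLM call, tool call) becomes a child span.
        with tracer.start_as_current_span("llm.plan"):
            plan = call_llm(f"Plan steps for: {query}")
        with tracer.start_as_current_span("tool.search") as span:
            span.set_attribute("tool.name", "web_search")
            results = run_tool("web_search", plan)
        with tracer.start_as_current_span("llm.answer"):
            return call_llm(f"Answer using: {results}")

print(handle_query("u-123", "s-456", "What changed in release 2.0?"))
```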
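
A sketch of offline evaluation wired for CI, assuming a hypothetical run_agent entry point, a JSON file of query/expected pairs (eval_cases.json), and exact-match grading; real harnesses would typically swap in LLM-as-judge or RAGAS-style scoring.

```python
import json

def run_agent(query: str) -> str:
    # Stand-in for the agent under test.
    return "Paris" if "capital of France" in query else "unknown"

def evaluate(dataset_path: str, threshold: float = 0.9) -> None:
    with open(dataset_path) as f:
        cases = json.load(f)  # expected shape: [{"query": ..., "expected": ...}, ...]
    passed = sum(run_agent(c["query"]).strip() == c["expected"] for c in cases)
    accuracy = passed / len(cases)
    print(f"accuracy: {accuracy:.2%} ({passed}/{len(cases)})")
    # Fail the CI job if accuracy regresses below the threshold.
    assert accuracy >= threshold, f"offline eval failed: {accuracy:.2%} < {threshold:.0%}"

if __name__ == "__main__":
    evaluate("eval_cases.json")
```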
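
A sketch of the parallelization fix for the latency bottleneck noted under Key Metrics: five independent LLM calls issued concurrently with asyncio instead of sequentially. call_llm_async is a hypothetical async wrapper around a model client.

```python
import asyncio

async def call_llm_async(prompt: str) -> str:
    await asyncio.sleep(1.0)                 # simulate a ~1 s model call
    return f"response to: {prompt}"

async def summarize_docs(docs: list[str]) -> list[str]:
    # Sequential execution would take len(docs) * ~1 s; gathering the
    # independent calls finishes in ~1 s total.
    return await asyncio.gather(*(call_llm_async(f"Summarize: {d}") for d in docs))

print(asyncio.run(summarize_docs([f"doc {i}" for i in range(5)])))
```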
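
A sketch of the router pattern: a cheap classifier assigns each request to a model tier before dispatch. The keyword heuristic stands in for the lightweight LLM/SLM classifier, and the tier and model names are made up.

```python
MODEL_TIERS = {"simple": "small-model", "complex": "large-model"}

def classify_complexity(query: str) -> str:
    # In practice this is an SLM or serverless classifier; a keyword
    # heuristic stands in here.
    hard_markers = ("analyze", "compare", "plan", "multi-step")
    return "complex" if any(m in query.lower() for m in hard_markers) else "simple"

def call_model(model: str, query: str) -> str:
    return f"[{model}] answer to: {query}"   # stand-in for the actual model call

def route(query: str) -> str:
    tier = classify_complexity(query)
    return call_model(MODEL_TIERS[tier], query)

print(route("What are your opening hours?"))          # dispatched to small-model
print(route("Compare these two vendor contracts"))    # dispatched to large-model
```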
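
A sketch of similarity-scored response caching. Bag-of-words cosine similarity stands in for a real embedding model and vector store, and run_full_agent is a hypothetical full agent pipeline.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    return Counter(text.lower().split())     # toy embedding: token counts

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def run_full_agent(query: str) -> str:
    return f"fresh answer to: {query}"       # stand-in for the expensive pipeline

cache: list[tuple[Counter, str]] = []        # (query embedding, cached response)

def answer(query: str, threshold: float = 0.9) -> str:
    q = embed(query)
    for cached_q, response in cache:
        if cosine(q, cached_q) >= threshold:
            return response                  # near-duplicate: skip the agent entirely
    response = run_full_agent(query)
    cache.append((q, response))
    return response

print(answer("reset my password"))           # cache miss: runs the full agent
print(answer("reset my password please"))    # near-duplicate: served from cache
```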

Key Claims and Findings

  • Observability is a “critical capability” in production environments — without it agents are black boxes and cannot be debugged, optimized, or trusted.
  • Debugging complex agents (multiple LLM calls, tool interactions, conditional logic) is only tractable with trace data: traces pinpoint exactly which span caused a failure.
  • Real-time cost monitoring can detect bugs causing excessive API loops — unexpected cost spikes are often the first signal of a runaway agent (see the monitoring sketch after this list).
  • Model drift is detectable only through online evaluation; offline test sets cannot capture it because the distribution shift happens in production, not in the fixed test data.
  • Offline and online evaluations are not alternatives — they are complementary; production failure cases must continuously refresh offline test datasets to keep evaluation meaningful.
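
A sketch of that cost-spike signal: track per-run token spend and flag runs that blow past a rolling baseline. The price constant, window size, spike factor, and alert hook are all illustrative.

```python
from statistics import mean

PRICE_PER_1K_TOKENS = 0.01   # assumed blended input/output price, USD
history: list[float] = []    # cost of recent completed runs

def alert(message: str) -> None:
    print("ALERT:", message)  # stand-in for a real paging/alerting hook

def record_run(total_tokens: int, spike_factor: float = 5.0) -> float:
    cost = total_tokens / 1000 * PRICE_PER_1K_TOKENS
    # Compare against the rolling baseline before recording the new run.
    if len(history) >= 20 and cost > spike_factor * mean(history[-20:]):
        alert(f"possible runaway loop: run cost ${cost:.2f}")
    history.append(cost)
    return cost

for tokens in [1200] * 25 + [90000]:   # steady traffic, then one runaway run
    record_run(tokens)
```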

Common Issues Diagnostic Table

| Issue | Potential Solution |
| --- | --- |
| Agent not performing tasks consistently | Refine the prompt with clear objectives; divide the task into subtasks handled by multiple agents |
| Agent running into continuous loops | Define explicit termination conditions (see the loop-guard sketch below); use a larger model specialized for reasoning |
| Tool calls not performing well | Test and validate the tool’s output independently of the agent; refine tool names, parameters, and prompts |
| Multi-agent system not performing consistently | Make each agent’s prompt specific and distinct; add a routing/controller agent |
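
The loop fix in the table above can start as simply as a hard iteration cap plus an explicit completion signal. A sketch, with step_agent as a hypothetical single reasoning/tool step:

```python
MAX_STEPS = 10

def step_agent(state: str) -> tuple[str, bool]:
    # Stand-in for one LLM/tool step; a real agent sets done when the task is solved.
    new_state = state + " ."
    return new_state, new_state.count(".") >= 3

def run_agent_loop(query: str) -> str:
    state = query
    for _ in range(MAX_STEPS):               # explicit termination condition
        state, done = step_agent(state)
        if done:
            return state
    # Hard stop instead of burning tokens forever; surface the failure in traces.
    raise RuntimeError(f"agent exceeded {MAX_STEPS} steps without terminating")

print(run_agent_loop("Summarize the incident"))
```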

Terminology

  • Trace: Complete end-to-end record of a single agent task execution, from receiving a user request to producing a final output.
  • Span: One instrumented step within a trace — typically one LLM call, one tool invocation, or one retrieval operation.
  • Model Drift: Degradation of agent performance over time as real-world input distribution shifts away from the training distribution; only detectable through online evaluation.
  • SLM (Small Language Model): A smaller, cheaper model capable of handling simpler sub-tasks within an agentic workflow, freeing large models for complex reasoning.
  • Router Model: A lightweight dispatcher that classifies request complexity and routes each query to the most cost-effective model capable of handling it.
  • RAGAS: Open-source evaluation framework for RAG workflows; provides metrics including AnswerAccuracy, ContextRelevance, and ResponseGroundedness.
  • LLM Guard: Open-source library for detecting harmful language and prompt injection in LLM inputs and outputs; used for safety-oriented automated evaluation.

Connections to Existing Wiki Pages

  • Data Flywheel: What It Is and How It Works — the “continuous improvement loop” and enterprise data refinement described in the data flywheel article are the production-scale realization of the offline → deploy → monitor online → refine evaluation cycle described here; observability data is the primary input to the flywheel’s refinement stage
  • NVIDIA NeMo Agent Toolkit: Evaluation — provides a concrete evaluation harness (Ragas, Trajectory, Tunable RAG evaluators; profiler output; Weave dashboard) implementing the offline evaluation workflow and automated metrics described here; the trace/span model described here is the observability foundation the Toolkit instruments
  • Building Autonomous AI with NVIDIA Agentic NeMo — NeMo’s multi-hop latency challenge and TensorRT-LLM/quantization strategies discussed there correspond directly to the latency metric and smaller-model cost management strategies described here