NVIDIA NeMo Agent Toolkit: Agent Evaluation

See also: NVIDIA NeMo Agent Toolkit (NVIDIA Platform view)

Abstract

This page summarizes the evaluation documentation from NVIDIA’s open-source NeMo Agent Toolkit (NAT), specifically the evaluate.md reference in the NeMo-Agent-Toolkit GitHub repository. NAT provides a complete evaluation harness for agentic workflows: the nat eval CLI command accepts a YAML configuration specifying a dataset, evaluator set, and output directory, runs the full workflow against each dataset entry, and produces per-evaluator score files alongside profiler metrics and trace visualization artifacts. The toolkit ships six built-in evaluator types (Ragas, Trajectory, Tunable RAG, LangSmith, LangSmith custom, and LangSmith LLM-as-judge), supports custom evaluator and dataset loader plugins, and integrates with Weights & Biases Weave for dashboard-based result comparison. The recommended judge LLM for NVIDIA-native Ragas metrics is nvidia/Llama-3_3-Nemotron-Super-49B-v1.


Key Concepts

  • nat eval CLI: Core evaluation command. Takes --config_file (YAML), optional --endpoint (remote workflow server), --reps (repetitions for variance analysis), --skip_completed_entries (resume interrupted runs), --skip_workflow (offline re-evaluation of pre-generated outputs), and --override (inline config overrides).
  • Evaluation Configuration Structure: The eval YAML block specifies the following (a configuration sketch follows this list):
    • eval.general.dataset — dataset type (json, jsonl, csv, xls, parquet, custom), file path, optional filter (allowlist/denylist on dataset fields), optional structure overrides for column names
    • eval.general.output_dir / eval.general.output.dir — where artifacts are written
    • eval.general.max_concurrency — parallel execution limit (default 8); reduce to avoid rate-limiting
    • eval.evaluators — named evaluator blocks, each with _type and evaluator-specific parameters
  • Evaluation Output Artifacts:
    • workflow_output.json — per-sample execution results: question, expected answer, generated answer, intermediate steps
    • config_original.yml / config_effective.yml / config_metadata.json — reproducibility artifacts
    • <evaluator-name>_output.json — per-evaluator scores and reasoning, one file per configured evaluator
    • standardized_data_all.csv — per-request profiler metrics (latency, token counts, error flags) — profiler only
    • workflow_profiling_metrics.json — aggregated profiler statistics (means, percentiles, bottleneck scores) — profiler only
    • workflow_profiling_report.txt — human-readable latency/bottleneck summary — profiler only
    • gantt_chart.png — Gantt timeline of LLM/tool spans for visual inspection — profiler only
    • inference_optimization.json — token efficiency and caching signals — profiler only, requires compute_llm_metrics
  • Built-in Evaluators (evaluator sketches follow this list):
    • Ragas (_type: ragas) — RAG quality via NV-native metrics: AnswerAccuracy, ContextRelevance, ResponseGroundedness. Returns a float in 0–1 using a judge LLM.
    • Trajectory (_type: trajectory) — quality of the agent’s tool-use decision path (intermediate steps), scored by a judge LLM (0–1); max_tokens: 1024 is recommended for the judge.
    • Tunable RAG (_type: tunable_rag_evaluator) — flexible LLM-as-judge with a tunable prompt, custom scoring weights (coverage/correctness/relevance), and optional default_scoring: false for a fully custom rubric.
    • LangSmith (_type: langsmith) — built-in openevals evaluators referenced by short name: exact_match, levenshtein_distance.
    • LangSmith custom (_type: langsmith_custom) — any LangSmith-compatible evaluator imported by Python dotted path; supports a RunEvaluator class, a (run, example) function, or an (inputs, outputs, reference_outputs) function.
    • LangSmith judge (_type: langsmith_judge) — LLM-as-judge via openevals create_llm_as_judge; supports prebuilt prompts (correctness, hallucination) or custom f-string templates. The judge LLM must support structured output (JSON schema mode).
  • Recommended Judge LLMs (Ragas NV metrics leaderboard):
    1. nvidia/Llama-3_3-Nemotron-Super-49B-v1
    2. mistralai/mixtral-8x22b-instruct-v0.1
    3. mistralai/mixtral-8x7b-instruct-v0.1
    4. meta/llama-3.1-70b-instruct
    5. meta/llama-3.3-70b-instruct
  • EvalCallback Protocol: Allows observability providers to hook into the evaluation lifecycle. Two hooks: on_dataset_loaded (fires after dataset load, before workflow runs) and on_eval_complete (fires after all scores are computed, receives EvalResult with metric_scores and per-item results). LangSmith implements this for structured experiment tracking in LangSmith Datasets & Experiments UI.
  • Weave Visualization: Install nvidia-nat[weave] and add a general.telemetry.tracing.weave block to the config. This enables the trace timeline view and multi-run comparison dashboard in W&B Weave; workflow_alias differentiates runs within the same project (a telemetry sketch follows this list).
  • Remote Evaluation: nat eval --endpoint <url> runs the workflow on a remote NAT server and evaluates locally. Evaluation config captures the evaluation settings but not the remote workflow config (both are needed for full reproducibility).
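
Taken together, these fields compose into a single YAML file passed to nat eval via --config_file. The following is a minimal sketch, not a definitive configuration: the _type values, output_dir, max_concurrency, dataset fields, judge-model name, and max_tokens guidance come from this page, while the metric and llm_name field names, the nim LLM type, and all paths are assumptions that may vary between toolkit versions.

    # eval_config.yml (hypothetical) -- run with: nat eval --config_file eval_config.yml
    llms:
      judge_llm:                              # illustrative name for the judge LLM
        _type: nim                            # assumption: NIM-served model
        model_name: nvidia/Llama-3_3-Nemotron-Super-49B-v1  # top-ranked judge per the leaderboard above
        max_tokens: 1024                      # per this page: 8 suffices for Ragas NV metrics, 1024 recommended for trajectory

    eval:
      general:
        output_dir: ./.tmp/eval               # where the artifacts listed above are written (illustrative path)
        max_concurrency: 4                    # default is 8; reduce to avoid rate limiting
        dataset:
          _type: json                         # json, jsonl, csv, xls, parquet, or custom
          file_path: ./data/eval_set.json     # illustrative path
          # optional: filter (allowlist/denylist on dataset fields) and
          # column-name structure overrides also belong here
      evaluators:
        answer_accuracy:
          _type: ragas
          metric: AnswerAccuracy              # assumed field name; ContextRelevance and ResponseGroundedness are the other NV metrics
          llm_name: judge_llm                 # assumption: evaluators reference llms entries by name
        tool_trajectory:
          _type: trajectory
          llm_name: judge_llm                 # judge scores the intermediate tool-call path, 0-1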
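
The LangSmith-family and tunable RAG evaluators slot into the same eval.evaluators map. Another hedged sketch: the _type values, the default_scoring toggle, the short names, the dotted-path and function-signature conventions, and the prebuilt prompt names are from this page, while the evaluator, judge_llm_prompt, and prompt field names are guesses.

    eval:
      evaluators:
        rag_judge:
          _type: tunable_rag_evaluator
          llm_name: judge_llm
          default_scoring: false              # fully custom rubric instead of coverage/correctness/relevance weights
          judge_llm_prompt: |                 # assumed field name for the tunable judge prompt
            Score the generated answer against the reference answer for
            coverage, correctness, and relevance; return a value in [0, 1].
        exact:
          _type: langsmith
          evaluator: exact_match              # assumed field name; levenshtein_distance is the other short name
        custom_metric:
          _type: langsmith_custom
          evaluator: my_pkg.evals.my_metric   # hypothetical dotted path to an (inputs, outputs, reference_outputs) function
        correctness_judge:
          _type: langsmith_judge
          llm_name: judge_llm                 # must support structured output (JSON schema mode)
          prompt: correctness                 # prebuilt openevals prompt; hallucination is the other prebuilt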
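
Finally, Weave visualization is enabled through the config rather than the CLI. A sketch assuming a _type discriminator and a project field; only the block path (general.telemetry.tracing.weave) and workflow_alias are stated on this page:

    general:
      telemetry:
        tracing:
          weave:
            _type: weave                      # assumption
            project: nat-eval-demo            # illustrative W&B project name
            workflow_alias: baseline          # differentiates runs within the same project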

Key Claims and Findings

  • NAT’s Ragas evaluator uses NVIDIA-native metrics (AnswerAccuracy, ContextRelevance, ResponseGroundedness) via NIM-served judge LLMs — scores are floating-point 0–1 where 1.0 indicates a perfect match; max_tokens: 8 is sufficient for these metrics.
  • Trajectory evaluation scores the path an agent took (which tools it called, in what order) rather than just the final answer — this is qualitatively different from RAG accuracy metrics and is particularly valuable for debugging multi-step reasoning failures.
  • The plugin architecture allows custom evaluators and dataset loaders, making NAT extensible to domain-specific metrics beyond the built-in set.
  • Running multiple repetitions (--reps) is necessary to measure variance in non-deterministic agentic outputs; a single run may not be representative.
  • The --skip_completed_entries resume mechanism makes large-dataset evaluation practical when LLM rate limits cause partial failures.

Terminology

  • nat eval: NeMo Agent Toolkit CLI subcommand for running evaluation; wraps workflow execution + evaluator scoring in a single configurable pipeline.
  • EvaluationHarness / ATIF: NeMo Agent Toolkit internal evaluation framework; nvidia-nat-eval is the standalone package, while nvidia-nat[eval] installs the full runtime.
  • Ragas: Open-source LLM evaluation framework (explodinggradients); NAT uses NVIDIA-flavored metrics surfaced via the Ragas API.
  • openevals: LangChain’s evaluation library providing prebuilt LLM-as-judge prompts (correctness, hallucination) and evaluation interfaces.
  • Trajectory Evaluation: Assessment of an agent’s sequence of tool calls against the available tool set, scored by a judge LLM.
  • Tunable RAG Evaluator: NAT’s built-in customizable LLM-judge with tunable prompt, scoring rubric weights (coverage/correctness/relevance), and toggle between default and fully custom scoring.
  • Weave: Weights & Biases experiment tracking and visualization platform; NAT integrates via nvidia-nat[weave] plugin.

Connections to Existing Wiki Pages

  • AI Agents in Production: Observability & Evaluation — defines the conceptual offline/online evaluation loop and key metrics (latency, accuracy, cost) that NAT’s nat eval CLI and profiler artifacts operationalize; the trace/span observability model described there is what NAT instruments
  • Data Flywheel: What It Is and How It Works — NeMo Evaluator in the flywheel maps directly to NAT’s evaluation harness; the flywheel’s Stage 3 (model evaluation for quality assurance before redeployment) is what nat eval implements
  • Building Autonomous AI with NVIDIA Agentic NeMo — provides the broader NeMo production architecture context; NAT is the evaluation layer within that stack; NeMo Guardrails and NIM inference serving described there are the same components consumed by nat eval as the workflow under test