NVIDIA NeMo Agent Toolkit: Agent Evaluation

See also: NVIDIA NeMo Agent Toolkit (NVIDIA Platform view)

Abstract

This page summarizes the evaluation documentation from NVIDIA’s open-source NeMo Agent Toolkit (NAT), specifically the evaluate.md reference in the NeMo-Agent-Toolkit GitHub repository. NAT provides a complete evaluation harness for agentic workflows: the nat eval CLI command accepts a YAML configuration specifying a dataset, evaluator set, and output directory, runs the full workflow against each dataset entry, and produces per-evaluator score files alongside profiler metrics and trace visualization artifacts. The toolkit ships six built-in evaluator types (Ragas, Trajectory, Tunable RAG, LangSmith, LangSmith custom, and LangSmith LLM-as-judge), supports custom evaluator and dataset loader plugins, and integrates with Weights & Biases Weave for dashboard-based result comparison. The recommended judge LLM for NVIDIA-native Ragas metrics is nvidia/Llama-3_3-Nemotron-Super-49B-v1.


Key Concepts

  • nat eval CLI: Core evaluation command. Takes --config_file (YAML), optional --endpoint (remote workflow server), --reps (repetitions for variance analysis), --skip_completed_entries (resume interrupted runs), --skip_workflow (offline re-evaluation of pre-generated outputs), and --override (inline config overrides).
  • Evaluation Configuration Structure: The eval YAML block specifies the following (a configuration sketch follows this list):
    • eval.general.dataset — dataset type (json, jsonl, csv, xls, parquet, custom), file path, optional filter (allowlist/denylist on dataset fields), optional structure overrides for column names
    • eval.general.output_dir / eval.general.output.dir — where artifacts are written
    • eval.general.max_concurrency — parallel execution limit (default 8); reduce to avoid rate-limiting
    • eval.evaluators — named evaluator blocks, each with _type and evaluator-specific parameters
  • Evaluation Output Artifacts:
    • workflow_output.json — per-sample execution results: question, expected answer, generated answer, intermediate steps
    • config_original.yml / config_effective.yml / config_metadata.json — reproducibility artifacts
    • <evaluator-name>_output.json — per-evaluator scores and reasoning, one file per configured evaluator
    • standardized_data_all.csv — per-request profiler metrics (latency, token counts, error flags) — profiler only
    • workflow_profiling_metrics.json — aggregated profiler statistics (means, percentiles, bottleneck scores) — profiler only
    • workflow_profiling_report.txt — human-readable latency/bottleneck summary — profiler only
    • gantt_chart.png — Gantt timeline of LLM/tool spans for visual inspection — profiler only
    • inference_optimization.json — token efficiency and caching signals — profiler only, requires compute_llm_metrics
  • Built-in Evaluators (evaluator sketches follow this list):
    • Ragas (_type: ragas) — RAG quality via NV-native metrics: AnswerAccuracy, ContextRelevance, ResponseGroundedness. Returns a float in 0–1 using a judge LLM.
    • Trajectory (_type: trajectory) — quality of the agent’s tool-use decision path (intermediate steps), scored by a judge LLM (0–1); max_tokens: 1024 is recommended for the judge.
    • Tunable RAG (_type: tunable_rag_evaluator) — flexible LLM-as-judge with a tunable prompt, custom scoring weights (coverage/correctness/relevance), and optional default_scoring: false for a fully custom rubric.
    • LangSmith (_type: langsmith) — built-in openevals evaluators referenced by short name: exact_match, levenshtein_distance.
    • LangSmith custom (_type: langsmith_custom) — any LangSmith-compatible evaluator imported by Python dotted path; supports a RunEvaluator class, a (run, example) function, or an (inputs, outputs, reference_outputs) function.
    • LangSmith judge (_type: langsmith_judge) — LLM-as-judge via openevals create_llm_as_judge; supports prebuilt prompts (correctness, hallucination) or custom f-string templates. The judge LLM must support structured output (JSON schema mode).
  • Recommended Judge LLMs (Ragas NV metrics leaderboard):
    1. nvidia/Llama-3_3-Nemotron-Super-49B-v1
    2. mistralai/mixtral-8x22b-instruct-v0.1
    3. mistralai/mixtral-8x7b-instruct-v0.1
    4. meta/llama-3.1-70b-instruct
    5. meta/llama-3.3-70b-instruct
  • EvalCallback Protocol: Allows observability providers to hook into the evaluation lifecycle. Two hooks: on_dataset_loaded (fires after dataset load, before workflow runs) and on_eval_complete (fires after all scores are computed, receives EvalResult with metric_scores and per-item results). LangSmith implements this for structured experiment tracking in LangSmith Datasets & Experiments UI.
  • Weave Visualization: Install nvidia-nat[weave] and add a general.telemetry.tracing.weave block to the config. This enables the trace timeline view and multi-run comparison dashboard in W&B Weave; workflow_alias differentiates runs within the same project (a telemetry sketch follows this list).
  • Remote Evaluation: nat eval --endpoint <url> runs the workflow on a remote NAT server and evaluates locally. Evaluation config captures the evaluation settings but not the remote workflow config (both are needed for full reproducibility).
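
Taken together, these fields compose into a single YAML file passed to nat eval via --config_file. The following is a minimal sketch, not a definitive configuration: the _type values, output_dir, max_concurrency, dataset fields, judge-model name, and max_tokens guidance come from this page, while the metric and llm_name field names, the nim LLM type, and all paths are assumptions that may vary between toolkit versions.

    # eval_config.yml (hypothetical) -- run with: nat eval --config_file eval_config.yml
    llms:
      judge_llm:                              # illustrative name for the judge LLM
        _type: nim                            # assumption: NIM-served model
        model_name: nvidia/Llama-3_3-Nemotron-Super-49B-v1  # top-ranked judge per the leaderboard above
        max_tokens: 1024                      # per this page: 8 suffices for Ragas NV metrics, 1024 recommended for trajectory

    eval:
      general:
        output_dir: ./.tmp/eval               # where the artifacts listed above are written (illustrative path)
        max_concurrency: 4                    # default is 8; reduce to avoid rate limiting
        dataset:
          _type: json                         # json, jsonl, csv, xls, parquet, or custom
          file_path: ./data/eval_set.json     # illustrative path
          # optional: filter (allowlist/denylist on dataset fields) and
          # column-name structure overrides also belong here
      evaluators:
        answer_accuracy:
          _type: ragas
          metric: AnswerAccuracy              # assumed field name; ContextRelevance and ResponseGroundedness are the other NV metrics
          llm_name: judge_llm                 # assumption: evaluators reference llms entries by name
        tool_trajectory:
          _type: trajectory
          llm_name: judge_llm                 # judge scores the intermediate tool-call path, 0-1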
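
The LangSmith-family and tunable RAG evaluators slot into the same eval.evaluators map. Another hedged sketch: the _type values, the default_scoring toggle, the short names, the dotted-path and function-signature conventions, and the prebuilt prompt names are from this page, while the evaluator, judge_llm_prompt, and prompt field names are guesses.

    eval:
      evaluators:
        rag_judge:
          _type: tunable_rag_evaluator
          llm_name: judge_llm
          default_scoring: false              # fully custom rubric instead of coverage/correctness/relevance weights
          judge_llm_prompt: |                 # assumed field name for the tunable judge prompt
            Score the generated answer against the reference answer for
            coverage, correctness, and relevance; return a value in [0, 1].
        exact:
          _type: langsmith
          evaluator: exact_match              # assumed field name; levenshtein_distance is the other short name
        custom_metric:
          _type: langsmith_custom
          evaluator: my_pkg.evals.my_metric   # hypothetical dotted path to an (inputs, outputs, reference_outputs) function
        correctness_judge:
          _type: langsmith_judge
          llm_name: judge_llm                 # must support structured output (JSON schema mode)
          prompt: correctness                 # prebuilt openevals prompt; hallucination is the other prebuilt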
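
Finally, Weave visualization is enabled through the config rather than the CLI. A sketch assuming a _type discriminator and a project field; only the block path (general.telemetry.tracing.weave) and workflow_alias are stated on this page:

    general:
      telemetry:
        tracing:
          weave:
            _type: weave                      # assumption
            project: nat-eval-demo            # illustrative W&B project name
            workflow_alias: baseline          # differentiates runs within the same project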

Key Claims and Findings

  • NAT’s Ragas evaluator uses NVIDIA-native metrics (AnswerAccuracy, ContextRelevance, ResponseGroundedness) via NIM-served judge LLMs — scores are floating-point 0–1 where 1.0 indicates a perfect match; max_tokens: 8 is sufficient for these metrics.
  • Trajectory evaluation scores the path an agent took (which tools it called, in what order) rather than just the final answer — this is qualitatively different from RAG accuracy metrics and is particularly valuable for debugging multi-step reasoning failures.
  • The plugin architecture allows custom evaluators and dataset loaders, making NAT extensible to domain-specific metrics beyond the built-in set.
  • Running multiple repetitions (--reps) is necessary to measure variance in non-deterministic agentic outputs; a single run may not be representative.
  • The --skip_completed_entries resume mechanism makes large-dataset evaluation practical when LLM rate limits cause partial failures.

Terminology

  • nat eval: NeMo Agent Toolkit CLI subcommand for running evaluation; wraps workflow execution + evaluator scoring in a single configurable pipeline.
  • EvaluationHarness / ATIF: NeMo Agent Toolkit internal evaluation framework; nvidia-nat-eval is the standalone package, while nvidia-nat[eval] installs the full runtime.
  • Ragas: Open-source LLM evaluation framework (explodinggradients); NAT uses NVIDIA-flavored metrics surfaced via the Ragas API.
  • openevals: LangChain’s evaluation library providing prebuilt LLM-as-judge prompts (correctness, hallucination) and evaluation interfaces.
  • Trajectory Evaluation: Assessment of an agent’s sequence of tool calls against the available tool set, scored by a judge LLM.
  • Tunable RAG Evaluator: NAT’s built-in customizable LLM-judge with tunable prompt, scoring rubric weights (coverage/correctness/relevance), and toggle between default and fully custom scoring.
  • Weave: Weights & Biases experiment tracking and visualization platform; NAT integrates via nvidia-nat[weave] plugin.

Connections to Existing Wiki Pages

  • AI Agents in Production: Observability & Evaluation — defines the conceptual offline/online evaluation loop and key metrics (latency, accuracy, cost) that NAT’s nat eval CLI and profiler artifacts operationalize; the trace/span observability model described there is what NAT instruments
  • Data Flywheel: What It Is and How It Works — NeMo Evaluator in the flywheel maps directly to NAT’s evaluation harness; the flywheel’s Stage 3 (model evaluation for quality assurance before redeployment) is what nat eval implements
  • Building Autonomous AI with NVIDIA Agentic NeMo — provides the broader NeMo production architecture context; NAT is the evaluation layer within that stack; NeMo Guardrails and NIM inference serving described there are the same components consumed by nat eval as the workflow under test