NVIDIA NeMo Agent Toolkit: Agent Evaluation
See also: NVIDIA NeMo Agent Toolkit (NVIDIA Platform view)
Abstract
This page summarizes the evaluation documentation from NVIDIA’s open-source NeMo Agent Toolkit (NAT), specifically the `evaluate.md` reference in the NeMo-Agent-Toolkit GitHub repository. NAT provides a complete evaluation harness for agentic workflows: the `nat eval` CLI command accepts a YAML configuration specifying a dataset, evaluator set, and output directory, runs the full workflow against each dataset entry, and produces per-evaluator score files alongside profiler metrics and trace visualization artifacts. The toolkit ships six built-in evaluator types (Ragas, Trajectory, Tunable RAG, LangSmith, LangSmith custom, and LangSmith LLM-as-judge), supports custom evaluator and dataset loader plugins, and integrates with Weights & Biases Weave for dashboard-based result comparison. The recommended judge LLM for NVIDIA-native Ragas metrics is `nvidia/Llama-3_3-Nemotron-Super-49B-v1`.
Key Concepts
- `nat eval` CLI: Core evaluation command. Takes `--config_file` (YAML), optional `--endpoint` (remote workflow server), `--reps` (repetitions for variance analysis), `--skip_completed_entries` (resume interrupted runs), `--skip_workflow` (offline re-evaluation of pre-generated outputs), and `--override` (inline config overrides).
- Evaluation Configuration Structure: The `eval` YAML block specifies (a hedged config sketch follows this list):
  - `eval.general.dataset` — dataset type (`json`, `jsonl`, `csv`, `xls`, `parquet`, `custom`), file path, optional `filter` (allowlist/denylist on dataset fields), optional `structure` overrides for column names
  - `eval.general.output_dir` / `eval.general.output.dir` — where artifacts are written
  - `eval.general.max_concurrency` — parallel execution limit (default 8); reduce to avoid rate-limiting
  - `eval.evaluators` — named evaluator blocks, each with `_type` and evaluator-specific parameters
- Evaluation Output Artifacts:
  - `workflow_output.json` — per-sample execution results: question, expected answer, generated answer, intermediate steps
  - `config_original.yml` / `config_effective.yml` / `config_metadata.json` — reproducibility artifacts
  - `<evaluator-name>_output.json` — per-evaluator scores and reasoning, one file per configured evaluator
  - `standardized_data_all.csv` — per-request profiler metrics (latency, token counts, error flags) — profiler only
  - `workflow_profiling_metrics.json` — aggregated profiler statistics (means, percentiles, bottleneck scores) — profiler only
  - `workflow_profiling_report.txt` — human-readable latency/bottleneck summary — profiler only
  - `gantt_chart.png` — Gantt timeline of LLM/tool spans for visual inspection — profiler only
  - `inference_optimization.json` — token efficiency and caching signals — profiler only, requires `compute_llm_metrics`
- Built-in Evaluators (hedged config sketches for several of these types follow this list):
| Evaluator | `_type` | What It Measures |
|---|---|---|
| Ragas | `ragas` | RAG quality via NV-native metrics: `AnswerAccuracy`, `ContextRelevance`, `ResponseGroundedness`. Returns a float 0–1 from a judge LLM. |
| Trajectory | `trajectory` | Quality of the agent’s tool-use decision path (intermediate steps), scored by a judge LLM (0–1). Recommended `max_tokens: 1024`. |
| Tunable RAG | `tunable_rag_evaluator` | Flexible LLM-as-judge with a tunable prompt, custom scoring weights (coverage/correctness/relevance), and optional `default_scoring: false` for a fully custom rubric. |
| LangSmith | `langsmith` | Built-in openevals evaluators selected by short name: `exact_match`, `levenshtein_distance`. |
| LangSmith custom | `langsmith_custom` | Any LangSmith-compatible evaluator imported by Python dotted path. Supports a `RunEvaluator` class, a `(run, example)` function, or an `(inputs, outputs, reference_outputs)` function. |
| LangSmith judge | `langsmith_judge` | LLM-as-judge via openevals `create_llm_as_judge`. Supports prebuilt prompts (`correctness`, `hallucination`) or custom f-string templates. The judge LLM must support structured output (JSON schema mode). |
- Recommended Judge LLMs (Ragas NV metrics leaderboard):
  - `nvidia/Llama-3_3-Nemotron-Super-49B-v1`
  - `mistralai/mixtral-8x22b-instruct-v0.1`
  - `mistralai/mixtral-8x7b-instruct-v0.1`
  - `meta/llama-3.1-70b-instruct`
  - `meta/llama-3.3-70b-instruct`
- EvalCallback Protocol: Allows observability providers to hook into the evaluation lifecycle. Two hooks: `on_dataset_loaded` (fires after dataset load, before workflow runs) and `on_eval_complete` (fires after all scores are computed, receives `EvalResult` with `metric_scores` and per-item results). LangSmith implements this for structured experiment tracking in the LangSmith Datasets & Experiments UI.
- Weave Visualization: Install `nvidia-nat[weave]` and add a `general.telemetry.tracing.weave` block to the config (included in the sketch after this list). Enables a trace timeline view and multi-run comparison dashboard in W&B Weave; `workflow_alias` differentiates runs within the same project.
- Remote Evaluation: `nat eval --endpoint <url>` runs the workflow on a remote NAT server and evaluates locally. The evaluation config captures the evaluation settings but not the remote workflow config (both are needed for full reproducibility).
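
Tying the bullets above together, a minimal configuration might look like the following sketch. It assumes the key names quoted in the bullets (`eval.general.dataset`, `eval.general.output_dir`, `eval.general.max_concurrency`, `eval.evaluators`, `general.telemetry.tracing.weave`); the dataset path, output directory, W&B project name, evaluator names, and per-field keys such as `file_path` and `metric` are illustrative assumptions, not confirmed against `evaluate.md`.

```yaml
# Invoked as, for example:
#   nat eval --config_file eval_config.yml --reps 3
#   nat eval --config_file eval_config.yml --endpoint http://remote-nat:8000
general:
  telemetry:
    tracing:
      weave:                           # requires: pip install "nvidia-nat[weave]"
        project: my-eval-project       # placeholder W&B project name
        workflow_alias: baseline       # differentiates runs in the same project

eval:
  general:
    dataset:
      _type: json                      # one of: json, jsonl, csv, xls, parquet, custom
      file_path: data/eval_set.json    # assumed key name and placeholder path
    output_dir: .tmp/eval_output       # placeholder; the artifacts listed above land here
    max_concurrency: 4                 # lowered from the default 8 to avoid rate limits
  evaluators:
    rag_accuracy:
      _type: ragas
      metric: AnswerAccuracy           # assumed key for selecting the NV-native metric
    tool_path:
      _type: trajectory                # judge-LLM scoring of the tool-use path
```

The commented invocations use documented flags: `--reps 3` repeats the run for variance analysis, and `--endpoint` targets a remote NAT server while evaluation stays local.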
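
For the evaluator types from the table that are not shown above, here is a similarly hedged sketch of additional `eval.evaluators` entries. Apart from the `_type` values and `default_scoring: false`, which come from the table, the parameter key names and the dotted path `my_package.evals.my_custom_eval` are hypothetical placeholders.

```yaml
eval:
  evaluators:
    # Tunable RAG LLM-as-judge; default_scoring: false is documented in the
    # table above, the prompt key name is an assumption
    answer_quality:
      _type: tunable_rag_evaluator
      default_scoring: false
      judge_llm_prompt: |
        Score the answer for coverage, correctness, and relevance...
    # Built-in openevals evaluator selected by short name (key name assumed)
    string_match:
      _type: langsmith
      evaluator_name: exact_match
    # openevals create_llm_as_judge with a prebuilt prompt (key name assumed);
    # the judge LLM must support structured output (JSON schema mode)
    correctness_judge:
      _type: langsmith_judge
      prompt: correctness
    # LangSmith-compatible evaluator imported by Python dotted path;
    # the module path is a hypothetical placeholder (key name assumed)
    domain_check:
      _type: langsmith_custom
      evaluator_path: my_package.evals.my_custom_eval
```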
Key Claims and Findings
- NAT’s Ragas evaluator uses NVIDIA-native metrics (`AnswerAccuracy`, `ContextRelevance`, `ResponseGroundedness`) via NIM-served judge LLMs — scores are floating-point 0–1, where 1.0 indicates a perfect match; `max_tokens: 8` is sufficient for these metrics (see the judge-LLM sketch after this list).
- Trajectory evaluation scores the path an agent took (which tools it called, in what order) rather than just the final answer — this is qualitatively different from RAG accuracy metrics and is particularly valuable for debugging multi-step reasoning failures.
- The plugin architecture allows custom evaluators and dataset loaders, making NAT extensible to domain-specific metrics beyond the built-in set.
- Running multiple repetitions (`--reps`) is necessary to measure variance in non-deterministic agentic outputs; a single run may not be representative.
- The `--skip_completed_entries` resume mechanism makes large-dataset evaluation practical when LLM rate limits cause partial failures.
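
As a worked illustration of the token-budget guidance above (`max_tokens: 8` for the NV-native Ragas metrics, `max_tokens: 1024` for trajectory scoring), judge LLMs might be defined along these lines. This is a sketch assuming an `llms` config section and a NIM provider `_type`, both of which are assumptions; the model name is the top entry from the leaderboard list in Key Concepts.

```yaml
llms:
  # Ragas NV-native metrics only emit a score, so a tiny output budget suffices
  ragas_judge:
    _type: nim                                          # assumed provider type
    model_name: nvidia/Llama-3_3-Nemotron-Super-49B-v1  # recommended judge
    max_tokens: 8
  # Trajectory scoring produces reasoning about the tool-call path, so it
  # needs the larger recommended budget
  trajectory_judge:
    _type: nim
    model_name: nvidia/Llama-3_3-Nemotron-Super-49B-v1
    max_tokens: 1024
```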
Terminology
- `nat eval`: NeMo Agent Toolkit CLI subcommand for running evaluation; wraps workflow execution + evaluator scoring in a single configurable pipeline.
- EvaluationHarness / ATIF: NeMo Agent Toolkit internal evaluation framework; `nvidia-nat-eval` is the standalone package, `nvidia-nat[eval]` installs the full runtime.
- Ragas: Open-source LLM evaluation framework (explodinggradients); NAT uses NVIDIA-flavored metrics surfaced via the Ragas API.
- openevals: LangChain’s evaluation library providing prebuilt LLM-as-judge prompts (`correctness`, `hallucination`) and evaluation interfaces.
- Trajectory Evaluation: Assessment of an agent’s sequence of tool calls against the available tool set, scored by a judge LLM.
- Tunable RAG Evaluator: NAT’s built-in customizable LLM-judge with a tunable prompt, scoring rubric weights (coverage/correctness/relevance), and a toggle between default and fully custom scoring.
- Weave: Weights & Biases experiment tracking and visualization platform; NAT integrates via the `nvidia-nat[weave]` plugin.
Connections to Existing Wiki Pages
- AI Agents in Production: Observability & Evaluation — defines the conceptual offline/online evaluation loop and key metrics (latency, accuracy, cost) that NAT’s `nat eval` CLI and profiler artifacts operationalize; the trace/span observability model described there is what NAT instruments.
- Data Flywheel: What It Is and How It Works — NeMo Evaluator in the flywheel maps directly to NAT’s evaluation harness; the flywheel’s Stage 3 (model evaluation for quality assurance before redeployment) is what `nat eval` implements.
- Building Autonomous AI with NVIDIA Agentic NeMo — provides the broader NeMo production architecture context; NAT is the evaluation layer within that stack; the NeMo Guardrails and NIM inference serving described there are the same components consumed by `nat eval` as the workflow under test.