NVIDIA NeMo Agent Toolkit: Evaluation (NVIDIA Platform)

Full summary: NVIDIA NeMo Agent Toolkit: Agent Evaluation

This cross-section page covers the NeMo Agent Toolkit (NAT) evaluation module from the NVIDIA platform implementation angle: specifically, how NVIDIA-native NIM inference and Nemotron judge models integrate with the evaluation harness.

NVIDIA Platform Components in NAT Evaluation

  • NIM as Workflow Backend: nat eval can target a remote NIM-served workflow via --endpoint. Any NVIDIA NIM endpoint (an LLM, a RAG pipeline, or a multi-agent workflow) can be evaluated by pointing NAT at its API; no code changes to the workflow are required. A sketch of the config and invocation appears after this list.
  • NIM as Judge LLM: Evaluators (Ragas, Trajectory, Tunable RAG, LangSmith judge) require a judge LLM configured in the llms: section; see the config sketch after this list. NVIDIA NIM serves the recommended judge models:
    • nvidia/Llama-3_3-Nemotron-Super-49B-v1 — top-ranked for NVIDIA Ragas NV metrics
    • meta/llama-3.1-70b-instruct, meta/llama-3.3-70b-instruct — supported via NIM
  • NVIDIA Ragas NV Metrics: NAT ships NVIDIA-flavored Ragas metrics (AnswerAccuracy, ContextRelevance, ResponseGroundedness) tuned for use with NIM judge LLMs. These are distinct from standard Ragas metrics: their prompts are fixed and NVIDIA-optimized. An evaluator referencing one of these metrics appears in the config sketch below.
  • Local NIM Deployment for Evaluation: When NIM API rate limits interfere with evaluation runs ([429] Too Many Requests), the documented solution is to deploy NIM locally (via NIM containers) and point the evaluator base_url to http://localhost:8000/v1, eliminating rate limits and keeping evaluation data private. A base_url override is sketched after this list.
  • Profiler Integration: When eval.general.profiler is enabled, NAT's profiler instruments LLM latency, token counts, and concurrency, producing gantt_chart.png and inference_optimization.json. These artifacts directly measure NIM inference performance characteristics (token efficiency, caching signals, prompt-prefix analysis) within the evaluation workflow; an example profiler stanza follows below.
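To make the wiring concrete, here is a minimal config sketch combining a NIM judge LLM with a Ragas NV evaluator. It follows the YAML conventions in the NAT documentation (an llms: section defining a named judge model, referenced by llm_name under eval.evaluators), but the exact field names, the temperature setting, and the dataset path are illustrative and may differ across toolkit versions:

```yaml
# eval_config.yml -- illustrative sketch; field names may differ by NAT version
llms:
  nim_judge:
    _type: nim                                           # NIM-backed LLM client
    model_name: nvidia/Llama-3_3-Nemotron-Super-49B-v1   # recommended judge model
    temperature: 0.0                                     # deterministic judging (assumed setting)

eval:
  general:
    output_dir: ./.tmp/eval/          # placeholder output directory
    dataset:
      _type: json
      file_path: data/eval_set.json   # placeholder dataset path
  evaluators:
    answer_accuracy:
      _type: ragas                    # Ragas NV evaluator
      metric: AnswerAccuracy          # one of the NV metrics listed above
      llm_name: nim_judge             # judge LLM defined in llms:
```

Against a remote NIM-served workflow, the same config can be reused with the --endpoint flag described above (an invocation along the lines of nat eval --config_file eval_config.yml --endpoint <workflow URL>; the exact CLI shape should be checked against the installed version).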
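When hosted-API rate limits interfere, the same judge entry can instead be pointed at a locally deployed NIM container by overriding base_url, per the troubleshooting note above. A minimal sketch (the port and path are the documented defaults cited above; the model name is one of the supported judges listed earlier):

```yaml
llms:
  nim_judge:
    _type: nim
    base_url: http://localhost:8000/v1       # local NIM container; no hosted rate limits
    model_name: meta/llama-3.1-70b-instruct  # supported judge model served locally
```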
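Finally, a sketch of enabling the profiler during an evaluation run. The eval.general.profiler path is named above; the individual option names below are assumptions based on the NAT profiler documentation and should be treated as illustrative rather than authoritative:

```yaml
eval:
  general:
    profiler:
      compute_llm_metrics: true         # per-call LLM latency and token counts
      token_uniqueness_forecast: true   # token-efficiency forecasting
      workflow_runtime_forecast: true   # end-to-end runtime projection
      prompt_caching_prefixes:
        enable: true                    # prompt-prefix analysis for caching signals
        min_frequency: 0.1              # assumed threshold for reporting a prefix
# With profiling enabled, artifacts such as gantt_chart.png and
# inference_optimization.json are written to the evaluation output directory.
```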

Connections to Existing Wiki Pages