NVIDIA NeMo Agent Toolkit: Evaluation (NVIDIA Platform)
Full summary: NVIDIA NeMo Agent Toolkit: Agent Evaluation
This cross-section page covers the NeMo Agent Toolkit evaluation module from the NVIDIA platform implementation angle — specifically how NVIDIA-native NIM inference and Nemotron judge models integrate with the evaluation harness.
NVIDIA Platform Components in NAT Evaluation
- NIM as Workflow Backend: `nat eval` can target a remote NIM-served workflow via `--endpoint`. Any NVIDIA NIM endpoint (LLM, RAG pipeline, or multi-agent workflow) can be evaluated by pointing NAT at its API, with no code changes to the workflow required (the invocation is shown in the first sketch after this list).
- NIM as Judge LLM: Evaluators (Ragas, Trajectory, Tunable RAG, LangSmith judge) require a judge LLM configured in the `llms:` section. NVIDIA NIM serves the recommended judge models: `nvidia/Llama-3_3-Nemotron-Super-49B-v1` (top-ranked for the NVIDIA Ragas NV metrics), plus `meta/llama-3.1-70b-instruct` and `meta/llama-3.3-70b-instruct` (also supported via NIM). A configuration sketch follows this list.
- NVIDIA Ragas NV Metrics: NAT ships NVIDIA-flavored Ragas metrics (`AnswerAccuracy`, `ContextRelevance`, `ResponseGroundedness`) tuned for use with NIM judge LLMs. These are distinct from the standard Ragas metrics: their prompts are fixed and NVIDIA-optimized.
- Local NIM Deployment for Evaluation: When NIM API rate limits interfere with evaluation runs (`[429] Too Many Requests`), the documented solution is to deploy NIM locally (via NIM containers) and point the evaluator `base_url` to `http://localhost:8000/v1`, which eliminates rate limits and keeps evaluation data private (see the second sketch after this list).
- Profiler Integration: When `eval.general.profiler` is enabled, NAT's profiler instruments LLM latency, token counts, and concurrency, producing `gantt_chart.png` and `inference_optimization.json`. These artifacts directly measure NIM inference performance characteristics (token efficiency, caching signals, prompt-prefix analysis) within the evaluation workflow; a profiler configuration sketch appears below.
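The judge-LLM and Ragas NV wiring above can be expressed in a single NAT config file. Below is a minimal sketch, assuming the toolkit's documented YAML schema (a `_type: nim` LLM provider and a `ragas` evaluator under `eval.evaluators`); the evaluator name `rag_accuracy`, the dataset path, and the output directory are illustrative placeholders:

```yaml
llms:
  nim_judge_llm:
    _type: nim                                   # NVIDIA NIM LLM provider
    model_name: nvidia/Llama-3_3-Nemotron-Super-49B-v1
    temperature: 0.0                             # deterministic judging

eval:
  general:
    output_dir: ./.tmp/eval                      # illustrative output location
    dataset:
      _type: json
      file_path: data/eval_dataset.json          # illustrative dataset path
  evaluators:
    rag_accuracy:                                # illustrative evaluator name
      _type: ragas
      metric: AnswerAccuracy                     # one of the NVIDIA Ragas NV metrics
      llm_name: nim_judge_llm                    # judge LLM from the llms: section

# Run locally:
#   nat eval --config_file eval_config.yml
# Or target a remote NIM-served workflow (no workflow code changes):
#   nat eval --config_file eval_config.yml --endpoint http://<nim-host>:8000
```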
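For the local-NIM fallback, only the judge LLM's `base_url` changes. A sketch, assuming the `nim` provider accepts a `base_url` override; the model name is whichever judge model the local container serves:

```yaml
llms:
  nim_judge_llm:
    _type: nim
    model_name: meta/llama-3.3-70b-instruct      # model served by the local NIM container
    base_url: http://localhost:8000/v1           # local NIM endpoint: no remote rate limits,
    temperature: 0.0                             # and evaluation data stays on-premises
```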
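Enabling the profiler is likewise a config-level switch. A sketch under the assumption that the option names below match the toolkit's documented profiler settings (treat individual flags as illustrative if your NAT version differs); `gantt_chart.png` and `inference_optimization.json` are written to `output_dir`:

```yaml
eval:
  general:
    output_dir: ./.tmp/eval
    profiler:
      compute_llm_metrics: true           # per-call latency and token counts
      token_uniqueness_forecast: true     # token-efficiency signal
      workflow_runtime_forecast: true
      prompt_caching_prefixes:
        enable: true                      # prompt-prefix analysis for caching signals
        min_frequency: 0.1
      bottleneck_analysis:
        enable_nested_stack: true         # feeds the concurrency / Gantt-chart view
      concurrency_spike_analysis:
        enable: true
        spike_threshold: 7
```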
Connections to Existing Wiki Pages
- NVIDIA NeMo Agent Toolkit: Agent Evaluation — primary page with full evaluation harness documentation
- Optimization — NVIDIA Triton Inference Server — Triton is the inference serving layer underlying NIM; evaluation workflows instrumented by NAT’s profiler will reflect Triton optimization choices (batching, quantization, model parallelism)
- Building Autonomous AI with NVIDIA Agentic NeMo — NeMo production stack overview; NAT evaluation is the quality gate within that stack before production promotion