NVIDIA NeMo Agent Toolkit: Evaluation (NVIDIA Platform)

Full summary: NVIDIA NeMo Agent Toolkit: Agent Evaluation

This cross-section page covers the NeMo Agent Toolkit (NAT) evaluation module from the NVIDIA platform implementation angle: specifically, how NVIDIA-native NIM inference and Nemotron judge models integrate with the evaluation harness.

NVIDIA Platform Components in NAT Evaluation

  • NIM as Workflow Backend: nat eval can target a remote NIM-served workflow via --endpoint. Any NVIDIA NIM endpoint (an LLM, a RAG pipeline, or a multi-agent workflow) can be evaluated by pointing NAT at its API; no code changes to the workflow are required. A sketch of the config and invocation appears after this list.
  • NIM as Judge LLM: Evaluators (Ragas, Trajectory, Tunable RAG, LangSmith judge) require a judge LLM configured in the llms: section; see the config sketch after this list. NVIDIA NIM serves the recommended judge models:
    • nvidia/Llama-3_3-Nemotron-Super-49B-v1 — top-ranked for NVIDIA Ragas NV metrics
    • meta/llama-3.1-70b-instruct, meta/llama-3.3-70b-instruct — supported via NIM
  • NVIDIA Ragas NV Metrics: NAT ships NVIDIA-flavored Ragas metrics (AnswerAccuracy, ContextRelevance, ResponseGroundedness) tuned for use with NIM judge LLMs. These are distinct from standard Ragas metrics: their prompts are fixed and NVIDIA-optimized. An evaluator referencing one of these metrics appears in the config sketch below.
  • Local NIM Deployment for Evaluation: When NIM API rate limits interfere with evaluation runs ([429] Too Many Requests), the documented solution is to deploy NIM locally (via NIM containers) and point the evaluator base_url to http://localhost:8000/v1, eliminating rate limits and keeping evaluation data private. A base_url override is sketched after this list.
  • Profiler Integration: When eval.general.profiler is enabled, NAT's profiler instruments LLM latency, token counts, and concurrency, producing gantt_chart.png and inference_optimization.json. These artifacts directly measure NIM inference performance characteristics (token efficiency, caching signals, prompt-prefix analysis) within the evaluation workflow; an example profiler stanza follows below.
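To make the wiring concrete, here is a minimal config sketch combining a NIM judge LLM with a Ragas NV evaluator. It follows the YAML conventions in the NAT documentation (an llms: section defining a named judge model, referenced by llm_name under eval.evaluators), but the exact field names, the temperature setting, and the dataset path are illustrative and may differ across toolkit versions:

```yaml
# eval_config.yml -- illustrative sketch; field names may differ by NAT version
llms:
  nim_judge:
    _type: nim                                           # NIM-backed LLM client
    model_name: nvidia/Llama-3_3-Nemotron-Super-49B-v1   # recommended judge model
    temperature: 0.0                                     # deterministic judging (assumed setting)

eval:
  general:
    output_dir: ./.tmp/eval/          # placeholder output directory
    dataset:
      _type: json
      file_path: data/eval_set.json   # placeholder dataset path
  evaluators:
    answer_accuracy:
      _type: ragas                    # Ragas NV evaluator
      metric: AnswerAccuracy          # one of the NV metrics listed above
      llm_name: nim_judge             # judge LLM defined in llms:
```

Against a remote NIM-served workflow, the same config can be reused with the --endpoint flag described above (an invocation along the lines of nat eval --config_file eval_config.yml --endpoint <workflow URL>; the exact CLI shape should be checked against the installed version).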
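When hosted-API rate limits interfere, the same judge entry can instead be pointed at a locally deployed NIM container by overriding base_url, per the troubleshooting note above. A minimal sketch (the port and path are the documented defaults cited above; the model name is one of the supported judges listed earlier):

```yaml
llms:
  nim_judge:
    _type: nim
    base_url: http://localhost:8000/v1       # local NIM container; no hosted rate limits
    model_name: meta/llama-3.1-70b-instruct  # supported judge model served locally
```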
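Finally, a sketch of enabling the profiler during an evaluation run. The eval.general.profiler path is named above; the individual option names below are assumptions based on the NAT profiler documentation and should be treated as illustrative rather than authoritative:

```yaml
eval:
  general:
    profiler:
      compute_llm_metrics: true         # per-call LLM latency and token counts
      token_uniqueness_forecast: true   # token-efficiency forecasting
      workflow_runtime_forecast: true   # end-to-end runtime projection
      prompt_caching_prefixes:
        enable: true                    # prompt-prefix analysis for caching signals
        min_frequency: 0.1              # assumed threshold for reporting a prefix
# With profiling enabled, artifacts such as gantt_chart.png and
# inference_optimization.json are written to the evaluation output directory.
```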

Connections to Existing Wiki Pages