Improve AI Code Generation Using NVIDIA NeMo Agent Toolkit

By Christian Munley — NVIDIA Developer Blog, 2025-03-18

Abstract

This article demonstrates building a test-driven AI coding agent using the NVIDIA NeMo Agent Toolkit with LangGraph and DeepSeek-R1, framing agentic code generation as a test-time compute scaling problem. The coding agent operates as a structured loop: a code LLM generates a patch given a problem statement and existing tests; a sandboxed executor runs the unit tests; if tests fail, a reasoning model (DeepSeek-R1) diagnoses the error and suggests a fix; the loop repeats until all tests pass or the iteration budget is exhausted. The article also shows how to wrap this coding agent as a callable tool inside a ReAct-style supervisor agent that orchestrates multiple specialists asynchronously, so that sub-tasks of complex software work such as research, error localisation, and test generation can run in parallel.

Key Concepts

  • Test-time compute scaling: improving AI performance at inference by allocating more compute for reasoning and iterative refinement rather than expanding pre-training scale
  • Flow engineering: a structured-agent design pattern where states and transitions are predefined but agent/tool execution within each state retains autonomy — a practical middle ground between fully flexible and fully scripted agents
  • Test-driven coding agent: an agent combining a code LLM for patch generation with a reasoning LLM (DeepSeek-R1) for error analysis; correctness is verified by runnable unit tests rather than heuristic scoring
  • Sandboxed code execution: tool providing a safe, controlled environment for running generated code; prevents arbitrary execution while giving the agent real feedback
  • Supervisor agent: ReAct-style orchestrator managing specialised sub-agents (code generation, research, error localisation, test generation) that can be invoked asynchronously
  • YAML configuration: Agent Toolkit’s declarative specification for workflows — swapping models, tools, or logic requires only a config change, not code rewriting (see the sketch after this list)
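
As a concrete illustration of the declarative style, the sketch below wires the test-driven coding agent into a ReAct-style supervisor. The llms/functions/workflow layout and _type discriminators follow the toolkit's general pattern, but the specific type names, field names, and model identifiers here are assumptions for illustration, not the blog's exact configuration.

    # Hypothetical workflow config; exact keys and type names may differ
    # from the shipped schema.
    llms:
      code_llm:
        _type: nim
        model_name: qwen/qwen2.5-coder-32b-instruct   # example code-generation model
      reasoning_llm:
        _type: nim
        model_name: deepseek-ai/deepseek-r1            # error-diagnosis model

    functions:
      coding_agent:
        _type: test_driven_coding_agent   # hypothetical name for the coding-agent tool
        code_llm: code_llm
        reasoning_llm: reasoning_llm
        max_iterations: 5

    workflow:
      _type: react_agent                  # ReAct-style supervisor
      llm_name: reasoning_llm
      tool_names: [coding_agent]

Under this layout, swapping DeepSeek-R1 for another reasoning model or changing the iteration budget is a one-line edit under llms or functions rather than a code change.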

Key Claims and Findings

  • Agentic code generation is an ideal test-time compute use case because success is objectively verifiable (tests pass or fail)
  • DeepSeek-R1’s chain-of-thought reasoning accurately guides the code generation model through a debugging loop across multiple iterations
  • Agent Toolkit reduces the operational friction around evaluation, deployment, and optimisation — all major challenges in production agentic AI
  • The aiq eval harness enables rapid iteration: change a model or prompt in the config, rerun eval, compare metrics automatically
  • Supervisor agents enable async parallel execution of specialised agents, making complex multi-step tasks more efficient

Agent Design Pattern

Problem statement + code + unit tests
       │
       ▼
  [Code LLM] → generate patch
       │
       ▼
  [Sandbox] → run unit tests
       │
   Pass? ──Yes──► Done
       │
       No
       │
       ▼
  [DeepSeek-R1] → diagnose failure, suggest fix
       │
       └──────────► repeat (up to N iterations)
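
In code, the loop above reduces to a few lines. The sketch below is illustrative rather than the toolkit's implementation: generate_patch, run_tests, and diagnose_failure are placeholders for the code-LLM call, the sandboxed test executor, and the DeepSeek-R1 diagnosis step, passed in as plain callables so the control flow stands on its own.

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class TestResult:
        passed: bool   # True when every unit test passed in the sandbox
        log: str       # captured test output, fed back to the reasoning model

    def repair_loop(
        problem: str,
        code: str,
        tests: str,
        generate_patch: Callable[[str, str, str, str], str],   # code LLM
        run_tests: Callable[[str, str], TestResult],            # sandboxed executor
        diagnose_failure: Callable[[str, str], str],            # reasoning LLM (DeepSeek-R1)
        max_iters: int = 5,
    ) -> str:
        feedback = ""
        patch = code
        for _ in range(max_iters):
            patch = generate_patch(problem, code, tests, feedback)
            result = run_tests(patch, tests)
            if result.passed:
                return patch                      # all tests pass: done
            feedback = diagnose_failure(patch, result.log)
        return patch                              # budget exhausted: return best attempt

Keeping the executor and the two models behind plain callables mirrors the configuration-driven design: the same loop can be re-targeted by editing the YAML config rather than the code.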

Toolkit CLI Reference

  Command                       Purpose
  aiq workflow create <name>    Scaffold a new project template with default workflow and config
  aiq eval                      Run evaluation harness against a dataset using configurable metrics
  aiq serve                     Launch the workflow as a stateless REST microservice
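
A typical session might string these commands together as below. The flag names and paths are assumptions for illustration, not verified CLI output; consult the installed CLI's help for exact options.

    # Scaffold a project, evaluate it against a dataset, then serve it.
    aiq workflow create coding_agent
    aiq eval --config_file coding_agent/configs/eval_config.yml
    aiq serve --config_file coding_agent/configs/config.yml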

Terminology

  • aiq workflow create: CLI subcommand that scaffolds a new Agent Toolkit project template
  • aiq eval: evaluation CLI that tests agents against datasets and scores outputs with customisable metrics
  • Beam search / reasoning models: inference-time search methods that explore multiple reasoning paths before committing to a final answer (e.g., DeepSeek-R1, OpenAI o1)

Connections to Existing Wiki Pages