Improve AI Code Generation Using NVIDIA NeMo Agent Toolkit
By Christian Munley — NVIDIA Developer Blog, 2025-03-18
Abstract
This article demonstrates building a test-driven AI coding agent using NVIDIA-NeMo-Agent-Toolkit with LangGraph and DeepSeek-R1, framing agentic code generation as a test-time compute scaling problem. The coding agent operates as a structured loop: a code LLM generates a patch given a problem statement and existing tests; a sandboxed executor runs the unit tests; if tests fail, a reasoning model (DeepSeek-R1) diagnoses the error and suggests a fix; the loop repeats until all tests pass or the iteration budget is exhausted. The article also shows how to wrap this coding agent as a callable tool inside a ReACT-style supervisor agent that orchestrates multiple specialists asynchronously — enabling complex software tasks like research, error localisation, and test generation to run in parallel.
Key Concepts
- Test-time compute scaling: improving AI performance at inference by allocating more compute for reasoning and iterative refinement rather than expanding pre-training scale
- Flow engineering: a structured-agent design pattern where states and transitions are predefined but agent/tool execution within each state retains autonomy — a practical middle ground between fully flexible and fully scripted agents
- Test-driven coding agent: agent combining a code LLM for patch generation and a reasoning LLM (DeepSeek-R1) for error analysis; correctness is verified by runnable unit tests rather than heuristic scoring
- Sandboxed code execution: tool providing a safe, controlled environment for running generated code; prevents arbitrary execution while giving the agent real feedback
- Supervisor agent: ReACT-style orchestrator managing specialised sub-agents (code generation, research, error localisation, test generation) that can be invoked asynchronously
- YAML configuration: Agent Toolkit’s declarative specification for workflows — swapping models, tools, or logic requires only a config change, not code rewriting
Key Claims and Findings
- Agentic code generation is an ideal test-time compute use case because success is objectively verifiable (tests pass or fail)
- DeepSeek-R1’s chain-of-thought reasoning accurately guides the code generation model through a debugging loop across multiple iterations
- Agent Toolkit reduces the operational friction around evaluation, deployment, and optimisation — all major challenges in production agentic AI
- The aiq eval harness enables rapid iteration: change a model or prompt in the config, rerun the evaluation, and compare metrics automatically
- Supervisor agents enable asynchronous parallel execution of specialised agents, making complex multi-step tasks more efficient
Agent Design Pattern
Problem statement + code + unit tests
│
▼
[Code LLM] → generate patch
│
▼
[Sandbox] → run unit tests
│
Pass? ──Yes──► Done
│
No
│
▼
[DeepSeek-R1] → diagnose failure, suggest fix
│
└──────────► repeat (up to N iterations)
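The loop in the diagram reduces to plain control flow. In this sketch, `generate_patch`, `run_tests`, and `diagnose` stand in for the code LLM, the sandboxed executor, and the DeepSeek-R1 diagnosis step respectively; the toy implementations exist only to make the loop runnable and are not toolkit APIs:

```python
# Toy stand-ins so the loop is runnable; real versions would call LLM endpoints
# and the sandbox tool.
def generate_patch(problem, tests, feedback):
    # First attempt is deliberately buggy; once feedback arrives, emit the fix.
    return "def add(a, b): return a + b" if feedback else "def add(a, b): return a - b"

def run_tests(patch, tests):
    ns = {}
    exec(patch, ns)              # toy "sandbox"; the real tool isolates execution
    try:
        exec(tests, ns)
        return True, "ok"
    except AssertionError as e:
        return False, f"assertion failed: {e}"

def diagnose(problem, patch, log):
    return f"tests failed ({log}); re-check the operator in the patch"

def coding_agent(problem: str, tests: str, max_iters: int = 5):
    """Generate a patch, run the tests, and on failure feed the reasoning
    model's diagnosis back into the next generation attempt."""
    feedback = ""
    for _ in range(max_iters):
        patch = generate_patch(problem, tests, feedback)   # code LLM
        passed, log = run_tests(patch, tests)              # sandboxed executor
        if passed:
            return patch                                   # all tests pass: done
        feedback = diagnose(problem, patch, log)           # reasoning LLM step
    return None                                            # iteration budget spent

patch = coding_agent("implement add", "assert add(2, 3) == 5")
```

The structure mirrors the flow-engineering pattern above: the states and transitions are fixed in code, while each state's inner step (what the LLM generates, what the diagnosis says) stays autonomous.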
Toolkit CLI Reference
| Command | Purpose |
|---|---|
| aiq workflow create <name> | Scaffold a new project template with default workflow and config |
| aiq eval | Run evaluation harness against a dataset using configurable metrics |
| aiq serve | Launch the workflow as a stateless REST microservice |
Terminology
- aiq workflow create: CLI subcommand generating a new Agent Toolkit project template
- aiq eval: evaluation CLI that tests agents against datasets and scores outputs with customisable metrics
- Beam search / reasoning models: inference-time search methods that explore multiple reasoning paths before committing to a final answer (e.g., DeepSeek-R1, OpenAI o1)
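The beam-search idea mentioned above can be illustrated with a toy implementation: keep the k highest-scoring partial sequences at each step instead of greedily committing to one. The scoring function and two-token vocabulary are made up for the example:

```python
import math

def beam_search(score_step, vocab, length: int, beam_width: int = 2):
    """Keep the beam_width highest-scoring partial sequences at every step.
    score_step(seq, token) returns the log-probability of appending token."""
    beams = [((), 0.0)]                      # (sequence, cumulative log-prob)
    for _ in range(length):
        candidates = [
            (seq + (tok,), lp + score_step(seq, tok))
            for seq, lp in beams
            for tok in vocab
        ]
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]                       # best complete sequence

# Toy scorer that strongly prefers alternating tokens: a greedy decoder would
# start with the locally best "a" and then still alternate, but beam search
# keeps both openings alive until the scores settle.
def score_step(seq, tok):
    if not seq:
        return math.log(0.6 if tok == "a" else 0.4)
    return math.log(0.9 if tok != seq[-1] else 0.1)

best = beam_search(score_step, ["a", "b"], length=3)  # -> ('a', 'b', 'a')
```

Reasoning models such as DeepSeek-R1 apply the same principle at a much larger scale, exploring multiple chains of thought before committing to an answer.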
Connections to Existing Wiki Pages
- NVIDIA NeMo Agent Toolkit — product overview of the toolkit used in this tutorial
- Improve AI Code Generation (Agent Development angle) — cross-section page focusing on agent design patterns
- NeMo Agent Toolkit: Evaluation — aiq eval harness described in depth
- Understanding the Planning of LLM Agents — survey contextualising test-time compute scaling and iterative planning