Improve AI Code Generation Using NVIDIA NeMo Agent Toolkit

By Christian Munley — NVIDIA Developer Blog, 2025-03-18

Abstract

This article demonstrates building a test-driven AI coding agent using the NVIDIA NeMo Agent Toolkit with LangGraph and DeepSeek-R1, framing agentic code generation as a test-time compute scaling problem. The coding agent operates as a structured loop: a code LLM generates a patch given a problem statement and existing tests; a sandboxed executor runs the unit tests; if tests fail, a reasoning model (DeepSeek-R1) diagnoses the error and suggests a fix; the loop repeats until all tests pass or the iteration budget is exhausted. The article also shows how to wrap this coding agent as a callable tool inside a ReAct-style supervisor agent that orchestrates multiple specialists asynchronously, so that sub-tasks of complex software work such as research, error localisation, and test generation can run in parallel.

Key Concepts

  • Test-time compute scaling: improving AI performance at inference by allocating more compute for reasoning and iterative refinement rather than expanding pre-training scale
  • Flow engineering: a structured-agent design pattern where states and transitions are predefined but agent/tool execution within each state retains autonomy — a practical middle ground between fully flexible and fully scripted agents
  • Test-driven coding agent: an agent combining a code LLM for patch generation with a reasoning LLM (DeepSeek-R1) for error analysis; correctness is verified by runnable unit tests rather than heuristic scoring
  • Sandboxed code execution: tool providing a safe, controlled environment for running generated code; prevents arbitrary execution while giving the agent real feedback
  • Supervisor agent: ReAct-style orchestrator managing specialised sub-agents (code generation, research, error localisation, test generation) that can be invoked asynchronously
  • YAML configuration: Agent Toolkit’s declarative specification for workflows — swapping models, tools, or logic requires only a config change, not code rewriting (see the sketch after this list)
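
As a concrete illustration of the declarative style, the sketch below wires the test-driven coding agent into a ReAct-style supervisor. The llms/functions/workflow layout and _type discriminators follow the toolkit's general pattern, but the specific type names, field names, and model identifiers here are assumptions for illustration, not the blog's exact configuration.

    # Hypothetical workflow config; exact keys and type names may differ
    # from the shipped schema.
    llms:
      code_llm:
        _type: nim
        model_name: qwen/qwen2.5-coder-32b-instruct   # example code-generation model
      reasoning_llm:
        _type: nim
        model_name: deepseek-ai/deepseek-r1            # error-diagnosis model

    functions:
      coding_agent:
        _type: test_driven_coding_agent   # hypothetical name for the coding-agent tool
        code_llm: code_llm
        reasoning_llm: reasoning_llm
        max_iterations: 5

    workflow:
      _type: react_agent                  # ReAct-style supervisor
      llm_name: reasoning_llm
      tool_names: [coding_agent]

Under this layout, swapping DeepSeek-R1 for another reasoning model or changing the iteration budget is a one-line edit under llms or functions rather than a code change.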

Key Claims and Findings

  • Agentic code generation is an ideal test-time compute use case because success is objectively verifiable (tests pass or fail)
  • DeepSeek-R1’s chain-of-thought reasoning accurately guides the code generation model through a debugging loop across multiple iterations
  • Agent Toolkit reduces the operational friction around evaluation, deployment, and optimisation — all major challenges in production agentic AI
  • The aiq eval harness enables rapid iteration: change a model or prompt in the config, rerun eval, compare metrics automatically
  • Supervisor agents enable async parallel execution of specialised agents, making complex multi-step tasks more efficient

Agent Design Pattern

Problem statement + code + unit tests
       │
       ▼
  [Code LLM] → generate patch
       │
       ▼
  [Sandbox] → run unit tests
       │
   Pass? ──Yes──► Done
       │
       No
       │
       ▼
  [DeepSeek-R1] → diagnose failure, suggest fix
       │
       └──────────► repeat (up to N iterations)
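
In code, the loop above reduces to a few lines. The sketch below is illustrative rather than the toolkit's implementation: generate_patch, run_tests, and diagnose_failure are placeholders for the code-LLM call, the sandboxed test executor, and the DeepSeek-R1 diagnosis step, passed in as plain callables so the control flow stands on its own.

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class TestResult:
        passed: bool   # True when every unit test passed in the sandbox
        log: str       # captured test output, fed back to the reasoning model

    def repair_loop(
        problem: str,
        code: str,
        tests: str,
        generate_patch: Callable[[str, str, str, str], str],   # code LLM
        run_tests: Callable[[str, str], TestResult],            # sandboxed executor
        diagnose_failure: Callable[[str, str], str],            # reasoning LLM (DeepSeek-R1)
        max_iters: int = 5,
    ) -> str:
        feedback = ""
        patch = code
        for _ in range(max_iters):
            patch = generate_patch(problem, code, tests, feedback)
            result = run_tests(patch, tests)
            if result.passed:
                return patch                      # all tests pass: done
            feedback = diagnose_failure(patch, result.log)
        return patch                              # budget exhausted: return best attempt

Keeping the executor and the two models behind plain callables mirrors the configuration-driven design: the same loop can be re-targeted by editing the YAML config rather than the code.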

Toolkit CLI Reference

  Command                       Purpose
  aiq workflow create <name>    Scaffold a new project template with default workflow and config
  aiq eval                      Run evaluation harness against a dataset using configurable metrics
  aiq serve                     Launch the workflow as a stateless REST microservice
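
A typical session might string these commands together as below. The flag names and paths are assumptions for illustration, not verified CLI output; consult the installed CLI's help for exact options.

    # Scaffold a project, evaluate it against a dataset, then serve it.
    aiq workflow create coding_agent
    aiq eval --config_file coding_agent/configs/eval_config.yml
    aiq serve --config_file coding_agent/configs/config.yml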

Terminology

  • aiq workflow create: CLI subcommand that scaffolds a new Agent Toolkit project template
  • aiq eval: evaluation CLI that tests agents against datasets and scores outputs with customisable metrics
  • Beam search / reasoning models: inference-time search methods that explore multiple reasoning paths before committing to a final answer (e.g., DeepSeek-R1, OpenAI o1)

Connections to Existing Wiki Pages