Harness, Scaffold, and the AI Agent Terms Worth Getting Right
Hugging Face blog post by Sergio Paniego and Aritra Roy Gosthipaty, published 2026-05-25. A practical glossary of agent vocabulary that resists consistent definition across the field.
Abstract
The vocabulary of LLM agents has evolved faster than shared understanding, with the same terms — harness, scaffold, agent, tool, skill, sub-agent — carrying different meanings across frameworks, products, and research papers. This glossary attempts to ground the most commonly misused terms by tracing each to its functional role in a running agent system. The central distinction is between scaffolding (the behavior-defining layer: system prompt, tool descriptions, context management) and harness (the execution loop that calls the model, handles tool calls, and decides when to stop). Together, these and the model make an agent: Agent = Model + Harness. The article also covers context engineering, policy, sub-agents, orchestrators, and the RL-specific vocabulary that appears at training time (rollout, trainer, reward, rubric). This builds directly on the conceptual foundation in What are AI Agents? and extends the multi-agent picture in What are Multi-Agent Systems?.
Key Concepts
Model
The bare LLM: text in, text out. No memory between calls, no loop. Can express intent to call a tool, but cannot execute it. The model answers one prompt and stops. Everything else is the harness.
Scaffolding
The behavior-defining layer wrapped around the model:
- System prompt and tool descriptions (what the model is told)
- Output format and response parsing (how its outputs are interpreted)
- Context management (what it remembers across steps)
Scaffolding shapes how the model sees the world and acts in it — both during training and at inference. Some usage (particularly from Claude Code’s own documentation) uses “harness” for this whole outer layer; the scaffold/harness distinction matters most when reasoning about training pipelines, where the two can be varied independently.
Harness
The execution layer inside the agent:
- Calls the model in a loop
- Intercepts tool calls and routes them to the right function
- Feeds results back into context
- Decides when to stop (completion criteria, error handling, guardrails)
Harness engineering is the discipline of designing this layer: stopping conditions, error recovery, cost management, and safety rails. At evaluation time, the same pattern appears as an eval harness — it runs a fixed scenario set and records metrics instead of updating weights.
Products like Claude Code and Codex are specific harnesses built on and tightly coupled to specific models. Swapping the model changes the experience; swapping the harness changes it equally.
Agent
Agent = Model + Harness. The model wrapped in scaffolding and a harness that lets it act in a loop rather than just respond. In RL terms, an agent is a function from observation to action; the environment provides the next observation and the loop repeats. In the LLM context, tool call results are observations returned by the environment.
The model, the harness, and the product are three distinct things: two products using the same model can feel completely different if their harnesses make different choices.
Context Engineering
Designing what enters the agent’s context window at each step: system prompt, tool descriptions, conversation history, retrieved knowledge. The harness actively manages this throughout the run, not just at setup time.
- Short-term memory: what is in the context window during a single run — conversation history, tool results, previous reasoning steps
- Long-term memory: persists across sessions; stored externally and retrieved on demand, then injected into context when relevant
At training time, context engineering shapes what gets learned; getting it wrong requires retraining. At inference it is just text — change a prompt and redeploy.
Policy
The behavior function: given any situation, defines the probability of taking each possible action. In LLM agents, policy is split between the model weights and the surrounding scaffolding and harness. The same checkpoint can behave very differently depending on prompts, tools, memory, and execution loop. A policy is not an agent — it is the specification of behavior; the agent is the full system that enacts it.
Tool Use
How agents reach outside themselves: APIs, code interpreters, databases, web search, file systems. The model expresses intent to use a tool in a structured format; the harness receives the call as a first-class object, routes it, and feeds the result back into context.
Skills
Reusable, structured packages of knowledge that enable multi-step tasks. A tool is an action (“run this command”); a skill bundles everything needed to accomplish a goal (“investigate this bug, form a hypothesis, write a fix”). Skills are portable across agents and loaded on demand. The tool/skill/sub-agent boundary shifts across frameworks.
Sub-agents and Orchestrators
A sub-agent is an agent called by another agent to handle a specific subtask. It has its own model and scaffold, reasons independently, and returns a result. This distinguishes it from a tool (a function call) or a skill (packaged knowledge): a sub-agent can itself reason, use tools, and call further sub-agents. The calling agent is called an orchestrator — a higher-level controller that manages agents as units, each running their own harness, rather than driving a single model through its loop.
RL Training Vocabulary
These terms are specific to training pipelines, where the agent runs through tasks, gets scored, and the model’s weights are updated:
| Term | Definition |
|---|---|
| RL Environment | A stateful object that accepts an action, updates internal state, and returns an observation. In LLM agents, actions are typically tool calls. |
| Trainer | Runs many agent episodes, scores them, and updates model weights. Example: GRPOTrainer in TRL — handles episode generation, reward scoring, and weight updates in one class. |
| Rollout | One full agent run from start to finish: what the agent saw, did, and what reward it received at each step. Also called trajectory or trace. The raw data RL algorithms learn from. |
| Reward | The score that tells the training algorithm whether the model is improving. Can be verifiable (tests pass/fail), or learned (human preferences, LLM-as-judge); sparse (end of episode) or dense (per step). |
| Rubric | Breaks the reward into explicit weighted dimensions rather than a single number. Frameworks like OpenEnv and Verifiers implement rubrics as composable objects (WeightedSum, Sequential, Gate). |
Key Claims
- The harness/scaffold distinction matters most in training pipelines, where they can be varied independently; at inference the blur is acceptable
- Agent = Model + Harness is the community’s convergent framing — if you are not the model, you are the harness
- Two products on the same underlying model behave differently because their harnesses make different choices — the product is neither the model nor the harness alone
- Context engineering carries asymmetric risk: at training, errors require retraining; at inference, errors are just text changes
Connections to Existing Wiki Pages
- What are AI Agents? — foundational agent definition this article sharpens with the harness/scaffold decomposition
- What are Multi-Agent Systems? — covers orchestrator and sub-agent patterns in production depth
- Building Autonomous AI with NVIDIA Agentic NeMo — a concrete harness built on NeMo; the harness/scaffold vocabulary applies directly
- Design Considerations of Advanced Agentic AI — covers stopping conditions and error handling (core harness engineering concerns)
- Cognition, Planning, and Memory — short-term and long-term memory discussed here map to that topic area
- Evaluation and Tuning — eval harness concept, rollout/reward/rubric vocabulary connects to evaluation workflows