Harness, Scaffold, and the AI Agent Terms Worth Getting Right

Hugging Face blog post by Sergio Paniego and Aritra Roy Gosthipaty, published 2026-05-25. A practical glossary of agent vocabulary that resists consistent definition across the field.

Abstract

The vocabulary of LLM agents has evolved faster than shared understanding, with the same terms — harness, scaffold, agent, tool, skill, sub-agent — carrying different meanings across frameworks, products, and research papers. This glossary attempts to ground the most commonly misused terms by tracing each to its functional role in a running agent system. The central distinction is between scaffolding (the behavior-defining layer: system prompt, tool descriptions, context management) and harness (the execution loop that calls the model, handles tool calls, and decides when to stop). Together, these and the model make an agent: Agent = Model + Harness. The article also covers context engineering, policy, sub-agents, orchestrators, and the RL-specific vocabulary that appears at training time (rollout, trainer, reward, rubric). This builds directly on the conceptual foundation in What are AI Agents? and extends the multi-agent picture in What are Multi-Agent Systems?.

Key Concepts

Model

The bare LLM: text in, text out. No memory between calls, no loop. Can express intent to call a tool, but cannot execute it. The model answers one prompt and stops. Everything else is the harness.

Scaffolding

The behavior-defining layer wrapped around the model:

  • System prompt and tool descriptions (what the model is told)
  • Output format and response parsing (how its outputs are interpreted)
  • Context management (what it remembers across steps)

Scaffolding shapes how the model sees the world and acts in it — both during training and at inference. Some usage (particularly from Claude Code’s own documentation) uses “harness” for this whole outer layer; the scaffold/harness distinction matters most when reasoning about training pipelines, where the two can be varied independently.

Harness

The execution layer inside the agent:

  • Calls the model in a loop
  • Intercepts tool calls and routes them to the right function
  • Feeds results back into context
  • Decides when to stop (completion criteria, error handling, guardrails)

Harness engineering is the discipline of designing this layer: stopping conditions, error recovery, cost management, and safety rails. At evaluation time, the same pattern appears as an eval harness — it runs a fixed scenario set and records metrics instead of updating weights.

Products like Claude Code and Codex are specific harnesses built on and tightly coupled to specific models. Swapping the model changes the experience; swapping the harness changes it equally.

Agent

Agent = Model + Harness. The model wrapped in scaffolding and a harness that lets it act in a loop rather than just respond. In RL terms, an agent is a function from observation to action; the environment provides the next observation and the loop repeats. In the LLM context, tool call results are observations returned by the environment.

The model, the harness, and the product are three distinct things: two products using the same model can feel completely different if their harnesses make different choices.

Context Engineering

Designing what enters the agent’s context window at each step: system prompt, tool descriptions, conversation history, retrieved knowledge. The harness actively manages this throughout the run, not just at setup time.

  • Short-term memory: what is in the context window during a single run — conversation history, tool results, previous reasoning steps
  • Long-term memory: persists across sessions; stored externally and retrieved on demand, then injected into context when relevant

At training time, context engineering shapes what gets learned; getting it wrong requires retraining. At inference it is just text — change a prompt and redeploy.

Policy

The behavior function: given any situation, defines the probability of taking each possible action. In LLM agents, policy is split between the model weights and the surrounding scaffolding and harness. The same checkpoint can behave very differently depending on prompts, tools, memory, and execution loop. A policy is not an agent — it is the specification of behavior; the agent is the full system that enacts it.

Tool Use

How agents reach outside themselves: APIs, code interpreters, databases, web search, file systems. The model expresses intent to use a tool in a structured format; the harness receives the call as a first-class object, routes it, and feeds the result back into context.

Skills

Reusable, structured packages of knowledge that enable multi-step tasks. A tool is an action (“run this command”); a skill bundles everything needed to accomplish a goal (“investigate this bug, form a hypothesis, write a fix”). Skills are portable across agents and loaded on demand. The tool/skill/sub-agent boundary shifts across frameworks.

Sub-agents and Orchestrators

A sub-agent is an agent called by another agent to handle a specific subtask. It has its own model and scaffold, reasons independently, and returns a result. This distinguishes it from a tool (a function call) or a skill (packaged knowledge): a sub-agent can itself reason, use tools, and call further sub-agents. The calling agent is called an orchestrator — a higher-level controller that manages agents as units, each running their own harness, rather than driving a single model through its loop.

RL Training Vocabulary

These terms are specific to training pipelines, where the agent runs through tasks, gets scored, and the model’s weights are updated:

TermDefinition
RL EnvironmentA stateful object that accepts an action, updates internal state, and returns an observation. In LLM agents, actions are typically tool calls.
TrainerRuns many agent episodes, scores them, and updates model weights. Example: GRPOTrainer in TRL — handles episode generation, reward scoring, and weight updates in one class.
RolloutOne full agent run from start to finish: what the agent saw, did, and what reward it received at each step. Also called trajectory or trace. The raw data RL algorithms learn from.
RewardThe score that tells the training algorithm whether the model is improving. Can be verifiable (tests pass/fail), or learned (human preferences, LLM-as-judge); sparse (end of episode) or dense (per step).
RubricBreaks the reward into explicit weighted dimensions rather than a single number. Frameworks like OpenEnv and Verifiers implement rubrics as composable objects (WeightedSum, Sequential, Gate).

Key Claims

  • The harness/scaffold distinction matters most in training pipelines, where they can be varied independently; at inference the blur is acceptable
  • Agent = Model + Harness is the community’s convergent framing — if you are not the model, you are the harness
  • Two products on the same underlying model behave differently because their harnesses make different choices — the product is neither the model nor the harness alone
  • Context engineering carries asymmetric risk: at training, errors require retraining; at inference, errors are just text changes

Connections to Existing Wiki Pages