Batchers — NVIDIA Triton Inference Server
NVIDIA Triton Inference Server documentation — docs.nvidia.com/deeplearning/triton-inference-server
Abstract
This article covers Triton’s three batching mechanisms — Dynamic Batching, Custom Batching, and Sequence Batching — which allow the server to automatically combine individual inference requests into batches for higher GPU utilisation and throughput. The Dynamic Batcher is the primary tool for stateless models, assembling batches from queued requests up to a configured maximum size, with optional delay to improve fill rate. The Sequence Batcher handles stateful models, routing all requests belonging to a session to the same model instance to preserve state. Custom batching extends the Dynamic Batcher with user-defined C functions that control which requests can be grouped together.
Dynamic Batcher
The Dynamic Batcher enables server-side batching for stateless models without requiring clients to pre-form batches. It is enabled per model via ModelDynamicBatching in the model config.
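For orientation, a minimal config.pbtxt sketch; the model name and backend are placeholders, and input/output specifications are omitted:

```
# config.pbtxt (placeholders; input/output specs omitted)
name: "my_stateless_model"
backend: "onnxruntime"
max_batch_size: 32

# An empty block enables the dynamic batcher with all defaults.
dynamic_batching { }
```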
Tuning Process
- Set max_batch_size in the model config
- Enable with dynamic_batching { } (all defaults)
- Profile with Performance Analyzer to measure latency and throughput
- If within the latency budget, increase max_queue_delay_microseconds to improve batch fill rate (see the sketch after this list)
- Use Model Analyzer for an automated search over batching configurations
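If profiling shows latency headroom, the delay step might look like the following sketch; the 100 µs value is illustrative and should come from Performance Analyzer measurements:

```
dynamic_batching {
  # Let a request wait up to 100 us in the queue so larger batches can form.
  max_queue_delay_microseconds: 100
}
```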
Key Configuration Parameters
| Parameter | Effect |
|---|---|
| preferred_batch_size | Batch sizes to prefer; only set if specific sizes give significantly better performance (e.g., TensorRT multi-profile models) |
| max_queue_delay_microseconds | Max time a request can wait in the queue to join a larger batch; trades latency for throughput |
| preserve_ordering | Force responses to be returned in request-arrival order |
| priority_levels | Define multiple queues so high-priority requests bypass lower-priority ones |
| default_priority_level | Priority assigned to requests that do not specify one |
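A sketch combining these parameters; the sizes and priority levels are illustrative, and preferred_batch_size should only be kept if profiling justifies it:

```
dynamic_batching {
  # Only keep preferred sizes if profiling shows a real win
  # (e.g., TensorRT multi-profile engines).
  preferred_batch_size: [ 4, 8 ]
  # Two queues; requests without an explicit priority land in level 2,
  # so priority-1 requests can bypass them.
  priority_levels: 2
  default_priority_level: 2
  # Return responses in request-arrival order.
  preserve_ordering: true
}
```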
Queue Policy (ModelQueuePolicy)
- max_queue_size: cap on the number of requests waiting in the queue
- timeout_action: reject or delay requests that wait longer than default_timeout_microseconds
- allow_timeout_override: let individual requests specify their own timeout
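A sketch of a queue policy under illustrative values (a 100-request cap and a 500 ms timeout):

```
dynamic_batching {
  default_queue_policy {
    max_queue_size: 100                   # reject new requests beyond 100 waiting
    timeout_action: REJECT                # or DELAY to keep timed-out requests queued
    default_timeout_microseconds: 500000  # 500 ms; illustrative
    allow_timeout_override: true          # clients may supply their own timeout
  }
}
```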
Custom Batching
Extends the Dynamic Batcher with user-defined batching rules implemented as a shared library. Implement five C functions from tritonbackend.h:
| Function | Purpose |
|---|---|
| TRITONBACKEND_ModelBatchIncludeRequest | Decide whether a request joins the current batch |
| TRITONBACKEND_ModelBatchInitialize | Initialise per-batch record-keeping structure |
| TRITONBACKEND_ModelBatchFinalize | Free per-batch structure after the batch is formed |
| TRITONBACKEND_ModelBatcherInitialize | Initialise read-only data used across all batches |
| TRITONBACKEND_ModelBatcherFinalize | Free read-only data when the model is unloaded |
The library path is set via the TRITON_BATCH_STRATEGY_PATH parameter in the model config; alternatively, name the library batchstrategy.so and place it in the model version, model, or backend directory for auto-discovery.
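Below is a skeletal sketch of the five entry points implementing an illustrative "at most eight requests per batch" rule. Signatures are paraphrased from tritonbackend.h (consult the header for the authoritative declarations), error handling is minimal, and the cap of 8 is a made-up value, not a Triton setting:

```c
// batchstrategy.c -- illustrative custom batching strategy that caps each
// dynamically formed batch at a fixed number of requests.
#include <stdbool.h>
#include <stdlib.h>

#include "triton/core/tritonbackend.h"

// Per-batch record keeping: requests admitted so far, plus the cap
// copied from the shared batcher data.
typedef struct {
  unsigned int count;
  unsigned int cap;
} BatchState;

// Create the read-only data shared across all batches (here, just a cap).
TRITONSERVER_Error*
TRITONBACKEND_ModelBatcherInitialize(
    TRITONBACKEND_Batcher** batcher, TRITONBACKEND_Model* model)
{
  unsigned int* cap = malloc(sizeof(unsigned int));
  *cap = 8;  // illustrative cap
  *batcher = (TRITONBACKEND_Batcher*)cap;
  return NULL;  // NULL signals success
}

// Called when Triton starts forming a new batch.
TRITONSERVER_Error*
TRITONBACKEND_ModelBatchInitialize(
    const TRITONBACKEND_Batcher* batcher, void** userp)
{
  BatchState* state = malloc(sizeof(BatchState));
  state->count = 0;
  state->cap = *(const unsigned int*)batcher;
  *userp = state;
  return NULL;
}

// Decide whether the pending request joins the current batch.
TRITONSERVER_Error*
TRITONBACKEND_ModelBatchIncludeRequest(
    TRITONBACKEND_Request* request, void* userp, bool* include)
{
  BatchState* state = (BatchState*)userp;
  *include = (state->count < state->cap);
  if (*include) {
    state->count++;
  }
  return NULL;
}

// Free the per-batch state once the batch has been formed.
TRITONSERVER_Error*
TRITONBACKEND_ModelBatchFinalize(void* userp)
{
  free(userp);
  return NULL;
}

// Free the shared data when the model is unloaded.
TRITONSERVER_Error*
TRITONBACKEND_ModelBatcherFinalize(TRITONBACKEND_Batcher* batcher)
{
  free(batcher);
  return NULL;
}
```

Compiled as a shared library and placed next to the model as batchstrategy.so, this strategy would close every dynamically formed batch after eight requests while still respecting max_batch_size.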
Sequence Batcher
The Sequence Batcher handles stateful models where all requests in a session must reach the same model instance. It creates dynamic batches like the Dynamic Batcher but adds session routing. Configured via ModelSequenceBatching in model config, which controls:
- Sequence timeout
- Control signals (sequence start, end, ready flags)
- Correlation ID routing (maps session identifiers to model instances)
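For concreteness, a sketch of a direct-mode sequence batching configuration; the control tensor names (START, END, READY) and the idle timeout are illustrative:

```
sequence_batching {
  # Free a sequence's instance slot after 5 s of inactivity.
  max_sequence_idle_microseconds: 5000000
  # Direct scheduling: each sequence is pinned to a dedicated batch slot.
  direct { }
  control_input [
    {
      name: "START"
      control [ { kind: CONTROL_SEQUENCE_START fp32_false_true: [ 0, 1 ] } ]
    },
    {
      name: "END"
      control [ { kind: CONTROL_SEQUENCE_END fp32_false_true: [ 0, 1 ] } ]
    },
    {
      name: "READY"
      control [ { kind: CONTROL_SEQUENCE_READY fp32_false_true: [ 0, 1 ] } ]
    }
  ]
}
```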
Terminology
- Stateless model: model with no per-session memory; any request can go to any instance (Dynamic Batcher)
- Stateful model: model that maintains state across requests in a session; all requests for a session must reach the same instance (Sequence Batcher)
- Preferred batch size: batch size(s) at which a model performs significantly better; most models should not set this — only for TensorRT multi-profile cases
- Batch fill rate: fraction of the maximum batch size actually used; low fill rates waste GPU capacity (e.g., averaging 8 requests per batch with max_batch_size 32 is a 25% fill rate)
Connections to Existing Wiki Pages
- Batchers (Deployment and Scaling angle) — cross-section page covering throughput/latency trade-offs from an operations perspective
- Triton Inference Server Backend — backend API that receives the pre-formed batches produced by the Dynamic Batcher
- Optimization — NVIDIA Triton Inference Server — perf_analyzer and Model Analyzer tools referenced in the dynamic batcher tuning process
- Scaling LLMs with Triton and TensorRT-LLM — production deployment where dynamic batching is a core throughput lever