Batchers — NVIDIA Triton Inference Server
NVIDIA Triton Inference Server documentation — docs.nvidia.com/deeplearning/triton-inference-server
Abstract
This article covers Triton’s three batching mechanisms — Dynamic Batching, Custom Batching, and Sequence Batching — which allow the server to automatically combine individual inference requests into batches for higher GPU utilisation and throughput. The Dynamic Batcher is the primary tool for stateless models, assembling batches from queued requests up to a configured maximum size, with optional delay to improve fill rate. The Sequence Batcher handles stateful models, routing all requests belonging to a session to the same model instance to preserve state. Custom batching extends the Dynamic Batcher with user-defined C functions that control which requests can be grouped together.
Dynamic Batcher
The Dynamic Batcher enables server-side batching for stateless models without requiring clients to pre-form batches. It is enabled per model via ModelDynamicBatching in the model config.
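For orientation, a minimal config.pbtxt sketch; the model name and backend are placeholders, and input/output specifications are omitted:

```
# config.pbtxt (placeholders; input/output specs omitted)
name: "my_stateless_model"
backend: "onnxruntime"
max_batch_size: 32

# An empty block enables the dynamic batcher with all defaults.
dynamic_batching { }
```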
Tuning Process
- Set max_batch_size in the model config
- Enable with dynamic_batching { } (all defaults)
- Profile with Performance Analyzer to measure latency and throughput
- If within the latency budget, increase max_queue_delay_microseconds to improve batch fill rate (see the sketch after this list)
- Use Model Analyzer for an automated search over batching configurations
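If profiling shows latency headroom, the delay step might look like the following sketch; the 100 µs value is illustrative and should come from Performance Analyzer measurements:

```
dynamic_batching {
  # Let a request wait up to 100 us in the queue so larger batches can form.
  max_queue_delay_microseconds: 100
}
```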
Key Configuration Parameters
| Parameter | Effect |
|---|---|
| preferred_batch_size | Batch sizes to prefer; only set if specific sizes give significantly better performance (e.g., TensorRT multi-profile models) |
| max_queue_delay_microseconds | Max time a request can wait in the queue to join a larger batch; trades latency for throughput |
| preserve_ordering | Force responses to be returned in request-arrival order |
| priority_levels | Define multiple queues so high-priority requests bypass lower-priority ones |
| default_priority_level | Priority assigned to requests that do not specify one |
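A sketch combining these parameters; the sizes and priority levels are illustrative, and preferred_batch_size should only be kept if profiling justifies it:

```
dynamic_batching {
  # Only keep preferred sizes if profiling shows a real win
  # (e.g., TensorRT multi-profile engines).
  preferred_batch_size: [ 4, 8 ]
  # Two queues; requests without an explicit priority land in level 2,
  # so priority-1 requests can bypass them.
  priority_levels: 2
  default_priority_level: 2
  # Return responses in request-arrival order.
  preserve_ordering: true
}
```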
Queue Policy (ModelQueuePolicy)
- max_queue_size: cap on the number of requests waiting in the queue
- timeout_action: reject or delay requests that wait longer than default_timeout_microseconds
- allow_timeout_override: let individual requests specify their own timeout
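A sketch of a queue policy under illustrative values (a 100-request cap and a 500 ms timeout):

```
dynamic_batching {
  default_queue_policy {
    max_queue_size: 100                   # reject new requests beyond 100 waiting
    timeout_action: REJECT                # or DELAY to keep timed-out requests queued
    default_timeout_microseconds: 500000  # 500 ms; illustrative
    allow_timeout_override: true          # clients may supply their own timeout
  }
}
```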
Custom Batching
Extends the Dynamic Batcher with user-defined batching rules implemented as a shared library. Implement five C functions from tritonbackend.h:
| Function | Purpose |
|---|---|
| TRITONBACKEND_ModelBatchIncludeRequest | Decide whether a request joins the current batch |
| TRITONBACKEND_ModelBatchInitialize | Initialise per-batch record-keeping structure |
| TRITONBACKEND_ModelBatchFinalize | Free per-batch structure after the batch is formed |
| TRITONBACKEND_ModelBatcherInitialize | Initialise read-only data used across all batches |
| TRITONBACKEND_ModelBatcherFinalize | Free read-only data when the model is unloaded |
The library path is set via the TRITON_BATCH_STRATEGY_PATH parameter in the model config; alternatively, name the library batchstrategy.so and place it in the model version, model, or backend directory for auto-discovery.
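Below is a skeletal sketch of the five entry points implementing an illustrative "at most eight requests per batch" rule. Signatures are paraphrased from tritonbackend.h (consult the header for the authoritative declarations), error handling is minimal, and the cap of 8 is a made-up value, not a Triton setting:

```c
// batchstrategy.c -- illustrative custom batching strategy that caps each
// dynamically formed batch at a fixed number of requests.
#include <stdbool.h>
#include <stdlib.h>

#include "triton/core/tritonbackend.h"

// Per-batch record keeping: requests admitted so far, plus the cap
// copied from the shared batcher data.
typedef struct {
  unsigned int count;
  unsigned int cap;
} BatchState;

// Create the read-only data shared across all batches (here, just a cap).
TRITONSERVER_Error*
TRITONBACKEND_ModelBatcherInitialize(
    TRITONBACKEND_Batcher** batcher, TRITONBACKEND_Model* model)
{
  unsigned int* cap = malloc(sizeof(unsigned int));
  *cap = 8;  // illustrative cap
  *batcher = (TRITONBACKEND_Batcher*)cap;
  return NULL;  // NULL signals success
}

// Called when Triton starts forming a new batch.
TRITONSERVER_Error*
TRITONBACKEND_ModelBatchInitialize(
    const TRITONBACKEND_Batcher* batcher, void** userp)
{
  BatchState* state = malloc(sizeof(BatchState));
  state->count = 0;
  state->cap = *(const unsigned int*)batcher;
  *userp = state;
  return NULL;
}

// Decide whether the pending request joins the current batch.
TRITONSERVER_Error*
TRITONBACKEND_ModelBatchIncludeRequest(
    TRITONBACKEND_Request* request, void* userp, bool* include)
{
  BatchState* state = (BatchState*)userp;
  *include = (state->count < state->cap);
  if (*include) {
    state->count++;
  }
  return NULL;
}

// Free the per-batch state once the batch has been formed.
TRITONSERVER_Error*
TRITONBACKEND_ModelBatchFinalize(void* userp)
{
  free(userp);
  return NULL;
}

// Free the shared data when the model is unloaded.
TRITONSERVER_Error*
TRITONBACKEND_ModelBatcherFinalize(TRITONBACKEND_Batcher* batcher)
{
  free(batcher);
  return NULL;
}
```

Compiled as a shared library and placed next to the model as batchstrategy.so, this strategy would close every dynamically formed batch after eight requests while still respecting max_batch_size.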
Sequence Batcher
The Sequence Batcher handles stateful models where all requests in a session must reach the same model instance. It creates dynamic batches like the Dynamic Batcher but adds session routing. Configured via ModelSequenceBatching in model config, which controls:
- Sequence timeout
- Control signals (sequence start, end, ready flags)
- Correlation ID routing (maps session identifiers to model instances)
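For concreteness, a sketch of a direct-mode sequence batching configuration; the control tensor names (START, END, READY) and the idle timeout are illustrative:

```
sequence_batching {
  # Free a sequence's instance slot after 5 s of inactivity.
  max_sequence_idle_microseconds: 5000000
  # Direct scheduling: each sequence is pinned to a dedicated batch slot.
  direct { }
  control_input [
    {
      name: "START"
      control [ { kind: CONTROL_SEQUENCE_START fp32_false_true: [ 0, 1 ] } ]
    },
    {
      name: "END"
      control [ { kind: CONTROL_SEQUENCE_END fp32_false_true: [ 0, 1 ] } ]
    },
    {
      name: "READY"
      control [ { kind: CONTROL_SEQUENCE_READY fp32_false_true: [ 0, 1 ] } ]
    }
  ]
}
```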
Terminology
- Stateless model: model with no per-session memory; any request can go to any instance (Dynamic Batcher)
- Stateful model: model that maintains state across requests in a session; all requests for a session must reach the same instance (Sequence Batcher)
- Preferred batch size: batch size(s) at which a model performs significantly better; most models should not set this — only for TensorRT multi-profile cases
- Batch fill rate: fraction of the maximum batch size actually used; low fill rates waste GPU capacity (e.g., averaging 8 requests per batch with max_batch_size 32 is a 25% fill rate)
Connections to Existing Wiki Pages
- Batchers (Deployment and Scaling angle) — cross-section page covering throughput/latency trade-offs from an operations perspective
- Triton Inference Server Backend — backend API that receives the pre-formed batches produced by the Dynamic Batcher
- Optimization — NVIDIA Triton Inference Server — perf_analyzer and Model Analyzer tools referenced in the dynamic batcher tuning process
- Scaling LLMs with Triton and TensorRT-LLM — production deployment where dynamic batching is a core throughput lever