Batchers — NVIDIA Triton Inference Server

NVIDIA Triton Inference Server documentation — docs.nvidia.com/deeplearning/triton-inference-server

Abstract

This article covers Triton’s three batching mechanisms — Dynamic Batcher, Custom Batcher, and Sequence Batcher — which allow the server to automatically combine individual inference requests into batches for higher GPU utilisation and throughput. The Dynamic Batcher is the primary tool for stateless models, assembling batches from queued requests up to a configured maximum size, with optional delay to improve fill rate. The Sequence Batcher handles stateful models, routing all requests belonging to a session to the same model instance to preserve state. Custom batching extends the Dynamic Batcher with user-defined C functions that control which requests can be grouped together.

Dynamic Batcher

The Dynamic Batcher enables server-side batching for stateless models without requiring clients to pre-form batches. It is enabled per model via ModelDynamicBatching in the model config.

Tuning Process

  1. Set max_batch_size in the model config
  2. Enable with dynamic_batching { } (all defaults)
  3. Profile with Performance Analyzer to measure latency and throughput
  4. If within the latency budget, increase max_queue_delay_microseconds to improve batch fill rate (see the sketch after this list)
  5. Use Model Analyzer for automated search over batching configurations
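
A minimal config.pbtxt sketch of steps 1, 2, and 4 for a hypothetical model; the batch size and delay values are illustrative assumptions, not recommendations:

    # Step 1: upper bound on the server-side batch size.
    max_batch_size: 16

    # Step 2 on its own would be just "dynamic_batching { }".
    # Step 4: let requests wait briefly so larger batches can form.
    dynamic_batching {
      max_queue_delay_microseconds: 100
    }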

Key Configuration Parameters

  • preferred_batch_size: batch sizes to prefer; only set if specific sizes give significantly better performance (e.g., TensorRT multi-profile models)
  • max_queue_delay_microseconds: maximum time a request may wait in the queue to join a larger batch; trades latency for throughput
  • preserve_ordering: force responses to be returned in request-arrival order
  • priority_levels: define multiple queues so high-priority requests bypass lower-priority ones
  • default_priority_level: priority assigned to requests that do not specify one
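
A sketch combining these parameters in one dynamic_batching block; every value is an assumption about a hypothetical model, and preferred_batch_size appears only because this imagined model is a TensorRT multi-profile case:

    dynamic_batching {
      preferred_batch_size: [ 4, 8 ]       # only if these sizes truly perform better
      max_queue_delay_microseconds: 200    # wait up to 200 us to fill a batch
      preserve_ordering: true              # return responses in request-arrival order
      priority_levels: 2                   # two queues; lower number = higher priority
      default_priority_level: 2            # unlabelled requests join the lower-priority queue
    }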

Queue Policy (ModelQueuePolicy)

  • max_queue_size: cap on number of requests waiting in queue
  • timeout_action: whether to reject or delay (REJECT/DELAY) requests that wait longer than default_timeout_microseconds
  • allow_timeout_override: let individual requests specify their own timeout
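
A sketch of these fields as a default_queue_policy inside dynamic_batching; the queue cap and timeout are illustrative assumptions:

    dynamic_batching {
      default_queue_policy {
        max_queue_size: 8                      # requests beyond 8 waiting are rejected
        timeout_action: REJECT                 # REJECT or DELAY
        default_timeout_microseconds: 100000   # 100 ms maximum wait in queue
        allow_timeout_override: true           # requests may supply their own timeout
      }
    }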

Custom Batching

Custom batching extends the Dynamic Batcher with user-defined batching rules packaged as a shared library. The library implements five C functions declared in tritonbackend.h:

  • TRITONBACKEND_ModelBatchIncludeRequest: decide whether a request joins the current batch
  • TRITONBACKEND_ModelBatchInitialize: initialise the per-batch record-keeping structure
  • TRITONBACKEND_ModelBatchFinalize: free the per-batch structure after the batch is formed
  • TRITONBACKEND_ModelBatcherInitialize: initialise read-only data used across all batches
  • TRITONBACKEND_ModelBatcherFinalize: free the read-only data when the model is unloaded

The library path is set via the TRITON_BATCH_STRATEGY_PATH parameter in the model config; alternatively, a library named batchstrategy.so placed in the model or backend directory is discovered automatically.
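
In config.pbtxt, the parameter form looks like the following sketch (the library path is hypothetical):

    parameters {
      key: "TRITON_BATCH_STRATEGY_PATH"
      value { string_value: "/models/my_model/batchstrategy.so" }
    }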

Sequence Batcher

The Sequence Batcher handles stateful models where all requests in a session must reach the same model instance. It creates dynamic batches like the Dynamic Batcher but adds session routing. It is configured via ModelSequenceBatching in the model config (sketched after the list below), which controls:

  • Sequence timeout
  • Control signals (sequence start, end, ready flags)
  • Correlation ID routing (maps session identifiers to model instances)
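
A sketch of a sequence_batching block, assuming the default (Direct) scheduling strategy; the control tensor names START and READY are model-specific assumptions:

    sequence_batching {
      # Sequence timeout: free a sequence slot idle longer than 5 s.
      max_sequence_idle_microseconds: 5000000
      control_input [
        {
          name: "START"   # model input fed the sequence-start flag
          control [ { kind: CONTROL_SEQUENCE_START, fp32_false_true: [ 0, 1 ] } ]
        },
        {
          name: "READY"   # model input fed the slot-ready flag
          control [ { kind: CONTROL_SEQUENCE_READY, fp32_false_true: [ 0, 1 ] } ]
        }
      ]
    }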

Terminology

  • Stateless model: model with no per-session memory; any request can go to any instance (Dynamic Batcher)
  • Stateful model: model that maintains state across requests in a session; all requests for a session must reach the same instance (Sequence Batcher)
  • Preferred batch size: batch size(s) at which a model performs significantly better; most models should not set this — only for TensorRT multi-profile cases
  • Batch fill rate: fraction of the maximum batch size actually used; low fill rates waste GPU capacity

Connections to Existing Wiki Pages