Batchers — NVIDIA Triton Inference Server
Cross-section page — Deployment and Scaling angle. See the primary page for the full summary, including custom batching API and sequence batcher details.
Deployment and Scaling Angle
Triton’s batching mechanisms are a key lever for maximising GPU utilisation under real-world inference traffic. From a deployment and scaling perspective, the critical trade-off is latency vs. throughput: larger batches improve GPU utilisation but increase individual request latency.
Dynamic Batching as a Throughput Multiplier
Without server-side batching, each inference request occupies GPU resources on its own, leaving the GPU partially idle between requests. The Dynamic Batcher automatically aggregates concurrent requests into larger batches, pushing throughput closer to what the GPU can deliver at its full batch size, without requiring any changes to client code.
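As a minimal sketch, the Dynamic Batcher is enabled in the model's config.pbtxt; the model name, platform, and maximum batch size below are placeholders, not recommendations:

```
# Illustrative config.pbtxt fragment; name, platform and max_batch_size are assumptions.
name: "example_model"
platform: "onnxruntime_onnx"
max_batch_size: 32

# An empty dynamic_batching block enables the Dynamic Batcher with default settings.
dynamic_batching { }
```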
Tuning the Latency-Throughput Trade-Off
max_queue_delay_microseconds is the primary knob: it allows the batcher to wait briefly for more requests before dispatching a sub-maximum batch. Setting it to zero prioritises minimum latency (dispatch as soon as any request arrives); higher values improve fill rate and throughput at the cost of per-request latency. In production, the right value depends on the service-level objective (SLO) for p99 latency.
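A hedged config.pbtxt sketch of that trade-off; the 100-microsecond delay and the preferred batch sizes are illustrative starting points to be tuned against the measured p99 latency, not recommendations:

```
dynamic_batching {
  # Wait up to 100 microseconds for more requests before dispatching a
  # smaller-than-preferred batch; a value of 0 dispatches immediately.
  max_queue_delay_microseconds: 100
  preferred_batch_size: [ 8, 16 ]
}
```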
Priority Queues for SLA Enforcement
priority_levels enables multi-tier service delivery: requests tagged with a higher priority (level 1 is the highest) are scheduled ahead of lower-priority background traffic instead of waiting behind it in a single queue. This supports deployment architectures where interactive user traffic and batch inference workloads share the same Triton instance.
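A sketch of a two-tier setup, assuming interactive traffic is sent with priority 1 and everything else falls through to the default level; the queue-policy numbers are illustrative:

```
dynamic_batching {
  priority_levels: 2
  default_priority_level: 2    # background requests, unless a priority is set per request
  priority_queue_policy {
    key: 1                     # interactive / premium tier
    value {
      timeout_action: REJECT   # shed load rather than let interactive requests sit in the queue
      default_timeout_microseconds: 500000
      max_queue_size: 64
    }
  }
}
```

Clients then tag each request with its priority level, for example via the priority option exposed by Triton's client libraries.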
Stateful Agent Deployments: Sequence Batcher
For agentic AI applications that maintain conversational state across turns, the Sequence Batcher is essential: it routes all requests from a session to the same model instance, preserving the in-memory state that the model needs to continue a conversation. Without this, conversational agents running on multi-instance Triton deployments would lose context between requests.
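A sequence_batching sketch along the lines of the examples in Triton's model-configuration documentation; the control-input tensor names are assumptions that must match inputs the model actually declares, and clients must mark each request with a sequence ID plus start/end flags:

```
sequence_batching {
  # Free a sequence's slot if the session stays idle for more than 60 s (illustrative value).
  max_sequence_idle_microseconds: 60000000

  control_input [
    {
      name: "START"    # assumed tensor name; set to 1 on the first request of a sequence
      control [
        {
          kind: CONTROL_SEQUENCE_START
          fp32_false_true: [ 0, 1 ]
        }
      ]
    },
    {
      name: "END"      # assumed tensor name; set to 1 on the last request of a sequence
      control [
        {
          kind: CONTROL_SEQUENCE_END
          fp32_false_true: [ 0, 1 ]
        }
      ]
    }
  ]
}
```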
Connections
- Triton Backend (Deployment and Scaling) — backend that receives the batches formed by the Dynamic Batcher
- Optimization — NVIDIA Triton Inference Server — perf_analyzer for measuring the latency-throughput curve under different batcher configurations
- Scaling LLMs with Triton and TensorRT-LLM Using Kubernetes — horizontal scaling that complements vertical batching optimisation