Optimization — NVIDIA Triton Inference Server
Cross-posted to Deployment and Scaling and NVIDIA Platform Implementation.
Abstract
This NVIDIA documentation article describes optimization techniques available in the Triton Inference Server for reducing latency and increasing throughput. Using an Inception model as the running example (and an ONNX DenseNet model for the accelerator benchmark), with the perf_analyzer tool for benchmarking, it introduces three primary optimization levers (dynamic batching, model instance groups, and framework-specific accelerators such as TensorRT and OpenVINO) plus a fourth, hardware-level NUMA host-policy configuration. The article demonstrates quantitatively that enabling TensorRT can double throughput and halve latency; that dynamic batching alone achieves near-optimal throughput; and that combining dynamic batching with multiple model instances is model-specific and may not compound the gains. It provides practical rules of thumb for setting request concurrency to minimize latency or maximize throughput.
Key Concepts
Dynamic Batcher
The dynamic batcher aggregates individual inference requests into a single larger batch before executing them on the GPU. This improves GPU utilization by amortizing fixed dispatch overhead over more work. Enable it by adding dynamic_batching { } to the model configuration file.
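A minimal config.pbtxt sketch of enabling the dynamic batcher; the model name, platform, max_batch_size, and the optional tuning fields are illustrative rather than taken from the source, and an empty dynamic_batching { } block is sufficient on its own:

```
# config.pbtxt -- illustrative values; the dynamic_batching block is the relevant part
name: "inception_onnx"
platform: "onnxruntime_onnx"
max_batch_size: 16

dynamic_batching {
  # Optional tuning; omit these fields to accept Triton's defaults.
  preferred_batch_size: [ 4, 8, 16 ]
  max_queue_delay_microseconds: 100
}
```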
Maximum-throughput concurrency rule: set perf_analyzer request concurrency to 2 × max_batch_size × model_instance_count. The factor of 2 allows Triton to overlap communication of one batch with computation of the next.
Minimum-latency rule: set concurrency to 1, disable dynamic batching, use 1 model instance.
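A hedged perf_analyzer sketch of both rules; the model name, max_batch_size, and concurrency values are illustrative:

```
# Maximum-throughput probe: with max_batch_size = 16 and 1 model instance,
# the rule of thumb gives concurrency = 2 x 16 x 1 = 32.
perf_analyzer -m inception_onnx --percentile=95 --concurrency-range 32:32

# Minimum-latency probe: a single in-flight request, with dynamic batching
# disabled and instance_group count set to 1 in config.pbtxt.
perf_analyzer -m inception_onnx --percentile=95 --concurrency-range 1:1
```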
Model Instance Groups
Multiple copies of a model can be loaded simultaneously (controlled via instance_group [{ count: N }]). Two instances improve throughput by overlapping CPU↔GPU memory transfers with inference compute. For the Inception benchmark, one instance at ~73 inferences/s became ~110 inferences/s with two instances. Adding instances beyond the point at which the GPU is saturated yields diminishing returns.
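A sketch of the corresponding instance_group block with two GPU instances, as in the benchmark; the kind and GPU index are illustrative:

```
# Load two copies of the model on GPU 0
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
```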
TensorRT Optimization (ORT-TRT)
ONNX models running on GPU can be compiled by TensorRT for hardware-specific kernel selection and precision reduction. Configured via the optimization { execution_accelerators { gpu_execution_accelerator ... } } block in the model config. FP16 precision (precision_mode: FP16) delivered ~2× throughput improvement with ~50% latency reduction on the DenseNet benchmark. Caveat: TensorRT compilation dramatically increases model load time; use model warmup in production to hide startup delay.
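A hedged config.pbtxt sketch of the ORT-TRT accelerator with FP16, followed by a minimal model_warmup entry to absorb TensorRT engine building at load time; the workspace size, input tensor name, and dims are illustrative assumptions, not values from the source:

```
optimization {
  execution_accelerators {
    gpu_execution_accelerator: [
      {
        name: "tensorrt"
        parameters { key: "precision_mode" value: "FP16" }
        parameters { key: "max_workspace_size_bytes" value: "1073741824" }
      }
    ]
  }
}

# Warmup request run before the model is reported ready, hiding TRT build time.
model_warmup [
  {
    name: "fp16_warmup"
    batch_size: 1
    inputs {
      key: "data_0"           # illustrative input tensor name
      value: {
        data_type: TYPE_FP32
        dims: [ 3, 224, 224 ]  # illustrative per-sample shape
        zero_data: true
      }
    }
  }
]
```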
OpenVINO Optimization
ONNX models running on CPU can be accelerated by Intel OpenVINO via cpu_execution_accelerator. No benchmarks are provided; the pattern mirrors the GPU TensorRT approach.
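The CPU-side equivalent is a sketch that mirrors the GPU block above:

```
optimization {
  execution_accelerators {
    cpu_execution_accelerator: [
      {
        name: "openvino"
      }
    ]
  }
}
```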
NUMA Host Policies
Modern servers with multiple NUMA nodes expose non-uniform memory access patterns. Triton can bind model instances to a specific NUMA node and CPU-core range via host policies (--host-policy=<name>,<setting>=<value>, with settings such as numa-node=N and cpu-cores=A-B, one setting per flag), then assign instances to that policy via instance_group. This reduces cross-NUMA memory latency and improves throughput predictability on multi-socket machines, which is relevant when serving large models on DGX-class systems.
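A hedged sketch of defining a host policy at server start and attaching an instance group to it; the policy name, NUMA node, core range, and model-repository path are illustrative, and each --host-policy flag carries a single setting:

```
# Server launch: policy "numa0" pinned to NUMA node 0, CPU cores 0-15
tritonserver --model-repository=/models \
             --host-policy=numa0,numa-node=0 \
             --host-policy=numa0,cpu-cores=0-15
```

In config.pbtxt, the instance group then references the policy by name:

```
instance_group [
  {
    count: 2
    kind: KIND_CPU
    host_policy: "numa0"
  }
]
```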
Key Algorithms / Configurations
| Mechanism | Config key | What it does |
|---|---|---|
| Dynamic batching | dynamic_batching {} | Merges requests before dispatch |
| Model instances | instance_group [{ count: N }] | Runs N parallel model copies |
| TensorRT FP16 | gpu_execution_accelerator: tensorrt, precision_mode: FP16 | Recompiles ONNX to TRT engine |
| OpenVINO | cpu_execution_accelerator: openvino | CPU-side inference acceleration |
| NUMA policy | --host-policy CLI flag + instance_group | Binds instance to NUMA node/CPU cores |
Key Claims and Findings
- Baseline (no optimization): ~73 inferences/s at 4 concurrent requests.
- Dynamic batching (8 concurrent): ~272 inferences/s — roughly 4× improvement with comparable latency.
- Two model instances (no dynamic batching): ~110 inferences/s — ~1.5× improvement.
- Dynamic batching + two instances: ~290 inferences/s, but at higher latency; the GPU was already saturated by the dynamic batcher alone.
- TensorRT FP16 on DenseNet: ~2× throughput, ~0.5× latency versus unoptimized ONNX.
- Combining dynamic batching and multiple instances is model-specific: when the GPU is already saturated by batching, adding instances increases latency without throughput gain.
- For maximum throughput: use perf_analyzer to find the optimal setting; do not assume additive gains.
Terminology
| Term | Definition |
|---|---|
| perf_analyzer | Triton client tool for latency/throughput benchmarking across concurrency ranges |
| Model Analyzer | Separate NVIDIA tool for understanding GPU memory utilization across multiple models |
| Dynamic batcher | Server-side component that queues requests and batches them before inference dispatch |
| Instance group | Config construct specifying how many copies of a model to load |
| ORT-TRT | ONNX Runtime + TensorRT — combined inference path for GPU-accelerated ONNX |
| NUMA | Non-Uniform Memory Access — memory architecture affecting latency depending on which socket a thread runs on |
| Model warmup | Config option that runs inference before the server reports ready, hiding TRT compilation latency |
Connections
- Building Autonomous AI with NVIDIA Agentic NeMo — identifies Triton Inference Server as the serving layer in NVIDIA’s agentic stack; the optimization techniques here directly govern throughput and latency for any agent using NVIDIA-hosted model endpoints
- Cross-section: Deployment and Scaling — deployment perspective on the same material
- Cross-section: NVIDIA Platform Implementation — NVIDIA-platform perspective on the same material