Optimization — NVIDIA Triton Inference Server
Cross-posted to Deployment and Scaling and NVIDIA Platform Implementation.
Abstract
This NVIDIA documentation article describes optimization techniques available in the Triton Inference Server for reducing latency and increasing throughput. Using an Inception model as the running example (and an ONNX DenseNet model for the accelerator benchmark), with the perf_analyzer tool for benchmarking, it introduces three primary optimization levers (dynamic batching, model instance groups, and framework-specific accelerators such as TensorRT and OpenVINO) plus a fourth, hardware-level NUMA host-policy configuration. The article demonstrates quantitatively that enabling TensorRT can double throughput and halve latency; that dynamic batching alone achieves near-optimal throughput; and that combining dynamic batching with multiple model instances is model-specific and may not compound the gains. It provides practical rules of thumb for setting request concurrency to minimize latency or maximize throughput.
Key Concepts
Dynamic Batcher
The dynamic batcher aggregates individual inference requests into a single larger batch before executing them on the GPU. This improves GPU utilization by amortizing fixed dispatch overhead over more work. Enable it by adding dynamic_batching { } to the model configuration file.
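A minimal config.pbtxt sketch of enabling the dynamic batcher; the model name, platform, max_batch_size, and the optional tuning fields are illustrative rather than taken from the source, and an empty dynamic_batching { } block is sufficient on its own:

```
# config.pbtxt -- illustrative values; the dynamic_batching block is the relevant part
name: "inception_onnx"
platform: "onnxruntime_onnx"
max_batch_size: 16

dynamic_batching {
  # Optional tuning; omit these fields to accept Triton's defaults.
  preferred_batch_size: [ 4, 8, 16 ]
  max_queue_delay_microseconds: 100
}
```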
Maximum-throughput concurrency rule: set perf_analyzer request concurrency to 2 × max_batch_size × model_instance_count. The factor of 2 allows Triton to overlap communication of one batch with computation of the next.
Minimum-latency rule: set concurrency to 1, disable dynamic batching, use 1 model instance.
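A hedged perf_analyzer sketch of both rules; the model name, max_batch_size, and concurrency values are illustrative:

```
# Maximum-throughput probe: with max_batch_size = 16 and 1 model instance,
# the rule of thumb gives concurrency = 2 x 16 x 1 = 32.
perf_analyzer -m inception_onnx --percentile=95 --concurrency-range 32:32

# Minimum-latency probe: a single in-flight request, with dynamic batching
# disabled and instance_group count set to 1 in config.pbtxt.
perf_analyzer -m inception_onnx --percentile=95 --concurrency-range 1:1
```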
Model Instance Groups
Multiple copies of a model can be loaded simultaneously (controlled via instance_group [{ count: N }]). Two instances improve throughput by overlapping CPU↔GPU memory transfers with inference compute. For the Inception benchmark, one instance at ~73 inferences/s became ~110 inferences/s with two instances. Adding instances beyond the point at which the GPU is saturated yields diminishing returns.
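A sketch of the corresponding instance_group block with two GPU instances, as in the benchmark; the kind and GPU index are illustrative:

```
# Load two copies of the model on GPU 0
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
```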
TensorRT Optimization (ORT-TRT)
ONNX models running on GPU can be compiled by TensorRT for hardware-specific kernel selection and precision reduction. Configured via the optimization { execution_accelerators { gpu_execution_accelerator ... } } block in the model config. FP16 precision (precision_mode: FP16) delivered ~2× throughput improvement with ~50% latency reduction on the DenseNet benchmark. Caveat: TensorRT compilation dramatically increases model load time; use model warmup in production to hide startup delay.
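A hedged config.pbtxt sketch of the ORT-TRT accelerator with FP16, followed by a minimal model_warmup entry to absorb TensorRT engine building at load time; the workspace size, input tensor name, and dims are illustrative assumptions, not values from the source:

```
optimization {
  execution_accelerators {
    gpu_execution_accelerator: [
      {
        name: "tensorrt"
        parameters { key: "precision_mode" value: "FP16" }
        parameters { key: "max_workspace_size_bytes" value: "1073741824" }
      }
    ]
  }
}

# Warmup request run before the model is reported ready, hiding TRT build time.
model_warmup [
  {
    name: "fp16_warmup"
    batch_size: 1
    inputs {
      key: "data_0"           # illustrative input tensor name
      value: {
        data_type: TYPE_FP32
        dims: [ 3, 224, 224 ]  # illustrative per-sample shape
        zero_data: true
      }
    }
  }
]
```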
OpenVINO Optimization
ONNX models running on CPU can be accelerated by Intel OpenVINO via cpu_execution_accelerator. No benchmarks are provided; the pattern mirrors the GPU TensorRT approach.
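The CPU-side equivalent is a sketch that mirrors the GPU block above:

```
optimization {
  execution_accelerators {
    cpu_execution_accelerator: [
      {
        name: "openvino"
      }
    ]
  }
}
```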
NUMA Host Policies
Modern servers with multiple NUMA nodes expose non-uniform memory access patterns. Triton can bind model instances to a specific NUMA node and CPU-core range via host policies (--host-policy=<name>,<setting>=<value>, with settings such as numa-node=N and cpu-cores=A-B, one setting per flag), then assign instances to that policy via instance_group. This reduces cross-NUMA memory latency and improves throughput predictability on multi-socket machines, which is relevant when serving large models on DGX-class systems.
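A hedged sketch of defining a host policy at server start and attaching an instance group to it; the policy name, NUMA node, core range, and model-repository path are illustrative, and each --host-policy flag carries a single setting:

```
# Server launch: policy "numa0" pinned to NUMA node 0, CPU cores 0-15
tritonserver --model-repository=/models \
             --host-policy=numa0,numa-node=0 \
             --host-policy=numa0,cpu-cores=0-15
```

In config.pbtxt, the instance group then references the policy by name:

```
instance_group [
  {
    count: 2
    kind: KIND_CPU
    host_policy: "numa0"
  }
]
```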
Key Algorithms / Configurations
| Mechanism | Config key | What it does |
|---|---|---|
| Dynamic batching | dynamic_batching {} | Merges requests before dispatch |
| Model instances | instance_group [{ count: N }] | Runs N parallel model copies |
| TensorRT FP16 | gpu_execution_accelerator: tensorrt, precision_mode: FP16 | Recompiles ONNX to TRT engine |
| OpenVINO | cpu_execution_accelerator: openvino | CPU-side inference acceleration |
| NUMA policy | --host-policy CLI flag + instance_group | Binds instance to NUMA node/CPU cores |
Key Claims and Findings
- Baseline (no optimization): ~73 inferences/s at 4 concurrent requests.
- Dynamic batching (8 concurrent): ~272 inferences/s — roughly 4× improvement with comparable latency.
- Two model instances (no dynamic batching): ~110 inferences/s — ~1.5× improvement.
- Dynamic batching + two instances: ~290 inferences/s, but at higher latency; the GPU was already saturated by the dynamic batcher alone.
- TensorRT FP16 on DenseNet: ~2× throughput, ~0.5× latency versus unoptimized ONNX.
- Combining dynamic batching and multiple instances is model-specific: when the GPU is already saturated by batching, adding instances increases latency without throughput gain.
- For maximum throughput: use perf_analyzer to find the optimal setting; do not assume additive gains.
Terminology
| Term | Definition |
|---|---|
| perf_analyzer | Triton client tool for latency/throughput benchmarking across concurrency ranges |
| Model Analyzer | Separate NVIDIA tool for understanding GPU memory utilization across multiple models |
| Dynamic batcher | Server-side component that queues requests and batches them before inference dispatch |
| Instance group | Config construct specifying how many copies of a model to load |
| ORT-TRT | ONNX Runtime + TensorRT — combined inference path for GPU-accelerated ONNX |
| NUMA | Non-Uniform Memory Access — memory architecture affecting latency depending on which socket a thread runs on |
| Model warmup | Config option that runs inference before the server reports ready, hiding TRT compilation latency |
Connections
- Building Autonomous AI with NVIDIA Agentic NeMo — identifies Triton Inference Server as the serving layer in NVIDIA’s agentic stack; the optimization techniques here directly govern throughput and latency for any agent using NVIDIA-hosted model endpoints
- Cross-section: Deployment and Scaling — deployment perspective on the same material
- Cross-section: NVIDIA Platform Implementation — NVIDIA-platform perspective on the same material