Optimization — NVIDIA Triton Inference Server

Cross-posted to Deployment and Scaling and NVIDIA Platform Implementation.

Abstract

This NVIDIA documentation article describes optimization techniques available in the Triton Inference Server for reducing latency and increasing throughput. Using an ONNX Inception model as a running example and the perf_analyzer tool for benchmarking, it introduces three primary optimization levers — dynamic batching, model instance groups, and framework-specific accelerators (TensorRT, OpenVINO) — and a fourth, hardware-level NUMA policy configuration. The article demonstrates quantitatively that enabling TensorRT can double throughput and halve latency; that dynamic batching alone achieves near-optimal throughput; and that combining dynamic batching with multiple model instances is model-specific and may not compound the gains. It provides practical rules of thumb for setting concurrency to minimize latency or maximize throughput.


Key Concepts

Dynamic Batcher

The dynamic batcher aggregates individual inference requests into a single larger batch before executing them on the GPU. This improves GPU utilization by amortizing fixed dispatch overhead over more work. Enable it by adding dynamic_batching { } to the model configuration file.
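A minimal model-configuration sketch (the model name, platform, and max_batch_size are illustrative, not taken from the article's benchmark config):

```
# config.pbtxt (illustrative values)
name: "inception_onnx"
platform: "onnxruntime_onnx"
max_batch_size: 8

# Enable dynamic batching; max_queue_delay_microseconds optionally bounds
# how long a request may wait in the queue to be merged into a batch.
dynamic_batching {
  max_queue_delay_microseconds: 100
}
```

An empty `dynamic_batching { }` block is sufficient to enable the feature with default settings.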

Maximum-throughput concurrency rule: set perf_analyzer request concurrency to 2 × max_batch_size × model_instance_count. The factor of 2 allows Triton to overlap communication of one batch with computation of the next.

Minimum-latency rule: set concurrency to 1, disable dynamic batching, use 1 model instance.
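The two rules map onto perf_analyzer invocations roughly as follows (the model name is illustrative; the concurrency of 16 assumes max_batch_size = 8 and one model instance, i.e. 2 × 8 × 1):

```
# Minimum latency: a single in-flight request
perf_analyzer -m inception_onnx --concurrency-range 1:1

# Maximum throughput: 2 x max_batch_size x instance_count in-flight requests
perf_analyzer -m inception_onnx --concurrency-range 16:16
```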

Model Instance Groups

Multiple copies of a model can be loaded simultaneously (controlled via instance_group [{ count: N }]). Two instances improve throughput by overlapping CPU↔GPU memory transfers with inference compute. For the Inception benchmark, one instance at 73 inferences/s became ~110 inferences/s with two instances. Adding instances beyond what the GPU can saturate yields diminishing returns.
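In the model configuration this looks like the following sketch (count and kind chosen for illustration):

```
instance_group [
  {
    count: 2        # load two copies of the model
    kind: KIND_GPU  # place both instances on GPU
  }
]
```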

TensorRT Optimization (ORT-TRT)

ONNX models running on GPU can be compiled by TensorRT for hardware-specific kernel selection and precision reduction. This is configured via the optimization { execution_accelerators { gpu_execution_accelerator ... } } block in the model config. FP16 precision (precision_mode: FP16) delivered ~2× throughput improvement with ~50% latency reduction on the Inception benchmark. Caveat: TensorRT compilation dramatically increases model load time; use model warmup in production to hide the startup delay.
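The configuration block follows this shape (the 1 GiB workspace size is a commonly shown value, not one mandated by the article):

```
optimization {
  execution_accelerators {
    gpu_execution_accelerator : [
      {
        name : "tensorrt"
        # Allow reduced precision for faster kernels
        parameters { key: "precision_mode" value: "FP16" }
        # Scratch memory budget for TensorRT engine building (1 GiB)
        parameters { key: "max_workspace_size_bytes" value: "1073741824" }
      }
    ]
  }
}
```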

OpenVINO Optimization

ONNX models running on CPU can be accelerated by Intel OpenVINO via cpu_execution_accelerator. No benchmarks are provided; the pattern mirrors the GPU TensorRT approach.
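A sketch of the analogous CPU-side block:

```
optimization {
  execution_accelerators {
    cpu_execution_accelerator : [
      { name : "openvino" }
    ]
  }
}
```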

NUMA Host Policies

Modern servers with multiple NUMA nodes expose non-uniform memory access latencies. Triton can bind model instances to a specific NUMA node and CPU-core range via named host policies, passed one setting per flag (--host-policy=<name>,numa-node=N and --host-policy=<name>,cpu-cores=A-B), with instances assigned to a policy through the host_policy field of instance_group. This reduces cross-NUMA memory latency and improves throughput predictability on multi-socket machines, which is relevant when serving large models on DGX-class systems.
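A hypothetical policy named policy_0, pinning instances to NUMA node 0 and cores 0–15, might be launched as (repository path illustrative):

```
tritonserver --model-repository=/models \
  --host-policy=policy_0,numa-node=0 \
  --host-policy=policy_0,cpu-cores=0-15
```

The model then opts in by setting host_policy: "policy_0" inside its instance_group entry.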


Key Algorithms / Configurations

Mechanism | Config key | What it does
Dynamic batching | dynamic_batching {} | Merges requests before dispatch
Model instances | instance_group [{ count: N }] | Runs N parallel model copies
TensorRT FP16 | gpu_execution_accelerator: tensorrt, precision_mode: FP16 | Recompiles ONNX to a TRT engine
OpenVINO | cpu_execution_accelerator: openvino | CPU-side inference acceleration
NUMA policy | --host-policy CLI flag + instance_group | Binds an instance to a NUMA node / CPU cores

Key Claims and Findings

  • Baseline (no optimization): ~73 inferences/s at 4 concurrent requests.
  • Dynamic batching (8 concurrent): ~272 inferences/s — roughly 4× improvement with comparable latency.
  • Two model instances (no dynamic batching): ~110 inferences/s — ~1.5× improvement.
  • Dynamic batching + two instances: ~290 inferences/s, but at higher latency; the GPU was already saturated by the dynamic batcher alone.
  • TensorRT FP16 on the Inception benchmark: ~2× throughput, ~0.5× latency versus unoptimized ONNX.
  • Combining dynamic batching and multiple instances is model-specific: when the GPU is already saturated by batching, adding instances increases latency without throughput gain.
  • For maximum throughput: use perf_analyzer to find the optimal setting — do not assume additive gains.
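In practice this means sweeping concurrency empirically rather than computing it (model name and sweep range are illustrative):

```
# Sweep concurrency 1..32 in steps of 2, reporting p95 latency
perf_analyzer -m inception_onnx --concurrency-range 1:32:2 --percentile=95
```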

Terminology

Term | Definition
perf_analyzer | Triton client tool for latency/throughput benchmarking across concurrency ranges
Model Analyzer | Separate NVIDIA tool for understanding GPU memory utilization across multiple models
Dynamic batcher | Server-side component that queues requests and batches them before inference dispatch
Instance group | Config construct specifying how many copies of a model to load
ORT-TRT | ONNX Runtime + TensorRT: the combined inference path for GPU-accelerated ONNX
NUMA | Non-Uniform Memory Access: memory architecture in which latency depends on which socket a thread runs on
Model warmup | Config option that runs inference before the server reports ready, hiding TRT compilation latency

Connections