Triton Inference Server Backend

Cross-section page covering the Deployment and Scaling angle. See the primary page for the full summary, backend API reference, and lifecycle details.

Deployment and Scaling Angle

Triton backend selection is a foundational deployment decision because it determines which model formats are supported, what hardware acceleration is available, and how model instances scale under concurrent load.

Backend Selection for Production Deployment

| Deployment goal | Recommended backend |
|---|---|
| Maximum GPU throughput for LLMs | TensorRT-LLM backend (TensorRT optimisation + paged KV cache) |
| Cross-framework compatibility | ONNX Runtime backend |
| GPU-accelerated LLM serving (alternative) | vLLM backend |
| Custom pre/post-processing or Python logic | Python backend |
| GPU-accelerated data preprocessing | DALI backend |
| Tree-based ML models (XGBoost, LightGBM) | FIL backend |

The TensorRT-LLM backend is the primary choice for LLM serving at scale, as it integrates directly with TensorRT-LLM’s continuous batching and paged KV-cache attention. The vLLM backend is a GPU-accelerated alternative implemented on top of the Python backend.

Parallel Instance Loading

Backends that support TRITONBACKEND_BackendAttributeSetParallelModelInstanceLoading (Python, ONNX Runtime) allow Triton to initialise multiple model instances concurrently, reducing startup time when deploying high-instance-count configurations. This is directly relevant to fast scale-out on large clusters.
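
A minimal sketch of how a backend opts in, based on the backend C API declared in tritonbackend.h; the rest of the backend implementation (model and instance lifecycle functions) is assumed and not shown:

```cpp
#include "triton/core/tritonbackend.h"

extern "C" {

// Called once when Triton loads the backend shared library. Enabling the
// attribute tells Triton it may initialise this backend's model instances
// concurrently rather than one at a time.
TRITONSERVER_Error*
TRITONBACKEND_GetBackendAttribute(
    TRITONBACKEND_Backend* backend,
    TRITONBACKEND_BackendAttribute* backend_attributes)
{
  TRITONSERVER_Error* err =
      TRITONBACKEND_BackendAttributeSetParallelModelInstanceLoading(
          backend_attributes, true /* enabled */);
  return err;  // nullptr on success
}

}  // extern "C"
```

Backends that do not report this attribute keep the default behaviour, and Triton loads their instances sequentially.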

Decoupled Backends for Streaming

Decoupled backends support streaming inference (multiple responses per request) via the TRITONBACKEND_ResponseFactory API. This enables real-time token streaming in LLM deployments without blocking the client until generation is complete.
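
A rough sketch of the pattern, assuming a hypothetical StreamResponses helper called from TRITONBACKEND_ModelInstanceExecute; output tensor population and error handling are elided:

```cpp
#include "triton/core/tritonbackend.h"

// Hypothetical helper handling one request in a decoupled backend. It streams
// num_chunks responses back before signalling that the response stream is
// complete.
static void
StreamResponses(TRITONBACKEND_Request* request, size_t num_chunks)
{
  // The factory outlives the request, so responses can still be produced
  // after the request has been released back to Triton.
  TRITONBACKEND_ResponseFactory* factory = nullptr;
  TRITONBACKEND_ResponseFactoryNew(&factory, request);
  TRITONBACKEND_RequestRelease(request, TRITONSERVER_REQUEST_RELEASE_ALL);

  for (size_t i = 0; i < num_chunks; ++i) {
    TRITONBACKEND_Response* response = nullptr;
    TRITONBACKEND_ResponseNewFromFactory(&response, factory);

    // ... populate output tensors for this chunk (e.g. the next generated
    // token) via TRITONBACKEND_ResponseOutput / TRITONBACKEND_OutputBuffer ...

    // Send this intermediate response immediately; the client receives it
    // while generation of later chunks continues.
    TRITONBACKEND_ResponseSend(response, 0 /* send_flags */, nullptr);
  }

  // No further responses will follow for this request.
  TRITONBACKEND_ResponseFactorySendFlags(
      factory, TRITONSERVER_RESPONSE_COMPLETE_FINAL);
  TRITONBACKEND_ResponseFactoryDelete(factory);
}
```

The model must also be marked decoupled in its configuration (model_transaction_policy { decoupled: true }) before Triton will accept this response pattern.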

Connections