Triton Inference Server Backend
NVIDIA Triton Inference Server documentation — docs.nvidia.com/deeplearning/triton-inference-server
Abstract
A Triton backend is the pluggable execution unit that runs a model within NVIDIA Triton Inference Server. Each backend is a shared library (libtriton_<name>.so) implementing the Triton Backend API, which Triton calls to initialise the model, execute inference requests, and manage its lifecycle. Triton ships official backends for TensorRT, ONNX Runtime, TensorFlow, PyTorch, OpenVINO, Python, DALI, FIL, TensorRT-LLM, and vLLM. Developers can also implement custom backends in C/C++ or Python. The article covers the full backend API, object model (Backend, Model, ModelInstance, Request, Response), lifecycle management, request-response patterns (single response vs. decoupled multi-response), and how to build and install custom backends.
Supported Backends
| Backend | Model formats | Use case |
|---|---|---|
| TensorRT | TensorRT engines | Optimised inference on NVIDIA GPUs |
| ONNX Runtime | ONNX models | Cross-framework portable inference |
| TensorFlow | GraphDef, SavedModel (TF1 + TF2) | TensorFlow model serving |
| PyTorch | TorchScript, PyTorch 2.0 | PyTorch model serving |
| OpenVINO | OpenVINO IR models | CPU/Intel-optimised inference |
| Python | Arbitrary Python logic | Pre/post processing, custom models |
| DALI | DALI pipelines | GPU-accelerated data preprocessing |
| FIL | XGBoost, LightGBM, Scikit-Learn RF | Tree-based ML models |
| TensorRT-LLM | TRT-LLM engines | Optimised LLM inference |
| vLLM | vLLM-supported models | LLM serving via vLLM engine |
Key Concepts
- Backend shared library: each backend is `libtriton_<name>.so`; Triton searches model repo → model dir → global backend dir (`/opt/tritonserver/backends/`) in priority order
- TRITONBACKEND_Backend: singleton shared across all models using a backend; holds backend-level state; `TRITONBACKEND_Initialize`/`Finalize` called once per process
- TRITONBACKEND_Model: one per loaded model, shared across all instances; `TRITONBACKEND_ModelInitialize`/`Finalize` called on load/unload
- TRITONBACKEND_ModelInstance: one per instance group entry; `TRITONBACKEND_ModelInstanceExecute` is the only mandatory function, called by Triton to run inference on a batch (see the sketch after this list)
- TRITONBACKEND_Request / Response: request objects are owned by the backend during execution and must be released via `TRITONBACKEND_RequestRelease`; responses are created via `TRITONBACKEND_ResponseNew`
- Decoupled backend: backend that sends multiple responses per request out-of-order; uses a `ResponseFactory` per request; must send at least one final (completion-flagged) response per request
- Python-based backends: backends implementing the `TritonPythonModel` interface (`execute`, optionally `initialize`/`finalize`); shareable across models; the vLLM backend is an example
- Parallel model instance loading: opt-in backend attribute that enables concurrent `ModelInstanceInitialize` calls, improving startup time for large instance counts (supported by the Python and ONNX Runtime backends)
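As a concrete illustration of the single-response pattern above, here is a minimal C sketch of `TRITONBACKEND_ModelInstanceExecute`. It assumes the header is available as `triton/core/tritonbackend.h` (as in the backend repo layout); input reading and output population are elided, so it only shows the response and release bookkeeping.

```c
#include <stddef.h>
#include <stdint.h>

#include "triton/core/tritonbackend.h"

/* Minimal single-response sketch: one response per request, then release
 * the request. A real backend would read inputs (TRITONBACKEND_RequestInput,
 * TRITONBACKEND_InputBuffer) and allocate/fill outputs
 * (TRITONBACKEND_ResponseOutput, TRITONBACKEND_OutputBuffer) before sending. */
TRITONSERVER_Error*
TRITONBACKEND_ModelInstanceExecute(
    TRITONBACKEND_ModelInstance* instance, TRITONBACKEND_Request** requests,
    const uint32_t request_count)
{
  /* Per-instance state would normally be fetched here via
   * TRITONBACKEND_ModelInstanceState; unused in this sketch. */
  (void)instance;

  for (uint32_t r = 0; r < request_count; ++r) {
    TRITONBACKEND_Request* request = requests[r];

    /* Create the single response for this request. */
    TRITONBACKEND_Response* response = NULL;
    TRITONSERVER_Error* err = TRITONBACKEND_ResponseNew(&response, request);

    if (err == NULL) {
      /* ... run inference and populate output tensors here ... */

      /* Mark the response complete and final; passing a non-NULL error
       * instead would turn it into an error response. */
      TRITONBACKEND_ResponseSend(
          response, TRITONSERVER_RESPONSE_COMPLETE_FINAL, NULL /* success */);
    } else {
      TRITONSERVER_ErrorDelete(err);
    }

    /* The backend owns the request until it explicitly releases it. */
    TRITONBACKEND_RequestRelease(request, TRITONSERVER_REQUEST_RELEASE_ALL);
  }

  /* NULL indicates the batch was accepted and handled. */
  return NULL;
}
```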
Backend Lifecycle
Model load triggered
→ Load backend .so (if not already loaded) → TRITONBACKEND_Initialize
→ TRITONBACKEND_ModelInitialize
→ For each instance: TRITONBACKEND_ModelInstanceInitialize
Inference request arrives
→ TRITONBACKEND_ModelInstanceExecute (batch of requests)
→ backend creates responses, releases requests, returns
Model unload triggered
→ For each instance: TRITONBACKEND_ModelInstanceFinalize
→ TRITONBACKEND_ModelFinalize
→ (backend kept until process exit) → TRITONBACKEND_Finalize
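A hedged C sketch of the initialisation side of this flow follows. The API-version check is the conventional first step when the shared library loads; the log message and the absence of any real per-instance state are placeholders.

```c
#include <stdint.h>
#include <stdio.h>

#include "triton/core/tritonbackend.h"

/* Called once when the backend shared library is first loaded. */
TRITONSERVER_Error*
TRITONBACKEND_Initialize(TRITONBACKEND_Backend* backend)
{
  /* Backend-level state could be attached with TRITONBACKEND_BackendSetState. */
  (void)backend;

  /* Refuse to run against an incompatible Triton core. */
  uint32_t api_major, api_minor;
  TRITONSERVER_Error* err = TRITONBACKEND_ApiVersion(&api_major, &api_minor);
  if (err != NULL) {
    return err;
  }
  if ((api_major != TRITONBACKEND_API_VERSION_MAJOR) ||
      (api_minor < TRITONBACKEND_API_VERSION_MINOR)) {
    return TRITONSERVER_ErrorNew(
        TRITONSERVER_ERROR_UNSUPPORTED,
        "Triton backend API version does not support this backend");
  }
  return NULL; /* success */
}

/* Called once per instance_group entry before any requests arrive. */
TRITONSERVER_Error*
TRITONBACKEND_ModelInstanceInitialize(TRITONBACKEND_ModelInstance* instance)
{
  const char* name = NULL;
  TRITONBACKEND_ModelInstanceName(instance, &name);

  char msg[256];
  snprintf(msg, sizeof(msg), "initialising instance %s", name);
  TRITONSERVER_LogMessage(TRITONSERVER_LOG_INFO, __FILE__, __LINE__, msg);

  /* Per-instance state (e.g. a loaded engine) would be created here and
   * attached with TRITONBACKEND_ModelInstanceSetState. */
  return NULL;
}

/* Called for each instance during model unload, before ModelFinalize. */
TRITONSERVER_Error*
TRITONBACKEND_ModelInstanceFinalize(TRITONBACKEND_ModelInstance* instance)
{
  /* Release per-instance state here. */
  (void)instance;
  return NULL;
}
```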
Custom Backend Development
- Implement the C interface in `tritonbackend.h` as a shared library
- Mandatory: `TRITONBACKEND_ModelInstanceExecute`
- Recommended: `Initialize` and `Finalize` callbacks for backend, model, and instance lifecycle
- Install to `/opt/tritonserver/backends/<name>/libtriton_<name>.so`
- Declare in model config: `backend: "<name>"` (see the model-config sketch after this list)
Terminology
- Instance group: model configuration setting defining how many and what type of model instances Triton creates (e.g., 2 GPU instances, 4 CPU instances)
- `runtime` setting: model config field (from Triton 24.01) overriding the default shared library name for a backend
- Decoupled model: model that produces streaming/asynchronous responses; requires a decoupled backend and `ResponseFactory` usage (see the sketch after this list)
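For decoupled models, the following C sketch shows the response-factory pattern under the assumptions that two placeholder responses are streamed per request and that error handling and output population are omitted.

```c
#include <stddef.h>
#include <stdint.h>

#include "triton/core/tritonbackend.h"

/* Decoupled sketch: a factory created per request can emit zero or more
 * responses at any time; a final completion flag must be delivered exactly
 * once per request. */
static void
SendDecoupledResponses(TRITONBACKEND_Request* request)
{
  TRITONBACKEND_ResponseFactory* factory = NULL;
  TRITONBACKEND_ResponseFactoryNew(&factory, request);

  /* The factory outlives the request, so the request can be released as
   * soon as the backend has copied what it needs from it. */
  TRITONBACKEND_RequestRelease(request, TRITONSERVER_REQUEST_RELEASE_ALL);

  /* Placeholder: stream two partial responses. */
  for (int i = 0; i < 2; ++i) {
    TRITONBACKEND_Response* response = NULL;
    TRITONBACKEND_ResponseNewFromFactory(&response, factory);
    /* ... populate output tensors for this partial result ... */
    TRITONBACKEND_ResponseSend(response, 0 /* not final */, NULL /* success */);
  }

  /* Signal completion without a further response payload. */
  TRITONBACKEND_ResponseFactorySendFlags(
      factory, TRITONSERVER_RESPONSE_COMPLETE_FINAL);
  TRITONBACKEND_ResponseFactoryDelete(factory);
}
```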
Connections to Existing Wiki Pages
- Triton Backend (Deployment and Scaling angle) — cross-section covering backend selection and deployment implications
- Optimization — NVIDIA Triton Inference Server — performance optimisation guide for Triton (perf_analyzer, Model Analyzer, concurrency tuning)
- Batchers — NVIDIA Triton Inference Server — dynamic batching and sequence batching that backends receive as pre-formed batches
- Scaling LLMs with Triton and TensorRT-LLM — end-to-end deployment using the TensorRT-LLM backend covered here
- Performance Analysis — TensorRT-LLM — profiling the TensorRT-LLM backend in production