Triton Inference Server Backend
NVIDIA Triton Inference Server documentation — docs.nvidia.com/deeplearning/triton-inference-server
Abstract
A Triton backend is the pluggable execution unit that runs a model within NVIDIA Triton Inference Server. Each backend is a shared library (libtriton_<name>.so) implementing the Triton Backend API, which Triton calls to initialise the model, execute inference requests, and manage its lifecycle. Triton ships official backends for TensorRT, ONNX Runtime, TensorFlow, PyTorch, OpenVINO, Python, DALI, FIL, TensorRT-LLM, and vLLM. Developers can also implement custom backends in C/C++ or Python. The article covers the full backend API, object model (Backend, Model, ModelInstance, Request, Response), lifecycle management, request-response patterns (single response vs. decoupled multi-response), and how to build and install custom backends.
Supported Backends
| Backend | Model formats | Use case |
|---|---|---|
| TensorRT | TensorRT engines | Optimised inference on NVIDIA GPUs |
| ONNX Runtime | ONNX models | Cross-framework portable inference |
| TensorFlow | GraphDef, SavedModel (TF1 + TF2) | TensorFlow model serving |
| PyTorch | TorchScript, PyTorch 2.0 | PyTorch model serving |
| OpenVINO | OpenVINO IR models | CPU/Intel-optimised inference |
| Python | Arbitrary Python logic | Pre/post processing, custom models |
| DALI | DALI pipelines | GPU-accelerated data preprocessing |
| FIL | XGBoost, LightGBM, Scikit-Learn RF | Tree-based ML models |
| TensorRT-LLM | TRT-LLM engines | Optimised LLM inference |
| vLLM | vLLM-supported models | LLM serving via vLLM engine |
Key Concepts
- Backend shared library: each backend is `libtriton_<name>.so`; Triton searches model repo → model dir → global backend dir (`/opt/tritonserver/backends/`) in priority order
- TRITONBACKEND_Backend: singleton shared across all models using a backend; holds backend-level state; `TRITONBACKEND_Initialize`/`Finalize` called once per process
- TRITONBACKEND_Model: one per loaded model, shared across all instances; `TRITONBACKEND_ModelInitialize`/`Finalize` called on load/unload
- TRITONBACKEND_ModelInstance: one per instance group entry; `TRITONBACKEND_ModelInstanceExecute` is the only mandatory function, called by Triton to run inference on a batch (see the sketch after this list)
- TRITONBACKEND_Request / Response: request objects are owned by the backend during execution and must be released via `TRITONBACKEND_RequestRelease`; responses are created via `TRITONBACKEND_ResponseNew`
- Decoupled backend: backend that sends multiple responses per request out-of-order; uses a `ResponseFactory` per request; must send at least one final (completion-flagged) response per request
- Python-based backends: backends implementing the `TritonPythonModel` interface (`execute`, optionally `initialize`/`finalize`); shareable across models; the vLLM backend is an example
- Parallel model instance loading: opt-in backend attribute that enables concurrent `ModelInstanceInitialize` calls, improving startup time for large instance counts (supported by the Python and ONNX Runtime backends)
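As a concrete illustration of the single-response pattern above, here is a minimal C sketch of `TRITONBACKEND_ModelInstanceExecute`. It assumes the header is available as `triton/core/tritonbackend.h` (as in the backend repo layout); input reading and output population are elided, so it only shows the response and release bookkeeping.

```c
#include <stddef.h>
#include <stdint.h>

#include "triton/core/tritonbackend.h"

/* Minimal single-response sketch: one response per request, then release
 * the request. A real backend would read inputs (TRITONBACKEND_RequestInput,
 * TRITONBACKEND_InputBuffer) and allocate/fill outputs
 * (TRITONBACKEND_ResponseOutput, TRITONBACKEND_OutputBuffer) before sending. */
TRITONSERVER_Error*
TRITONBACKEND_ModelInstanceExecute(
    TRITONBACKEND_ModelInstance* instance, TRITONBACKEND_Request** requests,
    const uint32_t request_count)
{
  /* Per-instance state would normally be fetched here via
   * TRITONBACKEND_ModelInstanceState; unused in this sketch. */
  (void)instance;

  for (uint32_t r = 0; r < request_count; ++r) {
    TRITONBACKEND_Request* request = requests[r];

    /* Create the single response for this request. */
    TRITONBACKEND_Response* response = NULL;
    TRITONSERVER_Error* err = TRITONBACKEND_ResponseNew(&response, request);

    if (err == NULL) {
      /* ... run inference and populate output tensors here ... */

      /* Mark the response complete and final; passing a non-NULL error
       * instead would turn it into an error response. */
      TRITONBACKEND_ResponseSend(
          response, TRITONSERVER_RESPONSE_COMPLETE_FINAL, NULL /* success */);
    } else {
      TRITONSERVER_ErrorDelete(err);
    }

    /* The backend owns the request until it explicitly releases it. */
    TRITONBACKEND_RequestRelease(request, TRITONSERVER_REQUEST_RELEASE_ALL);
  }

  /* NULL indicates the batch was accepted and handled. */
  return NULL;
}
```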
Backend Lifecycle
Model load triggered
→ Load backend .so (if not already loaded) → TRITONBACKEND_Initialize
→ TRITONBACKEND_ModelInitialize
→ For each instance: TRITONBACKEND_ModelInstanceInitialize
Inference request arrives
→ TRITONBACKEND_ModelInstanceExecute (batch of requests)
→ backend creates responses, releases requests, returns
Model unload triggered
→ For each instance: TRITONBACKEND_ModelInstanceFinalize
→ TRITONBACKEND_ModelFinalize
→ (backend kept until process exit) → TRITONBACKEND_Finalize
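A hedged C sketch of the initialisation side of this flow follows. The API-version check is the conventional first step when the shared library loads; the log message and the absence of any real per-instance state are placeholders.

```c
#include <stdint.h>
#include <stdio.h>

#include "triton/core/tritonbackend.h"

/* Called once when the backend shared library is first loaded. */
TRITONSERVER_Error*
TRITONBACKEND_Initialize(TRITONBACKEND_Backend* backend)
{
  /* Backend-level state could be attached with TRITONBACKEND_BackendSetState. */
  (void)backend;

  /* Refuse to run against an incompatible Triton core. */
  uint32_t api_major, api_minor;
  TRITONSERVER_Error* err = TRITONBACKEND_ApiVersion(&api_major, &api_minor);
  if (err != NULL) {
    return err;
  }
  if ((api_major != TRITONBACKEND_API_VERSION_MAJOR) ||
      (api_minor < TRITONBACKEND_API_VERSION_MINOR)) {
    return TRITONSERVER_ErrorNew(
        TRITONSERVER_ERROR_UNSUPPORTED,
        "Triton backend API version does not support this backend");
  }
  return NULL; /* success */
}

/* Called once per instance_group entry before any requests arrive. */
TRITONSERVER_Error*
TRITONBACKEND_ModelInstanceInitialize(TRITONBACKEND_ModelInstance* instance)
{
  const char* name = NULL;
  TRITONBACKEND_ModelInstanceName(instance, &name);

  char msg[256];
  snprintf(msg, sizeof(msg), "initialising instance %s", name);
  TRITONSERVER_LogMessage(TRITONSERVER_LOG_INFO, __FILE__, __LINE__, msg);

  /* Per-instance state (e.g. a loaded engine) would be created here and
   * attached with TRITONBACKEND_ModelInstanceSetState. */
  return NULL;
}

/* Called for each instance during model unload, before ModelFinalize. */
TRITONSERVER_Error*
TRITONBACKEND_ModelInstanceFinalize(TRITONBACKEND_ModelInstance* instance)
{
  /* Release per-instance state here. */
  (void)instance;
  return NULL;
}
```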
Custom Backend Development
- Implement the C interface in `tritonbackend.h` as a shared library
- Mandatory: `TRITONBACKEND_ModelInstanceExecute`
- Recommended: `Initialize` and `Finalize` callbacks for backend, model, and instance lifecycle
- Install to `/opt/tritonserver/backends/<name>/libtriton_<name>.so`
- Declare in model config: `backend: "<name>"` (see the model-config sketch after this list)
Terminology
- Instance group: model configuration setting defining how many and what type of model instances Triton creates (e.g., 2 GPU instances, 4 CPU instances)
- `runtime` setting: model config field (from Triton 24.01) overriding the default shared library name for a backend
- Decoupled model: model that produces streaming/asynchronous responses; requires a decoupled backend and `ResponseFactory` usage (see the sketch after this list)
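For decoupled models, the following C sketch shows the response-factory pattern under the assumptions that two placeholder responses are streamed per request and that error handling and output population are omitted.

```c
#include <stddef.h>
#include <stdint.h>

#include "triton/core/tritonbackend.h"

/* Decoupled sketch: a factory created per request can emit zero or more
 * responses at any time; a final completion flag must be delivered exactly
 * once per request. */
static void
SendDecoupledResponses(TRITONBACKEND_Request* request)
{
  TRITONBACKEND_ResponseFactory* factory = NULL;
  TRITONBACKEND_ResponseFactoryNew(&factory, request);

  /* The factory outlives the request, so the request can be released as
   * soon as the backend has copied what it needs from it. */
  TRITONBACKEND_RequestRelease(request, TRITONSERVER_REQUEST_RELEASE_ALL);

  /* Placeholder: stream two partial responses. */
  for (int i = 0; i < 2; ++i) {
    TRITONBACKEND_Response* response = NULL;
    TRITONBACKEND_ResponseNewFromFactory(&response, factory);
    /* ... populate output tensors for this partial result ... */
    TRITONBACKEND_ResponseSend(response, 0 /* not final */, NULL /* success */);
  }

  /* Signal completion without a further response payload. */
  TRITONBACKEND_ResponseFactorySendFlags(
      factory, TRITONSERVER_RESPONSE_COMPLETE_FINAL);
  TRITONBACKEND_ResponseFactoryDelete(factory);
}
```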
Connections to Existing Wiki Pages
- Triton Backend (Deployment and Scaling angle) — cross-section covering backend selection and deployment implications
- Optimization — NVIDIA Triton Inference Server — performance optimisation guide for Triton (perf_analyzer, Model Analyzer, concurrency tuning)
- Batchers — NVIDIA Triton Inference Server — dynamic batching and sequence batching that backends receive as pre-formed batches
- Scaling LLMs with Triton and TensorRT-LLM — end-to-end deployment using the TensorRT-LLM backend covered here
- Performance Analysis — TensorRT-LLM — profiling the TensorRT-LLM backend in production