Triton Inference Server Backend

NVIDIA Triton Inference Server documentation — docs.nvidia.com/deeplearning/triton-inference-server

Abstract

A Triton backend is the pluggable execution unit that runs a model within NVIDIA Triton Inference Server. Each backend is a shared library (libtriton_<name>.so) implementing the Triton Backend API, which Triton calls to initialise the model, execute inference requests, and manage its lifecycle. Triton ships official backends for TensorRT, ONNX Runtime, TensorFlow, PyTorch, OpenVINO, Python, DALI, FIL, TensorRT-LLM, and vLLM. Developers can also implement custom backends in C/C++ or Python. The article covers the full backend API, the object model (Backend, Model, ModelInstance, Request, Response), lifecycle management, request-response patterns (single response vs. decoupled multi-response), and how to build and install custom backends.

Supported Backends

Backend      | Model formats                      | Use case
TensorRT     | TensorRT engines                   | Optimised inference on NVIDIA GPUs
ONNX Runtime | ONNX models                        | Cross-framework portable inference
TensorFlow   | GraphDef, SavedModel (TF1 + TF2)   | TensorFlow model serving
PyTorch      | TorchScript, PyTorch 2.0           | PyTorch model serving
OpenVINO     | OpenVINO IR models                 | CPU/Intel-optimised inference
Python       | Arbitrary Python logic             | Pre/post processing, custom models
DALI         | DALI pipelines                     | GPU-accelerated data preprocessing
FIL          | XGBoost, LightGBM, Scikit-Learn RF | Tree-based ML models
TensorRT-LLM | TRT-LLM engines                    | Optimised LLM inference
vLLM         | vLLM-supported models              | LLM serving via vLLM engine

Key Concepts

  • Backend shared library: each backend is libtriton_<name>.so; Triton searches model repo → model dir → global backend dir (/opt/tritonserver/backends/) in priority order
  • TRITONBACKEND_Backend: singleton shared across all models using a backend; holds backend-level state; TRITONBACKEND_Initialize / Finalize called once per process
  • TRITONBACKEND_Model: one per loaded model, shared across all instances; TRITONBACKEND_ModelInitialize / Finalize called on load/unload
  • TRITONBACKEND_ModelInstance: one per instance group entry; TRITONBACKEND_ModelInstanceExecute is the only mandatory function, called by Triton to run inference on a batch of requests (a minimal sketch follows this list)
  • TRITONBACKEND_Request / Response: request objects are owned by the backend during execution and must be released via TRITONBACKEND_RequestRelease; responses created via TRITONBACKEND_ResponseNew
  • Decoupled backend: backend that may send zero, one, or many responses per request, potentially out of order; uses a ResponseFactory per request; must send exactly one final (completion-flagged) notification per request, either with the last response or on its own (see the decoupled sketch below)
  • Python-based backends: backends implementing the TritonPythonModel interface (execute, optionally initialize / finalize); shareable across models; vLLM backend is an example
  • Parallel model instance loading: opt-in backend attribute that enables concurrent ModelInstanceInitialize calls, improving startup time for large instance counts (supported by Python and ONNX Runtime backends)
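
To make the execute/request/response contract above concrete, here is a minimal C++ sketch of the one mandatory entry point. It is a hedged illustration, not the official implementation: it creates one response per request, sends it with the final completion flag, and releases the request back to Triton, whereas a real backend would also read input tensors, run the model, and write output tensors. The TRITONBACKEND_*/TRITONSERVER_* names are from the Backend API; the include path may vary with the build setup.

  #include "triton/core/tritonbackend.h"  // Triton Backend API header

  // The only function a backend must implement. Triton hands the model
  // instance a batch of requests; the backend must respond to and release
  // every one of them.
  extern "C" TRITONSERVER_Error*
  TRITONBACKEND_ModelInstanceExecute(
      TRITONBACKEND_ModelInstance* instance, TRITONBACKEND_Request** requests,
      const uint32_t request_count)
  {
    (void)instance;  // per-instance state would normally be fetched here

    for (uint32_t r = 0; r < request_count; ++r) {
      TRITONBACKEND_Request* request = requests[r];

      // Create a response object tied to this request.
      TRITONBACKEND_Response* response = nullptr;
      TRITONSERVER_Error* err = TRITONBACKEND_ResponseNew(&response, request);

      if (err == nullptr) {
        // A real backend would read inputs here (TRITONBACKEND_RequestInput,
        // TRITONBACKEND_InputBuffer), run inference, and fill outputs
        // (TRITONBACKEND_ResponseOutput, TRITONBACKEND_OutputBuffer) before
        // sending. Passing nullptr as the error argument signals success.
        TRITONBACKEND_ResponseSend(
            response, TRITONSERVER_RESPONSE_COMPLETE_FINAL, nullptr);
      } else {
        TRITONSERVER_ErrorDelete(err);
      }

      // The backend owns each request until it hands it back to Triton.
      TRITONBACKEND_RequestRelease(request, TRITONSERVER_REQUEST_RELEASE_ALL);
    }
    return nullptr;  // nullptr means the execute call itself succeeded
  }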

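In the decoupled case the flow changes: responses come from a per-request response factory rather than directly from the request, so the request can be released before (or long after) its responses are sent. Below is a hedged sketch of that flow for a single request; the helper function name is hypothetical and error handling is omitted.

  #include "triton/core/tritonbackend.h"

  // Decoupled handling of one request (sketch, not production code).
  static void
  HandleDecoupledRequest(TRITONBACKEND_Request* request)
  {
    TRITONBACKEND_ResponseFactory* factory = nullptr;
    TRITONBACKEND_ResponseFactoryNew(&factory, request);

    // The request may be released as soon as its inputs have been consumed;
    // the factory outlives it.
    TRITONBACKEND_RequestRelease(request, TRITONSERVER_REQUEST_RELEASE_ALL);

    // Emit any number of responses, possibly from another thread and out of
    // order with respect to other requests.
    TRITONBACKEND_Response* response = nullptr;
    TRITONBACKEND_ResponseNewFromFactory(&response, factory);
    TRITONBACKEND_ResponseSend(response, 0 /* not final yet */, nullptr);

    // Exactly one final completion flag closes the stream for this request;
    // it can be sent on its own, with no response body.
    TRITONBACKEND_ResponseFactorySendFlags(
        factory, TRITONSERVER_RESPONSE_COMPLETE_FINAL);
    TRITONBACKEND_ResponseFactoryDelete(factory);
  }
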
Backend Lifecycle

Model load triggered
  → Load backend .so (if not already loaded) → TRITONBACKEND_Initialize
  → TRITONBACKEND_ModelInitialize
  → For each instance: TRITONBACKEND_ModelInstanceInitialize

Inference request arrives
  → TRITONBACKEND_ModelInstanceExecute (batch of requests)
  → backend creates responses, releases requests, returns

Model unload triggered
  → For each instance: TRITONBACKEND_ModelInstanceFinalize
  → TRITONBACKEND_ModelFinalize
  → (backend kept until process exit) → TRITONBACKEND_Finalize
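
The initialize/finalize pairs above are where a backend attaches and later frees its own state. Below is a hedged sketch of the model-level pair, assuming a hypothetical ModelState struct; TRITONBACKEND_ModelSetState and TRITONBACKEND_ModelState are the API calls used to associate backend-owned state with a Triton object.

  #include "triton/core/tritonbackend.h"

  // Hypothetical backend-owned state created when the model is loaded.
  struct ModelState {
    // e.g. parsed model configuration, loaded weights, framework handles
  };

  extern "C" TRITONSERVER_Error*
  TRITONBACKEND_ModelInitialize(TRITONBACKEND_Model* model)
  {
    ModelState* state = new ModelState();
    TRITONSERVER_Error* err = TRITONBACKEND_ModelSetState(model, state);
    if (err != nullptr) {
      delete state;  // Triton never saw the state, so clean it up here
    }
    return err;      // nullptr on success
  }

  extern "C" TRITONSERVER_Error*
  TRITONBACKEND_ModelFinalize(TRITONBACKEND_Model* model)
  {
    void* vstate = nullptr;
    TRITONSERVER_Error* err = TRITONBACKEND_ModelState(model, &vstate);
    if (err == nullptr) {
      delete static_cast<ModelState*>(vstate);  // mirror of ModelInitialize
    }
    return err;
  }

The instance-level pair follows the same pattern with TRITONBACKEND_ModelInstanceSetState and TRITONBACKEND_ModelInstanceState.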

Custom Backend Development

  1. Implement the C interface in tritonbackend.h as a shared library
  2. Mandatory: TRITONBACKEND_ModelInstanceExecute
  3. Recommended: Initialize, Finalize callbacks for backend, model, and instance lifecycle
  4. Install to /opt/tritonserver/backends/<name>/libtriton_<name>.so
  5. Declare in model config: backend: "<name>" (see the layout and config example after this list)
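
As an illustration of steps 4 and 5, here is a hedged sketch of the on-disk layout and model configuration for a custom backend; "mybackend", "my_model", and the tensor names, types, and shapes are placeholder values, not names from the Triton documentation.

  /opt/tritonserver/backends/
    mybackend/
      libtriton_mybackend.so

  model_repository/
    my_model/
      config.pbtxt
      1/              (version directory; backend-specific model artifacts, if any)

  # config.pbtxt
  name: "my_model"
  backend: "mybackend"
  max_batch_size: 8
  input [
    {
      name: "INPUT0"
      data_type: TYPE_FP32
      dims: [ 16 ]
    }
  ]
  output [
    {
      name: "OUTPUT0"
      data_type: TYPE_FP32
      dims: [ 16 ]
    }
  ]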

Terminology

  • Instance group: model configuration setting defining how many model instances Triton creates and of what kind (e.g., 2 GPU instances, 4 CPU instances; see the snippet after this list)
  • runtime setting: model config field (from Triton 24.01) overriding the default shared library name for a backend
  • Decoupled model: model that produces streaming/asynchronous responses; requires decoupled backend and ResponseFactory usage
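
For example, the 2-GPU/4-CPU instance group mentioned above would appear in config.pbtxt roughly as follows (counts and kinds are placeholders):

  instance_group [
    {
      count: 2
      kind: KIND_GPU
    },
    {
      count: 4
      kind: KIND_CPU
    }
  ]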

Connections to Existing Wiki Pages