Cross-section page — full summary at Scaling LLMs with Triton and TensorRT-LLM Using Kubernetes.

NVIDIA Platform Angle

This article describes the canonical NVIDIA production LLM serving stack: TensorRT-LLM (engine optimization) + NVIDIA Triton Inference Server (now NVIDIA Dynamo Triton) + NVIDIA DCGM Exporter (GPU metrics) + NGC (container registry). This page covers the NVIDIA-specific components and how they fit together to form the platform.

NVIDIA Dynamo Triton

As of March 2025, NVIDIA Triton Inference Server is part of the NVIDIA Dynamo Platform and has been renamed NVIDIA Dynamo Triton. The article (written October 2024) refers to it as Triton Inference Server. Key platform characteristics:

  • Supports multiple backends, including TensorRT, TensorRT-LLM, TensorFlow, PyTorch, and ONNX Runtime
  • Provides HTTP (8000), gRPC (8001), and Prometheus-compatible metrics (8002) endpoints by default
  • Available as NGC container: nvcr.io/nvidia/tritonserver:<tag>-trtllm-python-py3
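
A minimal sketch of how these endpoints surface in a Deployment's pod spec (image tag, container name, and resource values below are illustrative placeholders, not taken from the article):

    # Container fragment of a Triton / Dynamo Triton Deployment (illustrative only)
    containers:
      - name: triton
        image: nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3   # assumed example tag
        command: ["tritonserver"]
        args: ["--model-repository=/models"]
        ports:
          - { name: http,    containerPort: 8000 }
          - { name: grpc,    containerPort: 8001 }
          - { name: metrics, containerPort: 8002 }   # scraped by Prometheus via a PodMonitor
        resources:
          limits:
            nvidia.com/gpu: 1        # requires the NVIDIA device plugin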

TensorRT-LLM — NVIDIA Inference Optimization

TensorRT-LLM optimizes LLM inference via:

  • Kernel fusion: Reduces kernel launch overhead by merging operations
  • Quantization: INT8/FP8 weight and activation quantization
  • In-flight batching: Continuous batching of requests without waiting for a full batch
  • Paged KV-cache: Dynamic memory management for attention key-value caches
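
These optimizations are fixed at engine-build time. A hedged sketch of running that build step as a Kubernetes Job inside the NGC container (the Job name, paths, image tag, and exact trtllm-build flags are assumptions; flag names change between TensorRT-LLM releases):

    # Illustrative engine-build Job; assumes the Hugging Face checkpoint has already
    # been converted to TensorRT-LLM checkpoint format under /models/ckpt
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: trtllm-engine-build            # hypothetical name
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: build
              image: nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3   # assumed example tag
              command: ["trtllm-build"]
              args:
                - --checkpoint_dir=/models/ckpt
                - --output_dir=/models/engines   # host-local engine store reused by serving Pods (see below)
                - --gemm_plugin=auto             # plugin / quantization flags vary by release
                - --max_batch_size=64
              resources:
                limits:
                  nvidia.com/gpu: 1
              volumeMounts:
                - { name: model-store, mountPath: /models }
          volumes:
            - name: model-store
              hostPath: { path: /opt/models, type: DirectoryOrCreate }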

The resulting engine files are TensorRT binaries tied to a specific GPU architecture and TensorRT-LLM version; they are generated once per GPU type and model configuration (tensor/pipeline parallelism degree) and reused across Pods scheduled on the same host node.
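
A sketch of how serving Pods can pick up those host-local engine files (the volume path and GPU-type node label are assumptions; the label shown is the one set by GPU Feature Discovery):

    # Pod-spec fragment of the Triton Deployment (illustrative only)
    spec:
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB   # pin Pods to the GPU type the engine was built for
      containers:
        - name: triton
          volumeMounts:
            - name: engine-store
              mountPath: /models
              readOnly: true
      volumes:
        - name: engine-store
          hostPath:
            path: /opt/models        # same host directory the build step wrote to
            type: Directory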

NVIDIA DCGM Exporter

The NVIDIA Data Center GPU Manager (DCGM) Exporter is a required component for GPU-aware Kubernetes deployments. It:

  • Exports GPU health metrics (temperature, utilization, memory, ECC errors) to Prometheus
  • Works alongside the NVIDIA device plugin and GPU Feature Discovery to provide full GPU observability
  • Enables Grafana dashboards showing GPU utilization alongside the application-level queue-to-compute ratio
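
A hedged sketch of wiring those metrics into Prometheus, assuming the Prometheus Operator CRDs are installed and dcgm-exporter runs with its defaults (names and labels below depend on how the exporter was installed, e.g. via the GPU Operator or its Helm chart):

    # Illustrative ServiceMonitor for dcgm-exporter
    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: dcgm-exporter                  # hypothetical name
    spec:
      selector:
        matchLabels:
          app.kubernetes.io/name: dcgm-exporter   # assumed service label
      endpoints:
        - port: metrics                    # dcgm-exporter serves /metrics on 9400 by default
          path: /metrics
          interval: 15s
    # Example series exposed: DCGM_FI_DEV_GPU_UTIL, DCGM_FI_DEV_FB_USED, DCGM_FI_DEV_GPU_TEMP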

NGC Container Registry

All official NVIDIA Triton + TensorRT-LLM images are distributed through NGC (NVIDIA GPU Cloud). Deployment requires:

  • An NGC API key for image pulls
  • A Kubernetes docker-registry secret storing the API key
  • The base image tag specifying Triton version + TensorRT-LLM backend
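
A minimal sketch of that pull secret and how it is referenced (the secret name is an assumption; the NGC username is always the literal string $oauthtoken and the password is the NGC API key):

    # Illustrative image-pull secret for nvcr.io; equivalent to
    #   kubectl create secret docker-registry ngc-secret --docker-server=nvcr.io \
    #     --docker-username='$oauthtoken' --docker-password=<NGC_API_KEY>
    apiVersion: v1
    kind: Secret
    metadata:
      name: ngc-secret                     # hypothetical name
    type: kubernetes.io/dockerconfigjson
    stringData:
      .dockerconfigjson: |
        {"auths": {"nvcr.io": {"username": "$oauthtoken", "password": "<NGC_API_KEY>"}}}
    # Referenced from the Triton pod spec via:
    #   imagePullSecrets:
    #     - name: ngc-secret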

Full NVIDIA Platform Stack (Production LLM Serving)

Hugging Face checkpoint
    ↓ TensorRT-LLM engine build (NGC container)
TensorRT engine files (host node)
    ↓ Triton/Dynamo Triton (Kubernetes Pod, NGC image)
Kubernetes Deployment + Service
    ↓ HPA ← Prometheus ← PodMonitor (port 8002)
DCGM Exporter → GPU metrics
    ↓ Grafana dashboard
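
A hedged sketch of the HPA step in this chain, assuming a Prometheus adapter exposes a per-Pod custom metric derived from Triton's queue and compute times (the metric name, target value, and Deployment name are illustrative, not the article's exact configuration):

    # Illustrative HPA scaling the Triton Deployment on a queue-to-compute ratio
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: triton-hpa                     # hypothetical name
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: triton                       # assumed Deployment name
      minReplicas: 1
      maxReplicas: 4
      metrics:
        - type: Pods
          pods:
            metric:
              name: triton_queue_compute_ratio   # custom metric served by prometheus-adapter (assumed name)
            target:
              type: AverageValue
              averageValue: "1"            # scale out once queue time exceeds compute time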

Connections to NVIDIA Platform Stack