Cross-section page — full summary at Scaling LLMs with Triton and TensorRT-LLM Using Kubernetes.

NVIDIA Platform Angle

This article describes the canonical NVIDIA production LLM serving stack: TensorRT-LLM (engine optimization) + NVIDIA Triton Inference Server (now NVIDIA Dynamo Triton) + NVIDIA DCGM Exporter (GPU metrics) + NGC (container registry). This page covers the NVIDIA-specific components and how they fit together to form the platform.

NVIDIA Dynamo Triton

As of March 2025, NVIDIA Triton Inference Server is part of the NVIDIA Dynamo Platform and has been renamed NVIDIA Dynamo Triton. The article (written October 2024) refers to it as Triton Inference Server. Key platform characteristics:

  • Supports multiple backends, including TensorRT, TensorRT-LLM, TensorFlow, PyTorch, and ONNX Runtime
  • Provides HTTP (8000), gRPC (8001), and Prometheus-compatible metrics (8002) endpoints by default
  • Available as NGC container: nvcr.io/nvidia/tritonserver:<tag>-trtllm-python-py3
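
A minimal sketch of how these endpoints surface in a Deployment's pod spec (image tag, container name, and resource values below are illustrative placeholders, not taken from the article):

    # Container fragment of a Triton / Dynamo Triton Deployment (illustrative only)
    containers:
      - name: triton
        image: nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3   # assumed example tag
        command: ["tritonserver"]
        args: ["--model-repository=/models"]
        ports:
          - { name: http,    containerPort: 8000 }
          - { name: grpc,    containerPort: 8001 }
          - { name: metrics, containerPort: 8002 }   # scraped by Prometheus via a PodMonitor
        resources:
          limits:
            nvidia.com/gpu: 1        # requires the NVIDIA device plugin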

TensorRT-LLM — NVIDIA Inference Optimization

TensorRT-LLM optimizes LLM inference via:

  • Kernel fusion: Reduces kernel launch overhead by merging operations
  • Quantization: INT8/FP8 weight and activation quantization
  • In-flight batching: Continuous batching of requests without waiting for a full batch
  • Paged KV-cache: Dynamic memory management for attention key-value caches
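
These optimizations are fixed at engine-build time. A hedged sketch of running that build step as a Kubernetes Job inside the NGC container (the Job name, paths, image tag, and exact trtllm-build flags are assumptions; flag names change between TensorRT-LLM releases):

    # Illustrative engine-build Job; assumes the Hugging Face checkpoint has already
    # been converted to TensorRT-LLM checkpoint format under /models/ckpt
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: trtllm-engine-build            # hypothetical name
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: build
              image: nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3   # assumed example tag
              command: ["trtllm-build"]
              args:
                - --checkpoint_dir=/models/ckpt
                - --output_dir=/models/engines   # host-local engine store reused by serving Pods (see below)
                - --gemm_plugin=auto             # plugin / quantization flags vary by release
                - --max_batch_size=64
              resources:
                limits:
                  nvidia.com/gpu: 1
              volumeMounts:
                - { name: model-store, mountPath: /models }
          volumes:
            - name: model-store
              hostPath: { path: /opt/models, type: DirectoryOrCreate }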

The resulting engine files are TensorRT binaries tied to a specific GPU architecture and TensorRT-LLM version; they are generated once per GPU type and model configuration (tensor/pipeline parallelism degree) and reused across Pods scheduled on the same host node.
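
A sketch of how serving Pods can pick up those host-local engine files (the volume path and GPU-type node label are assumptions; the label shown is the one set by GPU Feature Discovery):

    # Pod-spec fragment of the Triton Deployment (illustrative only)
    spec:
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB   # pin Pods to the GPU type the engine was built for
      containers:
        - name: triton
          volumeMounts:
            - name: engine-store
              mountPath: /models
              readOnly: true
      volumes:
        - name: engine-store
          hostPath:
            path: /opt/models        # same host directory the build step wrote to
            type: Directory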

NVIDIA DCGM Exporter

The NVIDIA Data Center GPU Manager (DCGM) Exporter is a required component for GPU-aware Kubernetes deployments. It:

  • Exports GPU health metrics (temperature, utilization, memory, ECC errors) to Prometheus
  • Works alongside the NVIDIA device plugin and GPU Feature Discovery to provide full GPU observability
  • Enables Grafana dashboards showing GPU utilization alongside the application-level queue-to-compute ratio
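
A hedged sketch of wiring those metrics into Prometheus, assuming the Prometheus Operator CRDs are installed and dcgm-exporter runs with its defaults (names and labels below depend on how the exporter was installed, e.g. via the GPU Operator or its Helm chart):

    # Illustrative ServiceMonitor for dcgm-exporter
    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: dcgm-exporter                  # hypothetical name
    spec:
      selector:
        matchLabels:
          app.kubernetes.io/name: dcgm-exporter   # assumed service label
      endpoints:
        - port: metrics                    # dcgm-exporter serves /metrics on 9400 by default
          path: /metrics
          interval: 15s
    # Example series exposed: DCGM_FI_DEV_GPU_UTIL, DCGM_FI_DEV_FB_USED, DCGM_FI_DEV_GPU_TEMP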

NGC Container Registry

All official NVIDIA Triton + TensorRT-LLM images are distributed through NGC (NVIDIA GPU Cloud). Deployment requires:

  • An NGC API key for image pulls
  • A Kubernetes docker-registry secret storing the API key
  • The base image tag specifying Triton version + TensorRT-LLM backend
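
A minimal sketch of that pull secret and how it is referenced (the secret name is an assumption; the NGC username is always the literal string $oauthtoken and the password is the NGC API key):

    # Illustrative image-pull secret for nvcr.io; equivalent to
    #   kubectl create secret docker-registry ngc-secret --docker-server=nvcr.io \
    #     --docker-username='$oauthtoken' --docker-password=<NGC_API_KEY>
    apiVersion: v1
    kind: Secret
    metadata:
      name: ngc-secret                     # hypothetical name
    type: kubernetes.io/dockerconfigjson
    stringData:
      .dockerconfigjson: |
        {"auths": {"nvcr.io": {"username": "$oauthtoken", "password": "<NGC_API_KEY>"}}}
    # Referenced from the Triton pod spec via:
    #   imagePullSecrets:
    #     - name: ngc-secret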

Full NVIDIA Platform Stack (Production LLM Serving)

Hugging Face checkpoint
    ↓ TensorRT-LLM engine build (NGC container)
TensorRT engine files (host node)
    ↓ Triton/Dynamo Triton (Kubernetes Pod, NGC image)
Kubernetes Deployment + Service
    ↓ HPA ← Prometheus ← PodMonitor (port 8002)
DCGM Exporter → GPU metrics
    ↓ Grafana dashboard
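
A hedged sketch of the HPA step in this chain, assuming a Prometheus adapter exposes a per-Pod custom metric derived from Triton's queue and compute times (the metric name, target value, and Deployment name are illustrative, not the article's exact configuration):

    # Illustrative HPA scaling the Triton Deployment on a queue-to-compute ratio
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: triton-hpa                     # hypothetical name
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: triton                       # assumed Deployment name
      minReplicas: 1
      maxReplicas: 4
      metrics:
        - type: Pods
          pods:
            metric:
              name: triton_queue_compute_ratio   # custom metric served by prometheus-adapter (assumed name)
            target:
              type: AverageValue
              averageValue: "1"            # scale out once queue time exceeds compute time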

Connections to NVIDIA Platform Stack