Cross-section page — full summary at Scaling LLMs with Triton and TensorRT-LLM Using Kubernetes.
NVIDIA Platform Angle
This article describes the canonical NVIDIA production LLM serving stack: TensorRT-LLM for engine optimization, NVIDIA Triton Inference Server (now NVIDIA Dynamo Triton) for serving, NVIDIA DCGM Exporter for GPU metrics, and NGC as the container registry. This page covers the NVIDIA-specific components and how they fit together.
NVIDIA Dynamo Triton
As of March 2025, NVIDIA Triton Inference Server is part of the NVIDIA Dynamo Platform and has been renamed NVIDIA Dynamo Triton. The article (written October 2024) refers to it as Triton Inference Server. Key platform characteristics:
- Supports TensorRT, TensorFlow, PyTorch, and ONNX backends
- Exposes HTTP (8000), gRPC (8001), and Prometheus-compatible metrics (8002) endpoints
- Available as an NGC container (a Deployment sketch follows the image name below):
nvcr.io/nvidia/tritonserver:<tag>-trtllm-python-py3
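A minimal sketch of how these endpoints surface in a Kubernetes Deployment, assuming an illustrative image tag, label set, and model-repository path (none of these come from the article):

```yaml
# Minimal Deployment sketch; tag, labels, and paths are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton
spec:
  replicas: 1
  selector:
    matchLabels:
      app: triton
  template:
    metadata:
      labels:
        app: triton
    spec:
      containers:
        - name: triton
          # Hypothetical tag; substitute a real <tag> from NGC
          image: nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3
          # The model repository (holding the TensorRT-LLM engines) is
          # assumed to be mounted at /models; the mount is omitted here
          args: ["tritonserver", "--model-repository=/models"]
          ports:
            - containerPort: 8000   # HTTP inference endpoint
              name: http
            - containerPort: 8001   # gRPC inference endpoint
              name: grpc
            - containerPort: 8002   # Prometheus-compatible metrics
              name: metrics
          resources:
            limits:
              nvidia.com/gpu: 1     # scheduled via the NVIDIA device plugin
```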
TensorRT-LLM — NVIDIA Inference Optimization
TensorRT-LLM optimizes LLM inference via:
- Kernel fusion: Reduces kernel launch overhead by merging operations
- Quantization: INT8/FP8 weight and activation quantization
- In-flight batching: Continuous batching of requests without waiting for a full batch
- Paged KV-cache: Dynamic memory management for attention key-value caches
The resulting engine files are TensorRT-specific binaries; they are built once per GPU type and model configuration (tensor-parallel/pipeline-parallel layout) and reused across Pods scheduled on the same host node.
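One way to run that per-node build, sketched as a Kubernetes Job that writes engines to a hostPath so later Triton Pods on the same node can mount them. The Job name, paths, image tag, and GPU-product label value are assumptions, and the model-specific checkpoint conversion that precedes trtllm-build is omitted:

```yaml
# Illustrative one-off engine build pinned to one GPU type; all names and
# paths are assumptions, and checkpoint conversion (HF -> TensorRT-LLM
# format) must happen before this Job runs.
apiVersion: batch/v1
kind: Job
metadata:
  name: trtllm-engine-build
spec:
  template:
    spec:
      restartPolicy: Never
      nodeSelector:
        # Label published by GPU Feature Discovery; pins builds per GPU type
        nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB
      containers:
        - name: build
          image: nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3  # hypothetical tag
          command:
            - trtllm-build
            - --checkpoint_dir=/checkpoints/model   # converted checkpoint (assumed path)
            - --output_dir=/engines/model           # engines land on the host node
            - --gemm_plugin=float16
          resources:
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: engines
              mountPath: /engines
            - name: checkpoints
              mountPath: /checkpoints
      volumes:
        - name: engines
          hostPath:
            path: /var/lib/trtllm/engines       # reused by Pods on this node
            type: DirectoryOrCreate
        - name: checkpoints
          hostPath:
            path: /var/lib/trtllm/checkpoints
```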
NVIDIA DCGM Exporter
The NVIDIA Data Center GPU Manager (DCGM) Exporter supplies the GPU-observability layer of this stack. It:
- Exports GPU health metrics (temperature, utilization, memory, ECC errors) to Prometheus
- Works alongside the NVIDIA device plugin and GPU Feature Discovery to provide full GPU observability
- Enables Grafana dashboards showing GPU utilization alongside the application-level queue-to-compute ratio (a recording-rule sketch follows this list)
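A hedged sketch of that ratio as a Prometheus recording rule: the Triton counters nv_inference_queue_duration_us and nv_inference_compute_infer_duration_us are standard metric names, while the rule, group, and object names here are assumptions:

```yaml
# Recording rule sketch; Triton metric names are standard, object names are not.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: triton-queue-compute      # hypothetical name
spec:
  groups:
    - name: triton.rules
      rules:
        - record: triton:queue_compute_ratio
          # Time requests spend queued relative to time spent in compute;
          # a rising ratio means Pods cannot keep up with arrivals
          expr: |
            rate(nv_inference_queue_duration_us[1m])
              / rate(nv_inference_compute_infer_duration_us[1m])
# A Grafana panel can chart this alongside DCGM_FI_DEV_GPU_UTIL from the
# DCGM Exporter.
```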
NGC Container Registry
All official NVIDIA Triton + TensorRT-LLM images are distributed through NGC (NVIDIA GPU Cloud). Deployment requires:
- An NGC API key for image pulls
- A Kubernetes docker-registry secret storing the API key (wiring sketched below)
- A base image tag specifying the Triton version and the TensorRT-LLM backend
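The pull-secret wiring, under assumed names; the kubectl command in the comment is the standard way to create such a secret, and $oauthtoken is the literal username NGC expects:

```yaml
# The secret itself is typically created imperatively:
#   kubectl create secret docker-registry ngc-registry \
#     --docker-server=nvcr.io \
#     --docker-username='$oauthtoken' \
#     --docker-password=<NGC_API_KEY>
# Pod spec excerpt referencing it ("ngc-registry" is an assumed name):
spec:
  imagePullSecrets:
    - name: ngc-registry
  containers:
    - name: triton
      image: nvcr.io/nvidia/tritonserver:<tag>-trtllm-python-py3
```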
Full NVIDIA Platform Stack (Production LLM Serving)
Hugging Face checkpoint
    ↓ TensorRT-LLM engine build (NGC container)
TensorRT engine files (host node)
    ↓ loaded by Triton / Dynamo Triton (Kubernetes Pod, NGC image)
Kubernetes Deployment + Service
    ↓ scaled by HPA ← Prometheus ← PodMonitor (Triton metrics, port 8002)
    ↓               Prometheus ← DCGM Exporter (GPU metrics)
Grafana dashboard (Triton + GPU metrics from Prometheus)
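The monitoring and autoscaling legs of this diagram, sketched with assumed names and thresholds; exposing the recorded ratio to the HPA additionally requires a custom-metrics adapter such as prometheus-adapter, omitted here:

```yaml
# PodMonitor scrapes Triton's metrics port; the HPA scales the Deployment
# on the recorded queue-to-compute ratio. Names/thresholds are illustrative.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: triton
spec:
  selector:
    matchLabels:
      app: triton
  podMetricsEndpoints:
    - port: metrics              # containerPort 8002 on the Triton Pod
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: triton
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: triton
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Pods
      pods:
        metric:
          # The recorded ratio as exposed by the metrics adapter; the
          # exposed name depends on the adapter's naming rules
          name: queue_compute_ratio
        target:
          type: AverageValue
          averageValue: "1"      # scale out when queue time exceeds compute time
```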
Connections to NVIDIA Platform Stack
- Performance Analysis — TensorRT LLM (NVIDIA Platform) — Nsight Systems profiling of the TensorRT-LLM layer for deeper optimization once the deployment is running.
- NVIDIA Nsight Systems (NVIDIA Platform) — system-wide profiler that sits above the DCGM/Triton metrics layer for deeper GPU timeline analysis.
- DGX Cloud Benchmarking (NVIDIA Platform) — benchmarking the training configuration that determines which TensorRT-LLM engine parameters to deploy.