Abstract
Authored by Maggie Zhang (NVIDIA Developer Blog, October 2024; note: NVIDIA Triton Inference Server was renamed NVIDIA Dynamo Triton as of March 2025), this article provides a step-by-step walkthrough for deploying LLMs optimized with TensorRT-LLM using Triton Inference Server and autoscaling the deployment with Kubernetes HPA. The pipeline has three phases: (1) build TensorRT-LLM engines from Hugging Face checkpoints, configuring tensor parallelism (TP) and pipeline parallelism (PP) based on GPU count; (2) deploy via a Helm chart that creates a Kubernetes Deployment, Service, and PodMonitor for Triton; (3) autoscale using an HPA driven by a custom Prometheus metric, the queue-to-compute ratio, scraped from Triton's metrics port (8002) every 6 seconds. The HPA scales between 1 and 4 replicas depending on whether the ratio exceeds 1,000m (i.e., queue time > compute time). Load balancing across replicas uses Traefik or NGINX Plus (Layer 7) or cloud load balancers.
Key Concepts
- TensorRT-LLM engine building: Converts Hugging Face model checkpoints to optimized TensorRT engines. Optimizations include kernel fusion, quantization, in-flight batching, and paged KV-cache. GPU count requirement = TP × PP.
- Tensor parallelism (TP) / Pipeline parallelism (PP): TensorRT-LLM’s two axes of model parallelism. TP splits the model horizontally across GPUs; PP divides it into pipeline stages. Each Kubernetes Pod requires TP×PP GPUs.
- Queue-to-compute ratio: Custom Prometheus metric for the HPA, defined as rate(nv_inference_queue_duration_us[1m]) / clamp_min(rate(nv_inference_compute_infer_duration_us[1m]), 1). It measures the average fraction of inference request time spent waiting in the queue versus executing. A value above 1,000m (queue time > compute time) triggers scale-up; see the PodMonitor and PrometheusRule sketch after this list.
- Horizontal Pod Autoscaler (HPA): Kubernetes controller that adjusts the Pod replica count. Configured here for 1–4 replicas, using the Pods metrics type to average the queue-to-compute ratio across all running Pods.
- PodMonitor: Kubernetes CRD (from kube-prometheus) that configures Prometheus to scrape Triton's /metrics endpoint on port 8002 every 6 seconds.
- Prometheus adapter: Translates Prometheus-scraped metrics into the Kubernetes custom metrics API so the HPA can use them for scaling decisions.
- Helm chart: Packages the Triton deployment for Kubernetes; values.yaml specifies GPU type, model name, TP parallelism, container image, and pull secrets. Custom *_values.yaml files override the defaults per model.
- NGC (NVIDIA GPU Cloud): Container registry providing the official nvcr.io/nvidia/tritonserver:<tag>-trtllm-python-py3 base images; an API key is required.
- NVIDIA device plugin / GPU Feature Discovery / DCGM Exporter: Required Kubernetes extensions for GPU-aware scheduling (device plugin), node labeling with GPU attributes (GPU Feature Discovery), and GPU health/metrics export to Prometheus (DCGM Exporter).
- Engine file host-node remapping: Generated TensorRT engine and plan files are stored on the host node and remapped to all Pods scheduled on that node, eliminating re-generation on HPA scale-up.
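Concretely, the metric path from Triton to the HPA can be expressed with two kube-prometheus objects. The following is a minimal sketch rather than the article's exact Helm output: object names, labels, and selectors are assumptions, while the scrape port (metrics, 8002), the 6-second interval, the metric name triton:queue_compute:ratio, and the PromQL expression come from the article.

```yaml
# Minimal sketch: scrape Triton metrics and record the queue-to-compute ratio.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: triton-metrics              # assumed name
  labels:
    release: prometheus             # assumed; must match the Prometheus operator's podMonitorSelector
spec:
  selector:
    matchLabels:
      app: triton                   # assumed label on the Triton Pods
  podMetricsEndpoints:
    - port: metrics                 # container port named "metrics" (8002)
      path: /metrics
      interval: 6s
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: triton-autoscaling-rules    # assumed name
  labels:
    release: prometheus             # assumed
spec:
  groups:
    - name: triton-autoscaling
      interval: 6s
      rules:
        # Recording rule: time spent queued relative to time spent computing.
        - record: triton:queue_compute:ratio
          expr: >-
            rate(nv_inference_queue_duration_us[1m])
            / clamp_min(rate(nv_inference_compute_infer_duration_us[1m]), 1)
```

The Prometheus adapter then exposes the recorded series through the Kubernetes custom metrics API, which is what the HPA queries.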
Key Claims and Findings
- The custom queue-to-compute ratio metric is more meaningful for autoscaling than raw GPU utilization because it directly reflects whether clients are experiencing queue delays.
- Engine files stored on the host node and remapped to Pods: when the HPA scales from 1→4 replicas on the same node, no engine re-build is required, so scale-up is faster; see the Pod-spec sketch after this list.
- Grafana dashboard at localhost:8080 visualizes GPU utilization and the queue-to-compute ratio as time series.
- Cloud-managed Kubernetes supported: AWS EKS, Azure AKS, GCP GKE, OCI OKE.
- Layer 7 load balancers (Traefik, NGINX Plus) or AWS ALB/NLB route inference traffic across Pods.
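As an illustration of the engine-reuse claim above, the Pod-template fragment below sketches a GPU request sized to TP × PP and a host-node volume holding the built engines. The hostPath location, volume name, image tag, and mount path are hypothetical; the three ports and the nvidia.com/gpu resource request follow the article.

```yaml
# Hypothetical Pod-template fragment: 4 GPUs per Pod (TP=4, PP=1) and a
# hostPath volume so engines built once on a node are reused by every
# replica the HPA later schedules onto that node.
spec:
  containers:
    - name: triton
      image: nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3  # tag is an assumption
      ports:
        - name: http
          containerPort: 8000
        - name: grpc
          containerPort: 8001
        - name: metrics
          containerPort: 8002
      resources:
        limits:
          nvidia.com/gpu: 4                  # GPU count per Pod = TP x PP
      volumeMounts:
        - name: engine-repository
          mountPath: /var/run/engines        # hypothetical in-container path
  volumes:
    - name: engine-repository
      hostPath:
        path: /triton/engines                # hypothetical host directory with built engines
        type: DirectoryOrCreate
```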
Kubernetes Resource Configuration Summary
| Resource | Purpose | Key Fields |
|---|---|---|
| Deployment | Set of replicated Triton Pods | spec.replicas, GPU resource request, ports 8000/8001/8002 |
| Service (ClusterIP) | Network endpoint for Triton | Exposes ports 8000 (HTTP), 8001 (gRPC), 8002 (metrics) |
| PodMonitor | Prometheus scrape target | Port metrics (8002), path /metrics |
| PrometheusRule | Custom metric definition | triton:queue_compute:ratio, interval 6s |
| HPA | Autoscaling controller | minReplicas: 1, maxReplicas: 4, target 1000m |
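Expressed as a manifest, the HPA row of the table might look like the following minimal sketch (autoscaling/v2 API). The object and Deployment names are assumptions; the replica bounds, Pods metric type, metric name, and 1,000m average-value target come from the article.

```yaml
# Minimal HPA sketch; names are assumptions, the numbers follow the article.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: triton-hpa                    # assumed name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: triton                      # assumed Deployment name
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Pods
      pods:
        metric:
          name: triton:queue_compute:ratio
        target:
          type: AverageValue
          averageValue: "1000m"       # scale up when queue time exceeds compute time
```

With the Pods metric type, the HPA averages the ratio across all running Triton Pods, so the decision reflects aggregate queuing pressure rather than a single hot replica.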
Connections to Existing Wiki Pages
- What is Kubernetes — foundational concepts: pods, kubelet, HPA, NVIDIA device plugin, DCGM, MIG; complements the hands-on deployment steps in this article.
- Optimization — NVIDIA Triton Inference Server — covers Triton’s optimization features (model analyzer, perf_analyzer, dynamic batching) that complement the Kubernetes autoscaling described here.
- Performance Analysis — TensorRT LLM — deeper profiling of the TensorRT-LLM engine using Nsight Systems once the deployment is running.
Cross-Section Pages
- Scaling LLMs with Triton and TensorRT-LLM (NVIDIA Platform) — angle: TensorRT-LLM, NVIDIA Dynamo Triton, DCGM Exporter, and NGC as NVIDIA platform components in the production LLM serving stack.