Abstract
Authored by Maggie Zhang (NVIDIA Developer Blog, October 2024; note: NVIDIA Triton Inference Server was renamed NVIDIA Dynamo Triton as of March 2025), this article provides a step-by-step walkthrough for deploying LLMs optimized with TensorRT-LLM using Triton Inference Server and autoscaling the deployment with Kubernetes HPA. The pipeline has three phases: (1) build TensorRT-LLM engines from Hugging Face checkpoints, configuring tensor parallelism (TP) and pipeline parallelism (PP) based on GPU count; (2) deploy via a Helm chart that creates a Kubernetes Deployment, Service, and PodMonitor for Triton; (3) autoscale using an HPA driven by a custom Prometheus metric, the queue-to-compute ratio, scraped from Triton's metrics port (8002) every 6 seconds. The HPA scales between 1 and 4 replicas depending on whether the ratio exceeds 1,000m (i.e., queue time > compute time). Load balancing across replicas uses Traefik or NGINX Plus (Layer 7) or cloud load balancers.
Key Concepts
- TensorRT-LLM engine building: Converts Hugging Face model checkpoints to optimized TensorRT engines. Optimizations include kernel fusion, quantization, in-flight batching, and paged KV-cache. GPU count requirement = TP × PP.
- Tensor parallelism (TP) / Pipeline parallelism (PP): TensorRT-LLM’s two axes of model parallelism. TP splits the model horizontally across GPUs; PP divides it into pipeline stages. Each Kubernetes Pod requires TP×PP GPUs.
- Queue-to-compute ratio: Custom Prometheus metric for the HPA, defined as rate(nv_inference_queue_duration_us[1m]) / clamp_min(rate(nv_inference_compute_infer_duration_us[1m]), 1). It measures the average fraction of inference request time spent waiting in the queue versus executing. A value above 1,000m (queue time > compute time) triggers scale-up; see the PodMonitor and PrometheusRule sketch after this list.
- Horizontal Pod Autoscaler (HPA): Kubernetes controller that adjusts the Pod replica count. Configured here for 1–4 replicas, using the Pods metrics type to average the queue-to-compute ratio across all running Pods.
- PodMonitor: Kubernetes CRD (from kube-prometheus) that configures Prometheus to scrape Triton's /metrics endpoint on port 8002 every 6 seconds.
- Prometheus adapter: Translates Prometheus-scraped metrics into the Kubernetes custom metrics API so the HPA can use them for scaling decisions.
- Helm chart: Packages the Triton deployment for Kubernetes; values.yaml specifies GPU type, model name, TP parallelism, container image, and pull secrets. Custom *_values.yaml files override the defaults per model.
- NGC (NVIDIA GPU Cloud): Container registry providing the official nvcr.io/nvidia/tritonserver:<tag>-trtllm-python-py3 base images; an API key is required.
- NVIDIA device plugin / GPU Feature Discovery / DCGM Exporter: Required Kubernetes extensions for GPU-aware scheduling (device plugin), node labeling with GPU attributes (GPU Feature Discovery), and GPU health/metrics export to Prometheus (DCGM Exporter).
- Engine file host-node remapping: Generated TensorRT engine and plan files are stored on the host node and remapped to all Pods scheduled on that node, eliminating re-generation on HPA scale-up.
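Concretely, the metric path from Triton to the HPA can be expressed with two kube-prometheus objects. The following is a minimal sketch rather than the article's exact Helm output: object names, labels, and selectors are assumptions, while the scrape port (metrics, 8002), the 6-second interval, the metric name triton:queue_compute:ratio, and the PromQL expression come from the article.

```yaml
# Minimal sketch: scrape Triton metrics and record the queue-to-compute ratio.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: triton-metrics              # assumed name
  labels:
    release: prometheus             # assumed; must match the Prometheus operator's podMonitorSelector
spec:
  selector:
    matchLabels:
      app: triton                   # assumed label on the Triton Pods
  podMetricsEndpoints:
    - port: metrics                 # container port named "metrics" (8002)
      path: /metrics
      interval: 6s
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: triton-autoscaling-rules    # assumed name
  labels:
    release: prometheus             # assumed
spec:
  groups:
    - name: triton-autoscaling
      interval: 6s
      rules:
        # Recording rule: time spent queued relative to time spent computing.
        - record: triton:queue_compute:ratio
          expr: >-
            rate(nv_inference_queue_duration_us[1m])
            / clamp_min(rate(nv_inference_compute_infer_duration_us[1m]), 1)
```

The Prometheus adapter then exposes the recorded series through the Kubernetes custom metrics API, which is what the HPA queries.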
Key Claims and Findings
- The custom queue-to-compute ratio metric is more meaningful for autoscaling than raw GPU utilization because it directly reflects whether clients are experiencing queue delays.
- Engine files stored on the host node and remapped to Pods: when the HPA scales from 1→4 replicas on the same node, no engine re-build is required, so scale-up is faster; see the Pod-spec sketch after this list.
- Grafana dashboard at localhost:8080 visualizes GPU utilization and the queue-to-compute ratio as time series.
- Cloud-managed Kubernetes supported: AWS EKS, Azure AKS, GCP GKE, OCI OKE.
- Layer 7 load balancers (Traefik, NGINX Plus) or AWS ALB/NLB route inference traffic across Pods.
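As an illustration of the engine-reuse claim above, the Pod-template fragment below sketches a GPU request sized to TP × PP and a host-node volume holding the built engines. The hostPath location, volume name, image tag, and mount path are hypothetical; the three ports and the nvidia.com/gpu resource request follow the article.

```yaml
# Hypothetical Pod-template fragment: 4 GPUs per Pod (TP=4, PP=1) and a
# hostPath volume so engines built once on a node are reused by every
# replica the HPA later schedules onto that node.
spec:
  containers:
    - name: triton
      image: nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3  # tag is an assumption
      ports:
        - name: http
          containerPort: 8000
        - name: grpc
          containerPort: 8001
        - name: metrics
          containerPort: 8002
      resources:
        limits:
          nvidia.com/gpu: 4                  # GPU count per Pod = TP x PP
      volumeMounts:
        - name: engine-repository
          mountPath: /var/run/engines        # hypothetical in-container path
  volumes:
    - name: engine-repository
      hostPath:
        path: /triton/engines                # hypothetical host directory with built engines
        type: DirectoryOrCreate
```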
Kubernetes Resource Configuration Summary
| Resource | Purpose | Key Fields |
|---|---|---|
| Deployment | Set of replicated Triton Pods | spec.replicas, GPU resource request, ports 8000/8001/8002 |
| Service (ClusterIP) | Network endpoint for Triton | Exposes ports 8000 (HTTP), 8001 (gRPC), 8002 (metrics) |
| PodMonitor | Prometheus scrape target | Port metrics (8002), path /metrics |
| PrometheusRule | Custom metric definition | triton:queue_compute:ratio, interval 6s |
| HPA | Autoscaling controller | minReplicas: 1, maxReplicas: 4, target 1000m |
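Expressed as a manifest, the HPA row of the table might look like the following minimal sketch (autoscaling/v2 API). The object and Deployment names are assumptions; the replica bounds, Pods metric type, metric name, and 1,000m average-value target come from the article.

```yaml
# Minimal HPA sketch; names are assumptions, the numbers follow the article.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: triton-hpa                    # assumed name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: triton                      # assumed Deployment name
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Pods
      pods:
        metric:
          name: triton:queue_compute:ratio
        target:
          type: AverageValue
          averageValue: "1000m"       # scale up when queue time exceeds compute time
```

With the Pods metric type, the HPA averages the ratio across all running Triton Pods, so the decision reflects aggregate queuing pressure rather than a single hot replica.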
Connections to Existing Wiki Pages
- What is Kubernetes — foundational concepts: pods, kubelet, HPA, NVIDIA device plugin, DCGM, MIG; complements the hands-on deployment steps in this article.
- Optimization — NVIDIA Triton Inference Server — covers Triton’s optimization features (model analyzer, perf_analyzer, dynamic batching) that complement the Kubernetes autoscaling described here.
- Performance Analysis — TensorRT LLM — deeper profiling of the TensorRT-LLM engine using Nsight Systems once the deployment is running.
Cross-Section Pages
- Scaling LLMs with Triton and TensorRT-LLM (NVIDIA Platform) — angle: TensorRT-LLM, NVIDIA Dynamo Triton, DCGM Exporter, and NGC as NVIDIA platform components in the production LLM serving stack.