Abstract

Authored by Maggie Zhang (NVIDIA Developer Blog, October 2024; note: NVIDIA Triton Inference Server was renamed NVIDIA Dynamo Triton in March 2025), this article provides a step-by-step walkthrough for deploying LLMs optimized with TensorRT-LLM on Triton Inference Server and autoscaling the deployment with the Kubernetes Horizontal Pod Autoscaler (HPA). The pipeline has three phases: (1) build TensorRT-LLM engines from Hugging Face checkpoints, configuring tensor parallelism (TP) and pipeline parallelism (PP) based on GPU count; (2) deploy via a Helm chart that creates a Kubernetes Deployment, Service, and PodMonitor for Triton; (3) autoscale with an HPA driven by a custom Prometheus metric, the queue-to-compute ratio, scraped from Triton's metrics port (8002) every 6 seconds. The HPA scales between 1 and 4 replicas depending on whether the ratio exceeds 1,000 milliunits (i.e., queue time exceeds compute time). Load balancing across replicas uses Layer 7 load balancers such as Traefik or NGINX Plus, or cloud-provider load balancers.

Key Concepts

  • TensorRT-LLM engine building: Converts Hugging Face model checkpoints to optimized TensorRT engines. Optimizations include kernel fusion, quantization, in-flight batching, and paged KV-cache. GPU count requirement = TP × PP.
  • Tensor parallelism (TP) / Pipeline parallelism (PP): TensorRT-LLM’s two axes of model parallelism. TP splits the model horizontally across GPUs; PP divides it into pipeline stages. Each Kubernetes Pod requires TP×PP GPUs.
  • Queue-to-compute ratio: Custom Prometheus metric for HPA, defined as `rate(nv_inference_queue_duration_us[1m]) / clamp_min(rate(nv_inference_compute_infer_duration_us[1m]), 1)`. Measures the ratio of time inference requests spend waiting in queue to time spent executing. A value above 1,000 milliunits (queue time > compute time) triggers scale-up.
  • Horizontal Pod Autoscaler (HPA): Kubernetes controller that adjusts Pod replica count. Configured here for 1–4 replicas, using the Pods metric type to average the queue-to-compute ratio across all running Pods.
  • PodMonitor: Kubernetes CRD (from kube-prometheus) that configures Prometheus to scrape Triton’s /metrics endpoint on port 8002 every 6 seconds.
  • Prometheus adapter: Translates Prometheus-scraped metrics into the Kubernetes custom metrics API so HPA can use them for scaling decisions.
  • Helm chart: Packaged Kubernetes deployment managed by Helm; values.yaml specifies GPU type, model name, TP parallelism, container image, and pull secrets. Custom *_values.yaml files override defaults per model.
  • NGC (NVIDIA GPU Cloud): Container registry providing official nvcr.io/nvidia/tritonserver:<tag>-trtllm-python-py3 base images; API key required.
  • NVIDIA device plugin / GPU Feature Discovery / DCGM Exporter: Required Kubernetes extensions for GPU-aware scheduling (device plugin), node labeling with GPU attributes (GPU Feature Discovery), and GPU health/metrics export to Prometheus (DCGM Exporter).
  • Engine file host-node remapping: Generated TensorRT engine and plan files are stored on the host node and remapped to all Pods scheduled on that node — eliminates re-generation on HPA scale-up.
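The PodMonitor and recording-rule concepts above can be sketched as Kubernetes manifests. This is an illustrative sketch, not taken verbatim from the article: the resource names and the `app: triton` / `release: prometheus` labels are assumptions, while the port, path, interval, metric name, and PromQL expression come from the summary above.

```yaml
# PodMonitor: tells Prometheus (via the kube-prometheus operator) to scrape
# Triton's /metrics endpoint on port 8002 every 6 seconds.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: triton-metrics          # illustrative name
  labels:
    release: prometheus         # assumed label matched by the Prometheus CR
spec:
  selector:
    matchLabels:
      app: triton               # assumed Pod label set by the Helm chart
  podMetricsEndpoints:
    - port: metrics             # named container port mapped to 8002
      path: /metrics
      interval: 6s
---
# PrometheusRule: records the queue-to-compute ratio under the name the
# HPA consumes (triton:queue_compute:ratio).
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: triton-queue-compute    # illustrative name
spec:
  groups:
    - name: triton
      interval: 6s
      rules:
        - record: triton:queue_compute:ratio
          expr: >
            rate(nv_inference_queue_duration_us[1m])
            / clamp_min(rate(nv_inference_compute_infer_duration_us[1m]), 1)
```

The `clamp_min(..., 1)` in the denominator guards against division by zero when no inferences are executing.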
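A per-model *_values.yaml override of the kind described above might look like the following. All field names here are assumptions for illustration (consult the chart's own values.yaml for the exact schema); the image naming pattern and the role of each field come from the summary above.

```yaml
# Example override file (e.g. llama_values.yaml), layered on the chart's
# defaults with: helm install <release> <chart> -f llama_values.yaml
# Field names below are illustrative assumptions, not the chart's schema.
gpu: NVIDIA-A100-SXM4-80GB      # GPU product label from GPU Feature Discovery
model:
  name: llama-2-7b              # Hugging Face checkpoint to build an engine from
  tensorParallelism: 1          # TP x PP must equal the GPUs requested per Pod
triton:
  image: nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3  # tag is illustrative
pullSecrets:
  - name: ngc-container-pull    # NGC API key stored as an image pull secret
```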

Key Claims and Findings

  • The custom queue-to-compute ratio metric is more meaningful for autoscaling than raw GPU utilization because it directly reflects whether clients are experiencing queue delays.
  • Because engine files are stored on the host node and remapped into Pods, scaling from 1 to 4 replicas on the same node requires no engine rebuild, which speeds up scale-up.
  • Grafana dashboard at localhost:8080 visualizes GPU utilization and queue-to-compute ratio in time series.
  • Cloud-managed Kubernetes supported: AWS EKS, Azure AKS, GCP GKE, OCI OKE.
  • Layer 7 load balancers (Traefik, NGINX Plus) or AWS ALB/NLB route inference traffic across Pods.

Kubernetes Resource Configuration Summary

| Resource | Purpose | Key Fields |
| --- | --- | --- |
| Deployment | Set of replicated Triton Pods | `spec.replicas`, GPU resource request, ports 8000/8001/8002 |
| Service (ClusterIP) | Network endpoint for Triton | Exposes ports 8000 (HTTP), 8001 (gRPC), 8002 (metrics) |
| PodMonitor | Prometheus scrape target | Port metrics (8002), path `/metrics` |
| PrometheusRule | Custom metric definition | `triton:queue_compute:ratio`, interval 6s |
| HPA | Autoscaling controller | `minReplicas: 1`, `maxReplicas: 4`, target 1000m |
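The HPA configuration summarized above can be sketched as a manifest. The metric name, replica bounds, and 1000m target come from the summary; the resource and Deployment names are illustrative assumptions.

```yaml
# HPA on the custom Pods metric: averages triton:queue_compute:ratio across
# running Pods (served through the Prometheus adapter) and scales between
# 1 and 4 replicas toward a target of 1000m, i.e. queue time == compute time.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: triton-hpa              # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: triton                # assumed Deployment name from the Helm chart
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Pods
      pods:
        metric:
          name: triton:queue_compute:ratio
        target:
          type: AverageValue
          averageValue: 1000m   # scale up when queue time exceeds compute time
```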

Connections to Existing Wiki Pages

Cross-Section Pages