Abstract

This NVIDIA glossary article provides a foundational overview of Kubernetes as an open-source container orchestration platform, with an emphasis on AI and GPU workloads. Kubernetes groups containers into pods, manages them through the kubelet node agent and controllers, and automates service discovery, load balancing, storage mounting, rollouts/rollbacks, and health monitoring. The article explains NVIDIA's GPU-specific Kubernetes extensions: the device plugin (exposing GPUs as schedulable resources), GPU Feature Discovery (node attribute labeling), DCGM-based monitoring with Prometheus and Grafana, and MIG (Multi-Instance GPU on the A100, up to 7 slices per GPU). NVIDIA Triton Inference Server acts as the hardware abstraction layer within Kubernetes nodes, while Kubernetes handles cluster-level orchestration. Complementary projects such as Kubeflow (ML pipeline management) and Istio (service mesh) extend Kubernetes for AI platform use.

Key Concepts

  • Pod: Kubernetes grouping of co-located containers sharing a network IP address and storage resources. Each pod can contain a Triton server container alongside sidecar containers; this enables tight coupling between components without conflict (see the Pod spec sketch after this list).
  • Kubelet: Node-level agent managing pod lifecycle, container health, and resource allocation on each node. Works with controllers to ensure the desired state is maintained.
  • Horizontal Pod Autoscaler (HPA): Controller that scales Pod replica count based on observed metrics (CPU, memory, or custom metrics via the metrics API). See Scaling LLMs with Triton and TensorRT-LLM Using Kubernetes for a concrete HPA example using Triton metrics; a minimal manifest sketch follows this list.
  • NVIDIA device plugin: Kubernetes extension that exposes GPU resources as schedulable resources (e.g., nvidia.com/gpu: 1 in a Pod spec), enabling the Kubernetes scheduler to place GPU workloads on appropriate nodes.
  • GPU Feature Discovery: Labels Kubernetes nodes with GPU attributes (type, memory, CUDA capability), enabling heterogeneous GPU cluster scheduling; pods can target specific GPU types via node selectors (used in the Pod spec sketch after this list).
  • DCGM (Data Center GPU Manager): NVIDIA tool for GPU health monitoring, providing telemetry on temperature, utilization, memory, and ECC errors. Integrated with Prometheus and Grafana in Kubernetes deployments.
  • MIG (Multi-Instance GPU): Feature introduced with the NVIDIA A100 that partitions a single GPU into up to 7 independent instances, each with isolated compute, memory, and bandwidth. In a DGX A100 (8×A100), MIG yields up to 56 GPU instances, allowing multiple Kubernetes Pods to share a single GPU without conflict (see the MIG request sketch after this list).
  • EGX stack: NVIDIA’s cloud-native, Kubernetes-managed software stack for containerized accelerated AI; enables deploying updated AI containers in minutes.
  • Namespace: Virtual cluster within a physical Kubernetes cluster; enables multi-tenancy by allowing multiple teams to share hardware with isolated resources and services.
  • Kubeflow: Kubernetes extension that streamlines ML workflow and pipeline management; supports distributed training and model serving. Notable extension: BinderHub for container image publishing from git repos.
  • Istio: Open-source service mesh layer for Kubernetes providing automated load balancing, service-to-service authentication, fault injection, and monitoring with minimal code changes.
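
How several of these pieces fit together in practice: a minimal Pod spec sketch, assuming the NVIDIA device plugin and GPU Feature Discovery are installed. The image tag, model repository path, and label value are illustrative assumptions; nvidia.com/gpu is the schedulable resource name exposed by the device plugin, and nvidia.com/gpu.product is a node label set by GPU Feature Discovery.

```yaml
# Hypothetical Pod: a Triton server container requesting one GPU.
# The scheduler can only place this Pod on a node where the NVIDIA
# device plugin has advertised nvidia.com/gpu capacity.
apiVersion: v1
kind: Pod
metadata:
  name: triton-server
spec:
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB  # label from GPU Feature Discovery; value assumed
  containers:
    - name: triton
      image: nvcr.io/nvidia/tritonserver:24.05-py3  # tag is an assumption
      command: ["tritonserver", "--model-repository=/models"]
      resources:
        limits:
          nvidia.com/gpu: 1  # schedulable resource exposed by the device plugin
      volumeMounts:
        - name: model-store
          mountPath: /models
  volumes:
    - name: model-store
      emptyDir: {}  # placeholder; a real deployment would mount persistent storage
```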
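
The HPA manifest referenced above, sketched against a hypothetical Deployment named triton-server and using CPU utilization, the simplest built-in metric. A Triton-metrics-driven HPA, as in the linked article, would instead target a custom metric served through the metrics API.

```yaml
# Hypothetical HPA: keeps average CPU utilization near 60% by adjusting
# the replica count of the triton-server Deployment between 1 and 8.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: triton-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: triton-server  # assumed Deployment name
  minReplicas: 1
  maxReplicas: 8
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
```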
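
The MIG request sketch: when the device plugin runs with its mixed MIG strategy, each configured MIG profile is advertised under its own resource name, so a Pod can request a slice instead of a whole GPU. The profile, image tag, and names below are assumptions.

```yaml
# Hypothetical Pod requesting a single MIG slice rather than a full GPU.
apiVersion: v1
kind: Pod
metadata:
  name: mig-worker
spec:
  containers:
    - name: worker
      image: nvcr.io/nvidia/cuda:12.3.1-base-ubuntu22.04  # tag is an assumption
      command: ["nvidia-smi", "-L"]  # lists the visible MIG device, then exits
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1  # one 1g.5gb slice of the up to 7 per A100
```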

Key Claims and Findings

  • MIG enables applications and resources to scale linearly without dedicating one GPU per Kubernetes node, a prerequisite for full GPU utilization in multi-tenant clusters.
  • Kubernetes + Triton complementarity: Kubernetes orchestrates the cluster (scheduling, scaling, service discovery); Triton abstracts GPU hardware within each node (model serving, batching, backend framework support).
  • Every major cloud provider and computing platform supports the same underlying Kubernetes code base (CNCF governance prevents forks); branded variants (Red Hat OpenShift, AWS EKS) share the same core.
  • Serverless computing relies on containers and Kubernetes for millisecond-scale startup and cost-effective resource use.
  • Kubernetes namespaces enable ops and dev teams to share the same physical machines and services without creating conflicts, which is relevant for multi-team AI platform deployments (see the namespace sketch after this list).
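
A sketch of namespace-based multi-tenancy: one team gets its own virtual cluster, with a ResourceQuota capping the GPUs its Pods may request in aggregate. The namespace name and quota value are assumptions; requests.nvidia.com/gpu is the standard quota key for an extended resource.

```yaml
# Hypothetical namespace for one team, plus a quota on GPU requests.
apiVersion: v1
kind: Namespace
metadata:
  name: team-ml
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: team-ml
spec:
  hard:
    requests.nvidia.com/gpu: "4"  # at most 4 GPUs requested across the namespace
```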

GPU-Accelerated Kubernetes Feature Set (NVIDIA)

  • NVIDIA device plugin: Exposes nvidia.com/gpu as a schedulable resource
  • GPU Feature Discovery: Labels nodes with GPU type, memory, and capabilities
  • DCGM Exporter: Exports GPU health/utilization metrics to Prometheus (DaemonSet sketch below)
  • MIG (A100): Partitions a GPU into up to 7 isolated instances
  • EGX stack: Full cloud-native AI stack managed by Kubernetes
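
A sketch of the monitoring piece: dcgm-exporter typically runs as a DaemonSet so every GPU node exports metrics on a port that Prometheus scrapes. The namespace, image tag, and annotation-based scrape discovery are assumptions; production deployments usually rely on the NVIDIA GPU Operator or the published Helm chart instead.

```yaml
# Hypothetical DaemonSet running dcgm-exporter on each GPU node.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: monitoring  # assumed namespace
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
      annotations:
        prometheus.io/scrape: "true"  # assumes annotation-based Prometheus discovery
        prometheus.io/port: "9400"
    spec:
      containers:
        - name: dcgm-exporter
          image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.0-ubuntu22.04  # tag is an assumption
          ports:
            - name: metrics
              containerPort: 9400  # dcgm-exporter's default metrics port
```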

Connections to Existing Wiki Pages