Welcome to NVIDIA Run:ai Documentation
NVIDIA Run:ai product documentation — run-ai-docs.nvidia.com, updated 2026-01-24
Abstract
NVIDIA Run:ai is an AI workload orchestration platform purpose-built for accelerating the full AI lifecycle. It maximises GPU efficiency by dynamically distributing training and inference workloads across heterogeneous infrastructure (on-premises, cloud, and hybrid environments) without manual scheduling. Run:ai provides a centralised control plane for managing AI infrastructure, integrates with major AI frameworks and third-party tooling through an API-first open architecture, and enables teams to scale workloads flexibly wherever compute resources reside.
Key Concepts
- AI-native workload orchestration: scheduling engine designed specifically for the bursty, resource-intensive, and long-running patterns of AI training and inference jobs, as opposed to general-purpose Kubernetes scheduling
- Dynamic GPU allocation: Run:ai tracks real-time GPU utilisation and reallocates idle capacity across queued workloads, keeping utilisation high without manual intervention
- Unified AI infrastructure management: single control plane spanning on-premises GPUs, public cloud (multi-cloud), and hybrid environments — teams see one pool of compute regardless of where physical resources reside
- Flexible AI deployment: workloads run wherever they need to — on-prem for data sovereignty, cloud for burst capacity, or hybrid for cost and latency optimisation
- Open architecture: API-first design integrates with major AI frameworks (PyTorch, TensorFlow, JAX), MLOps tools, and third-party solutions without locking users into a proprietary stack; see the sketch after this list
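To illustrate the API-first pattern, the following Python sketch submits a training workload to the control plane over REST. The endpoint path, payload fields, and token handling are assumptions for illustration only, not the documented Run:ai API schema; consult the API reference for the actual request format.

```python
# Minimal sketch of an API-first workload submission.
# NOTE: the endpoint path and payload fields below are illustrative
# assumptions, not the documented Run:ai REST schema.
import os
import requests

BASE_URL = os.environ.get("RUNAI_CONTROL_PLANE_URL", "https://runai.example.com")
TOKEN = os.environ["RUNAI_API_TOKEN"]  # token issuance flow is an assumption here

workload = {
    "name": "bert-finetune",          # hypothetical workload name
    "projectName": "nlp-research",    # hypothetical project
    "image": "nvcr.io/nvidia/pytorch:24.05-py3",
    "gpuRequest": 2,                  # GPUs requested from the shared pool
    "command": ["python", "train.py"],
}

resp = requests.post(
    f"{BASE_URL}/api/v1/workloads",   # illustrative path, not a confirmed endpoint
    json=workload,
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
print("Submitted:", resp.json())
```

Because the integration is plain HTTP plus JSON, the same call can be issued from CI/CD pipelines or existing MLOps tooling without adopting a proprietary SDK.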
Key Capabilities
| Capability | Benefit |
|---|---|
| Dynamic orchestration | Eliminates idle GPU time; jobs are automatically re-queued or migrated |
| Hybrid/multi-cloud support | Unified scheduling across on-prem and cloud with consistent policies |
| AI framework compatibility | Works with PyTorch, TensorFlow, JAX, Hugging Face, and NVIDIA NeMo |
| API-first integration | Connects to existing MLOps pipelines, CI/CD, and monitoring without replacing them |
| Lifecycle coverage | Covers training, fine-tuning, and inference workloads from a single platform |
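The dynamic orchestration behaviour described above can be pictured with a toy re-queueing loop: idle GPUs are handed to waiting jobs the moment running jobs release them. This is a conceptual illustration only, not Run:ai's scheduler; the class and field names are invented for the example.

```python
# Toy illustration of dynamic GPU allocation: idle GPUs are handed to
# queued jobs as soon as running jobs release them. Purely conceptual;
# this is not how Run:ai's scheduler is implemented.
from collections import deque
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    gpus: int       # GPUs the job needs
    steps: int      # remaining "time steps" until completion

def schedule(total_gpus: int, pending: deque) -> None:
    free = total_gpus
    running = []
    while pending or running:
        # Hand idle capacity to queued jobs without manual intervention.
        while pending and pending[0].gpus <= free:
            job = pending.popleft()
            free -= job.gpus
            running.append(job)
            print(f"start {job.name} on {job.gpus} GPU(s), {free} idle")
        # Advance one step; completed jobs return their GPUs to the pool.
        for job in running[:]:
            job.steps -= 1
            if job.steps == 0:
                running.remove(job)
                free += job.gpus
                print(f"finish {job.name}, {free} idle")

schedule(4, deque([Job("train-a", 2, 2), Job("train-b", 2, 1), Job("infer-c", 4, 1)]))
```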
Terminology
- Workload orchestration: automated placement, scheduling, and lifecycle management of compute jobs on a cluster
- GPU efficiency: ratio of actual GPU compute utilised to total available capacity; Run:ai targets near-100% utilisation by eliminating idle time between jobs
- Hybrid AI infrastructure: combination of on-premises compute and one or more public cloud providers managed as a single logical pool
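The GPU efficiency definition above can be made concrete with a short calculation: efficiency is utilised GPU-seconds divided by available GPU-seconds over the same observation window. The sample figures below are invented purely for illustration.

```python
# GPU efficiency = utilised GPU compute / total available capacity,
# measured here in GPU-seconds over one observation window.
# The figures below are made up purely for illustration.

def gpu_efficiency(busy_gpu_seconds: float, num_gpus: int, window_seconds: float) -> float:
    """Fraction of available GPU-seconds actually used in the window."""
    available = num_gpus * window_seconds
    return busy_gpu_seconds / available

# 8 GPUs observed for one hour; jobs kept them busy for 27,000 GPU-seconds total.
print(f"{gpu_efficiency(27_000, 8, 3_600):.0%}")  # -> 94%
```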
Connections to Existing Wiki Pages
- NVIDIA Run:ai (Deployment and Scaling angle) — cross-referenced page covering Run:ai from a scheduling and multi-cloud scaling perspective
- Scaling LLMs with Triton and TensorRT-LLM — complementary article on Kubernetes-based LLM serving that Run:ai orchestrates
- DGX Cloud Benchmarking — benchmarking suite measuring the workload performance that Run:ai scheduling optimises