Welcome to NVIDIA Run:ai Documentation
NVIDIA Run:ai product documentation — run-ai-docs.nvidia.com, updated 2026-01-24
Abstract
NVIDIA Run:ai is an AI workload orchestration platform purpose-built for accelerating the full AI lifecycle. It maximises GPU efficiency by dynamically distributing training and inference workloads across heterogeneous infrastructure (on-premises, cloud, and hybrid environments) without manual scheduling. Run:ai provides a centralised control plane for managing AI infrastructure, integrates with major AI frameworks and third-party tooling through an API-first open architecture, and enables teams to scale workloads flexibly wherever compute resources reside.
Key Concepts
- AI-native workload orchestration: scheduling engine designed specifically for the bursty, resource-intensive, and long-running patterns of AI training and inference jobs, as opposed to general-purpose Kubernetes scheduling
- Dynamic GPU allocation: Run:ai tracks real-time GPU utilisation and reallocates idle capacity across queued workloads, keeping utilisation high without manual intervention
- Unified AI infrastructure management: single control plane spanning on-premises GPUs, public cloud (multi-cloud), and hybrid environments — teams see one pool of compute regardless of where physical resources reside
- Flexible AI deployment: workloads run wherever they need to — on-prem for data sovereignty, cloud for burst capacity, or hybrid for cost and latency optimisation
- Open architecture: API-first design integrates with major AI frameworks (PyTorch, TensorFlow, JAX), MLOps tools, and third-party solutions without locking users into a proprietary stack; see the sketch after this list
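To illustrate the API-first pattern, the following Python sketch submits a training workload to the control plane over REST. The endpoint path, payload fields, and token handling are assumptions for illustration only, not the documented Run:ai API schema; consult the API reference for the actual request format.

```python
# Minimal sketch of an API-first workload submission.
# NOTE: the endpoint path and payload fields below are illustrative
# assumptions, not the documented Run:ai REST schema.
import os
import requests

BASE_URL = os.environ.get("RUNAI_CONTROL_PLANE_URL", "https://runai.example.com")
TOKEN = os.environ["RUNAI_API_TOKEN"]  # token issuance flow is an assumption here

workload = {
    "name": "bert-finetune",          # hypothetical workload name
    "projectName": "nlp-research",    # hypothetical project
    "image": "nvcr.io/nvidia/pytorch:24.05-py3",
    "gpuRequest": 2,                  # GPUs requested from the shared pool
    "command": ["python", "train.py"],
}

resp = requests.post(
    f"{BASE_URL}/api/v1/workloads",   # illustrative path, not a confirmed endpoint
    json=workload,
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
print("Submitted:", resp.json())
```

Because the integration is plain HTTP plus JSON, the same call can be issued from CI/CD pipelines or existing MLOps tooling without adopting a proprietary SDK.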
Key Capabilities
| Capability | Benefit |
|---|---|
| Dynamic orchestration | Eliminates idle GPU time; jobs are automatically re-queued or migrated |
| Hybrid/multi-cloud support | Unified scheduling across on-prem and cloud with consistent policies |
| AI framework compatibility | Works with PyTorch, TensorFlow, JAX, Hugging Face, and NVIDIA NeMo |
| API-first integration | Connects to existing MLOps pipelines, CI/CD, and monitoring without replacing them |
| Lifecycle coverage | Covers training, fine-tuning, and inference workloads from a single platform |
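The dynamic orchestration behaviour described above can be pictured with a toy re-queueing loop: idle GPUs are handed to waiting jobs the moment running jobs release them. This is a conceptual illustration only, not Run:ai's scheduler; the class and field names are invented for the example.

```python
# Toy illustration of dynamic GPU allocation: idle GPUs are handed to
# queued jobs as soon as running jobs release them. Purely conceptual;
# this is not how Run:ai's scheduler is implemented.
from collections import deque
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    gpus: int       # GPUs the job needs
    steps: int      # remaining "time steps" until completion

def schedule(total_gpus: int, pending: deque) -> None:
    free = total_gpus
    running = []
    while pending or running:
        # Hand idle capacity to queued jobs without manual intervention.
        while pending and pending[0].gpus <= free:
            job = pending.popleft()
            free -= job.gpus
            running.append(job)
            print(f"start {job.name} on {job.gpus} GPU(s), {free} idle")
        # Advance one step; completed jobs return their GPUs to the pool.
        for job in running[:]:
            job.steps -= 1
            if job.steps == 0:
                running.remove(job)
                free += job.gpus
                print(f"finish {job.name}, {free} idle")

schedule(4, deque([Job("train-a", 2, 2), Job("train-b", 2, 1), Job("infer-c", 4, 1)]))
```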
Terminology
- Workload orchestration: automated placement, scheduling, and lifecycle management of compute jobs on a cluster
- GPU efficiency: ratio of actual GPU compute utilised to total available capacity; Run:ai targets near-100% utilisation by eliminating idle time between jobs
- Hybrid AI infrastructure: combination of on-premises compute and one or more public cloud providers managed as a single logical pool
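The GPU efficiency definition above can be made concrete with a short calculation: efficiency is utilised GPU-seconds divided by available GPU-seconds over the same observation window. The sample figures below are invented purely for illustration.

```python
# GPU efficiency = utilised GPU compute / total available capacity,
# measured here in GPU-seconds over one observation window.
# The figures below are made up purely for illustration.

def gpu_efficiency(busy_gpu_seconds: float, num_gpus: int, window_seconds: float) -> float:
    """Fraction of available GPU-seconds actually used in the window."""
    available = num_gpus * window_seconds
    return busy_gpu_seconds / available

# 8 GPUs observed for one hour; jobs kept them busy for 27,000 GPU-seconds total.
print(f"{gpu_efficiency(27_000, 8, 3_600):.0%}")  # -> 94%
```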
Connections to Existing Wiki Pages
- NVIDIA Run:ai (Deployment and Scaling angle) — cross-referenced page covering Run:ai from a scheduling and multi-cloud scaling perspective
- Scaling LLMs with Triton and TensorRT-LLM — complementary article on Kubernetes-based LLM serving that Run:ai orchestrates
- DGX Cloud Benchmarking — benchmarking suite measuring the workload performance that Run:ai scheduling optimises