Welcome to NVIDIA Run:ai Documentation

NVIDIA Run:ai product documentation — run-ai-docs.nvidia.com, updated 2026-01-24

Abstract

NVIDIA Run:ai is an AI workload orchestration platform purpose-built to accelerate the full AI lifecycle. It maximises GPU efficiency by dynamically distributing training and inference workloads across heterogeneous infrastructure — on-premises, cloud, and hybrid environments — without manual scheduling. Run:ai provides a centralised control plane for managing AI infrastructure, integrates with all major AI frameworks and third-party tooling through an API-first open architecture, and lets teams scale workloads wherever compute resources reside.

Key Concepts

  • AI-native workload orchestration: scheduling engine designed specifically for the bursty, resource-intensive, and long-running patterns of AI training and inference jobs, as opposed to general-purpose Kubernetes scheduling
  • Dynamic GPU allocation: Run:ai tracks real-time GPU utilisation and reallocates idle capacity across queued workloads, keeping utilisation high without manual intervention (a toy sketch of this reallocation loop follows this list)
  • Unified AI infrastructure management: single control plane spanning on-premises GPUs, public cloud (multi-cloud), and hybrid environments — teams see one pool of compute regardless of where physical resources reside
  • Flexible AI deployment: workloads run wherever they need to — on-prem for data sovereignty, cloud for burst capacity, or hybrid for cost and latency optimisation
  • Open architecture: API-first design integrates with all major AI frameworks (PyTorch, TensorFlow, JAX), MLOps tools, and third-party solutions without locking users into a proprietary stack
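
As a conceptual illustration of the dynamic-allocation loop described above, the following minimal Python sketch models a shared GPU pool and a greedy scheduler that hands idle capacity to queued jobs. This is a toy model, not Run:ai's actual scheduling algorithm; the job names and the largest-first policy are illustrative assumptions.

```python
from dataclasses import dataclass, field
from collections import deque

@dataclass
class Job:
    name: str
    gpus_requested: int

@dataclass
class Cluster:
    total_gpus: int
    allocated: dict = field(default_factory=dict)  # job name -> GPUs held

    @property
    def idle_gpus(self) -> int:
        return self.total_gpus - sum(self.allocated.values())

def schedule(cluster: Cluster, queue: deque) -> None:
    # Greedy pass: hand idle GPUs to queued jobs, largest request first.
    for job in sorted(queue, key=lambda j: j.gpus_requested, reverse=True):
        if job.gpus_requested <= cluster.idle_gpus:
            cluster.allocated[job.name] = job.gpus_requested
            queue.remove(job)

cluster = Cluster(total_gpus=8)
queue = deque([Job("train-llm", 4), Job("finetune", 2), Job("infer", 1)])

schedule(cluster, queue)
print(cluster.allocated, cluster.idle_gpus)  # all three jobs placed, 1 GPU idle

# When a job finishes, its GPUs return to the pool and the next
# scheduling pass immediately reuses them for waiting work.
del cluster.allocated["train-llm"]
schedule(cluster, deque([Job("train-vision", 4)]))
print(cluster.allocated, cluster.idle_gpus)  # train-vision takes the freed GPUs
```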

Key Capabilities

| Capability | Benefit |
| --- | --- |
| Dynamic orchestration | Eliminates idle GPU time; jobs are automatically re-queued or migrated |
| Hybrid/multi-cloud support | Unified scheduling across on-prem and cloud with consistent policies |
| AI framework compatibility | Works with PyTorch, TensorFlow, JAX, Hugging Face, and NVIDIA NeMo |
| API-first integration | Connects to existing MLOps pipelines, CI/CD, and monitoring without replacing them |
| Lifecycle coverage | Covers training, fine-tuning, and inference workloads from a single platform |
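
The snippet below sketches what API-first integration might look like from Python. The base URL, endpoint path, payload fields, and authentication flow are assumptions made for illustration — consult the Run:ai REST API reference for the actual paths and schemas.

```python
import requests

# Hypothetical endpoint and payload shape, for illustration only.
BASE_URL = "https://<tenant>.run.ai/api/v1"  # placeholder tenant URL
TOKEN = "<bearer-token>"                     # placeholder credential

# A training workload request: container image, command, and GPU count.
workload = {
    "name": "bert-finetune",
    "projectId": "team-nlp",  # placeholder project
    "spec": {
        "image": "nvcr.io/nvidia/pytorch:24.01-py3",
        "command": "python train.py",
        "compute": {"gpuDevicesRequest": 2},
    },
}

resp = requests.post(
    f"{BASE_URL}/workloads/trainings",  # assumed path — verify against the API docs
    json=workload,
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```

Because the platform is driven entirely through such HTTP calls, the same submission can be issued from a CI/CD pipeline or an MLOps tool rather than a human at a console.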

Terminology

  • Workload orchestration: automated placement, scheduling, and lifecycle management of compute jobs on a cluster
  • GPU efficiency: ratio of actual GPU compute utilised to total available capacity; Run:ai targets near-100% utilisation by eliminating idle time between jobs (a worked example follows this list)
  • Hybrid AI infrastructure: combination of on-premises compute and one or more public cloud providers managed as a single logical pool
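
As a back-of-envelope illustration of the GPU-efficiency metric, the function below computes the simple busy-hours over available-hours ratio; this is an assumed formulation for clarity, and Run:ai's dashboards may compute utilisation differently.

```python
def gpu_efficiency(gpu_busy_hours: float, gpu_available_hours: float) -> float:
    """Fraction of available GPU capacity that did useful work."""
    return gpu_busy_hours / gpu_available_hours

# A 16-GPU cluster over a 24-hour window offers 16 * 24 = 384 GPU-hours.
# If jobs consumed 288 of them, efficiency is 75%; the remaining 25% is
# idle time that dynamic orchestration aims to reclaim.
print(f"{gpu_efficiency(288, 384):.0%}")  # 75%
```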
