Welcome to NVIDIA Run:ai Documentation

Cross-section page — Deployment and Scaling angle. See primary page for the full summary.

Deployment and Scaling Angle

NVIDIA Run:ai addresses a core challenge in production AI: keeping GPU clusters fully utilised as training and inference workloads compete for resources. From a deployment and scaling perspective, its key contributions are:

Dynamic Workload Scheduling

Run:ai continuously monitors GPU utilisation across the cluster and re-allocates idle capacity to queued jobs without manual intervention. This eliminates the idle time between training runs and inference bursts that typically leaves expensive GPU clusters underutilised.
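The idea can be illustrated with a toy first-fit placement loop. This is a minimal sketch of the concept only, not Run:ai's actual scheduling algorithm; the node names, job names, and data shapes are invented for illustration:

```python
from collections import deque

def schedule(nodes, queue):
    """Toy dynamic scheduler sketch (NOT Run:ai's algorithm).

    nodes: {node_name: free_gpus}; queue: deque of (job, gpus_needed).
    Places queued jobs onto nodes with idle capacity, first-fit,
    until the head of the queue can no longer be placed.
    Returns a list of (job, node) placements; mutates nodes and queue."""
    placements = []
    while queue:
        job, need = queue[0]
        # first-fit: take the first node with enough free GPUs
        target = next((n for n, free in nodes.items() if free >= need), None)
        if target is None:
            break  # head of queue must wait for capacity to free up
        nodes[target] -= need
        placements.append((job, target))
        queue.popleft()
    return placements

nodes = {"node-a": 2, "node-b": 4}
queue = deque([("train-llm", 4), ("infer-svc", 1)])
print(schedule(nodes, queue))  # → [('train-llm', 'node-b'), ('infer-svc', 'node-a')]
```

In a real cluster the "free GPUs" signal would come from continuous utilisation monitoring rather than a static dictionary; the point is that idle capacity is matched to queued work without an operator in the loop.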

Scaling Across Heterogeneous Infrastructure

Run:ai supports scaling AI workloads across on-premises GPU clusters, public cloud GPUs, and hybrid combinations of the two under a single control plane. Teams do not need separate tools for on-prem scheduling and cloud bursting; Run:ai presents a unified resource pool. This is directly relevant to the NCP-AAI deployment-and-scaling exam topic of “operationalising and scaling agentic systems.”
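The cloud-bursting behaviour can be sketched as a placement preference over one unified pool. Again a hedged toy, not Run:ai's placement logic; the tier names and numbers are invented:

```python
# Toy sketch of cloud bursting under a unified resource pool (an
# illustration only, not Run:ai's placement logic): prefer on-prem
# capacity, and spill over to cloud GPUs only when it is exhausted.

ON_PREM, CLOUD = "on-prem", "cloud"

def place(pools, gpus_needed):
    """pools: {tier: free_gpus}. Returns the tier chosen, or None."""
    for tier in (ON_PREM, CLOUD):  # on-prem first, then burst to cloud
        if pools.get(tier, 0) >= gpus_needed:
            pools[tier] -= gpus_needed
            return tier
    return None

pools = {ON_PREM: 2, CLOUD: 8}
print(place(pools, 2))  # → on-prem (room available)
print(place(pools, 4))  # → cloud (on-prem exhausted, burst)
```

The key property is that the caller submits against one pool and never chooses infrastructure explicitly; the control plane decides where the workload lands.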

Integration With Existing MLOps Stacks

An API-first architecture means Run:ai integrates with existing Kubernetes deployments, CI/CD pipelines, and framework tooling (PyTorch, TensorFlow, JAX, NVIDIA NeMo) without replacing them. Agent and training workloads submitted through the existing toolchain are scheduled by Run:ai transparently.
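In Kubernetes terms, this transparency typically comes down to pointing an ordinary workload manifest at the Run:ai scheduler rather than the default one. A sketch of what such a manifest might look like; the job name, queue label, and container image are placeholders, and the exact label keys should be checked against the Run:ai documentation for your version:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: train-job              # placeholder name
  labels:
    project: team-a            # hypothetical Run:ai project/queue label
spec:
  template:
    spec:
      schedulerName: runai-scheduler   # hand the pod to Run:ai's scheduler
      containers:
        - name: trainer
          image: nvcr.io/nvidia/pytorch:24.01-py3   # example image
          resources:
            limits:
              nvidia.com/gpu: 1
      restartPolicy: Never
```

The rest of the manifest is standard Kubernetes, which is the point: existing CI/CD pipelines that already apply manifests need no structural changes to have their workloads scheduled by Run:ai.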

Connections