Welcome to NVIDIA Run:ai Documentation
Cross-section page — Deployment and Scaling angle. See primary page for the full summary.
Deployment and Scaling Angle
NVIDIA Run:ai addresses a core challenge in production AI: keeping GPU clusters fully utilised as training and inference workloads compete for resources. From a deployment and scaling perspective, its key contributions are:
Dynamic Workload Scheduling
Run:ai continuously monitors GPU utilisation across the cluster and reallocates idle capacity to queued jobs without manual intervention. This reduces the idle time between training runs and inference bursts that typically leaves expensive GPU clusters underutilised.
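To make the idea concrete, the sketch below models one pass of such a scheduler in plain Python: queued jobs are started whenever enough idle GPUs exist, with no operator involvement. This is a minimal illustration of the scheduling concept, not Run:ai's actual algorithm; all class and function names are hypothetical.

```python
from dataclasses import dataclass, field
from collections import deque

@dataclass
class Job:
    name: str
    gpus_requested: int

@dataclass
class Cluster:
    total_gpus: int
    allocations: dict = field(default_factory=dict)  # job name -> GPUs held

    @property
    def idle_gpus(self) -> int:
        return self.total_gpus - sum(self.allocations.values())

def schedule_tick(cluster: Cluster, queue: deque) -> None:
    """One pass of a simplified scheduler: hand idle GPUs to queued jobs."""
    while queue and cluster.idle_gpus >= queue[0].gpus_requested:
        job = queue.popleft()
        cluster.allocations[job.name] = job.gpus_requested
        print(f"started {job.name} on {job.gpus_requested} GPUs "
              f"({cluster.idle_gpus} idle remain)")

if __name__ == "__main__":
    cluster = Cluster(total_gpus=8, allocations={"train-llm": 4})
    queue = deque([Job("inference-burst", 2), Job("finetune", 4)])

    schedule_tick(cluster, queue)         # inference-burst fits into idle capacity
    del cluster.allocations["train-llm"]  # training run finishes, GPUs go idle
    schedule_tick(cluster, queue)         # finetune starts without manual action
```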
Scaling Across Heterogeneous Infrastructure
Run:ai supports scaling AI workloads across on-premises GPU clusters, public cloud GPUs, and hybrid combinations under a single control plane. Teams do not need separate tools for on-prem scheduling and cloud bursting; Run:ai presents a unified resource pool. This is directly relevant to the NCP-AAI deployment-and-scaling exam topic of “operationalising and scaling agentic systems.”
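As an illustration of the unified-pool idea, the sketch below prefers local capacity and bursts to cloud only when it is exhausted. The pool names, cost ordering, and placement policy are assumptions made for illustration; in a real deployment, node pools and burst policies are configured in the platform rather than in application code.

```python
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class GpuPool:
    name: str
    free_gpus: int
    cost_rank: int  # lower = preferred (e.g. on-prem before cloud burst)

def place(gpus_needed: int, pools: list[GpuPool]) -> str | None:
    """Pick the cheapest pool with enough free GPUs; burst to cloud only if needed."""
    for pool in sorted(pools, key=lambda p: p.cost_rank):
        if pool.free_gpus >= gpus_needed:
            pool.free_gpus -= gpus_needed
            return pool.name
    return None  # nothing fits anywhere; the job stays queued

pools = [
    GpuPool("on-prem-dgx", free_gpus=2, cost_rank=0),
    GpuPool("cloud-burst", free_gpus=16, cost_rank=1),
]

print(place(2, pools))   # -> on-prem-dgx (local capacity is preferred)
print(place(8, pools))   # -> cloud-burst (local pool exhausted, burst to cloud)
```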
Integration With Existing MLOps Stacks
An API-first architecture means Run:ai integrates with existing Kubernetes deployments, CI/CD pipelines, and framework tools (PyTorch, TensorFlow, JAX, NVIDIA NeMo) without replacing them. Agents and training workloads submitted via the existing toolchain are scheduled by Run:ai transparently.
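Because the scheduler sits behind the standard Kubernetes API, an existing pipeline can keep submitting ordinary Job manifests. The sketch below uses the official kubernetes Python client to submit a GPU training Job; the scheduler name, queue label, container image, and entrypoint are assumptions that vary by installation and are shown only to illustrate the integration point.

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster

# A plain Kubernetes Job; only scheduler_name and the project/queue label are
# assumed Run:ai-specific values here, and both may differ per installation.
job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(
        name="finetune-bert",
        labels={"runai/queue": "team-nlp"},        # assumed project/queue label
    ),
    spec=client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                scheduler_name="runai-scheduler",  # hand scheduling to Run:ai
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="trainer",
                        image="nvcr.io/nvidia/pytorch:24.05-py3",  # placeholder image
                        command=["python", "train.py"],            # placeholder entrypoint
                        resources=client.V1ResourceRequirements(
                            limits={"nvidia.com/gpu": "2"},
                        ),
                    )
                ],
            )
        )
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="team-nlp", body=job)
```

The existing CI/CD pipeline or training framework keeps producing standard Kubernetes objects; pointing them at the Run:ai scheduler is the only change, which is what makes the integration transparent to the submitting toolchain.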
Connections
- Scaling LLMs with Triton and TensorRT-LLM Using Kubernetes — the Kubernetes-based inference stack that Run:ai can orchestrate at the cluster level
- DGX Cloud Benchmarking — benchmarking the workload throughput that Run:ai's dynamic scheduling helps maximise