Abstract

NVIDIA DGX Cloud Benchmarking is a suite of tools providing standardized, objective performance metrics for AI training and inference workloads, accounting for infrastructure software, cloud platforms, and application configurations—not just raw GPU specs. Authored by Emily Potyraj (NVIDIA, March 2025), the article presents benchmark data on three key TCO levers: GPU count scaling (near-linear; Llama 3.1 70B achieves 97% time reduction training 1 trillion tokens for just 2.6% more cost); precision (FP8 yields higher throughput and lower cost-to-train than BF16 on Hopper architectures, but requires NVIDIA Transformer Engine for per-layer numerical stability); and NeMo framework version (2024 software co-engineering produced a 25% throughput increase). The tool enables organizations to quantify trade-offs without benchmarking from scratch and is validated across AWS, Google Cloud, Microsoft Azure, Oracle Cloud, CoreWeave, Crusoe, and Nebius.

Key Concepts

  • DGX Cloud Benchmarking Performance Explorer: Interactive tool for exploring GPU count vs. time-to-train and cost-to-train trade-offs across model, precision, and framework configurations.
  • Near-linear scaling: Scaling Llama 3.1 70B training across many GPUs follows near-ideal linear speedup; slight deviation at high GPU counts is due to communication overhead between nodes.
  • FP8 vs BF16 precision: FP8 increases throughput and reduces cost-to-train; its narrower dynamic range risks numerical instability without per-tensor or sub-block scaling. Transformer Engine in Hopper and Blackwell architectures provides selective per-layer FP8 application to preserve accuracy.
  • Transformer Engine: NVIDIA hardware-software feature (Hopper/Blackwell) for applying FP8 selectively on a per-layer basis, using reduced precision only where it will not adversely affect accuracy.
  • NeMo Framework version optimization: On identical hardware with the same model, the NeMo framework delivered a 25% throughput gain in 2024 through ongoing deep hardware-software co-engineering. Framework choice affects the workload's infrastructure fingerprint, its communication patterns, and access to continuous optimizations.
  • DGX Cloud Benchmarking Recipes: Published tuning best practices and example baseline results for specific model/hardware combinations, enabling comparison against a reference.
  • TCO (Total Cost of Ownership): Broader metric than hourly GPU cost; includes time-to-train, software efficiency, and cluster scale. Near-linear scaling means adding GPUs accelerates delivery without proportional cost increase.
  • Quantization-aware training (QAT) / Post-training quantization (PTQ): Paths to FP8 inference from BF16-trained models; models trained in FP8 can also be deployed directly for FP8 inference.
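The per-tensor scaling idea behind FP8 stability can be sketched in a few lines. This is an illustrative toy, not the Transformer Engine implementation: it models only E4M3's limited dynamic range (max magnitude 448) and ignores mantissa rounding; the sample activation values are made up.

```python
# Toy model of why per-tensor scaling matters for FP8 (E4M3).
# Only the limited dynamic range is simulated; mantissa rounding
# and the real Transformer Engine recipe are out of scope.
E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3

def clamp_fp8(x):
    # Out-of-range values saturate at the FP8 limit.
    return max(-E4M3_MAX, min(E4M3_MAX, x))

def fp8_roundtrip(values):
    """Per-tensor scaling: map the tensor's max magnitude onto the
    FP8 limit, clamp, then rescale, so no element saturates."""
    amax = max(abs(v) for v in values) or 1.0
    scale = E4M3_MAX / amax
    return [clamp_fp8(v * scale) / scale for v in values]

acts = [0.01, -3.2, 1500.0]           # hypothetical activations; 1500 exceeds E4M3 range
naive = [clamp_fp8(v) for v in acts]  # no scaling: 1500 saturates to 448
scaled = fp8_roundtrip(acts)          # with scaling: 1500 survives the round trip
```

Without the scale factor the large activation is silently clipped, corrupting the forward pass; with per-tensor scaling the full range fits inside FP8, which is the kind of bookkeeping Transformer Engine automates per layer.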

Key Claims and Findings

  • Training Llama 3.1 70B on 1 trillion tokens: 115.4 days → 3.8 days (97% reduction) at only a 2.6% cost increase by scaling GPU count — demonstrates the asymmetric cost/time benefit of scaling.
  • FP8 enables higher math and communication throughput and lower memory bandwidth requirements compared to BF16; it can also enable larger models to fit on fewer GPUs.
  • A model trained in FP8 can be deployed directly for FP8 inference; BF16-trained models require QAT or PTQ for FP8 inference.
  • NeMo software optimization alone produced a 25% platform performance improvement in 2024 — framework version selection has a significant, measurable impact on throughput.
  • Traditional chip-level metrics (FLOPs, hourly GPU cost) are insufficient for AI performance assessment — end-to-end platform benchmarking is required to capture software efficiency and infrastructure interaction.
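The scaling claim above can be sanity-checked with ratio arithmetic. The days and cost delta come from the article; GPU counts are not given, so the sketch below reasons purely in ratios (total cost is assumed proportional to GPU-hours).

```python
# Back-of-envelope check of the Llama 3.1 70B / 1T-token scaling numbers.
# Figures (115.4 days, 3.8 days, +2.6% cost) are from the article;
# absolute GPU counts are unknown, so everything is expressed as ratios.
days_base, days_scaled = 115.4, 3.8
cost_increase = 0.026                  # +2.6% total cost at scale

time_reduction = 1 - days_scaled / days_base   # ~97% less wall-clock time
speedup = days_base / days_scaled              # ~30x faster delivery

# If cost tracks GPU-hours, a 2.6% cost increase means the scaled run
# consumed 1.026x the baseline GPU-hours, i.e. ~97.5% scaling efficiency.
scaling_efficiency = 1 / (1 + cost_increase)

print(f"{time_reduction:.1%} less time, {speedup:.1f}x speedup, "
      f"{scaling_efficiency:.1%} scaling efficiency")
```

The asymmetry is the point: a roughly 30x reduction in delivery time costs only 2.6% more, because near-linear scaling keeps total GPU-hours almost constant as GPUs are added.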

Terminology

  • Queue-to-compute ratio: Metric for autoscaling inference infrastructure, used in related Triton/Kubernetes work.
  • NVIDIA Performance Architects: NVIDIA experts available to benchmark workloads on DGX Cloud infrastructure and recommend workload-specific tuning adjustments.
  • DGX Cloud Benchmarking LLM Collection: NGC catalog collection providing model-specific benchmarking recipes.

Connections to Existing Wiki Pages

Cross-Section Pages