Cross-section page — full summary at Measure and Improve AI Workload Performance with NVIDIA DGX Cloud Benchmarking.

NVIDIA Platform Angle

NVIDIA DGX Cloud Benchmarking is built around NVIDIA’s own platform stack — the DGX hardware family, the NeMo Framework, and Hopper/Blackwell architecture features such as Transformer Engine. This page covers the NVIDIA-specific platform components that drive the benchmarked performance gains.

DGX Cloud as Benchmarking Substrate

DGX Cloud Benchmarking establishes standardized performance baselines on NVIDIA DGX infrastructure. The benchmarking recipes in the NGC catalog are validated on DGX systems and the major cloud providers’ NVIDIA GPU instances (AWS, Google Cloud, Azure, Oracle Cloud) as well as NVIDIA cloud partners (CoreWeave, Crusoe, Nebius). This provides a common reference for comparing platform implementations.

NeMo Framework — Version-Driven Performance

The NeMo Framework is the primary training framework used in DGX Cloud Benchmarking. Crucially, upgrading NeMo versions alone (same hardware, same model) delivered a 25% increase in training throughput over the course of 2024, the result of deep hardware-software co-engineering. This makes NeMo version selection a first-class optimization lever alongside GPU count and precision.
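To make the version lever concrete, the gain from a NeMo upgrade can be quantified by comparing tokens/second on otherwise identical runs. The helper and the numbers below are illustrative only (they reproduce the 25% figure cited above, not measured data):

```python
# Hypothetical helper: quantify the throughput gain from a NeMo version
# upgrade on identical hardware and model configuration.
def version_gain(tokens_per_sec_old: float, tokens_per_sec_new: float) -> float:
    """Return the relative throughput improvement (e.g. 0.25 for +25%)."""
    return tokens_per_sec_new / tokens_per_sec_old - 1.0

# Illustrative numbers only: a run that goes from 8,000 to 10,000 tokens/s
# after a framework upgrade corresponds to the 25% gain described above.
print(f"{version_gain(8000.0, 10000.0):.0%}")
```

Tracking this ratio per NeMo release, with hardware and model held fixed, isolates the framework's contribution from scaling or precision changes.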

Hopper/Blackwell Architecture — Transformer Engine and FP8

NVIDIA Hopper and Blackwell architectures include the Transformer Engine, which enables per-layer FP8 computation. FP8 increases math throughput and reduces memory bandwidth requirements compared to BF16, but its narrower dynamic range requires per-tensor or sub-block scaling to maintain numerical stability. Transformer Engine handles this automatically on a per-layer basis.
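The effect of per-tensor scaling can be sketched in pure Python. This simulates only the scaling and clamping arithmetic, not real FP8 encoding or tensor-core math, and uses the E4M3 format's maximum finite value (448) as the range bound:

```python
E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def clamp(v: float) -> float:
    """Clamp a value to the FP8 E4M3 representable range."""
    return max(-E4M3_MAX, min(E4M3_MAX, v))

def quantize_unscaled(values):
    """Naive cast: values beyond the FP8 range saturate and lose information."""
    return [clamp(v) for v in values]

def quantize_scaled(values):
    """Per-tensor scaling: map the largest magnitude onto E4M3_MAX,
    clamp in the scaled domain, then rescale back (dequantize)."""
    amax = max(abs(v) for v in values) or 1.0
    scale = E4M3_MAX / amax
    return [clamp(v * scale) / scale for v in values]

acts = [0.5, -3.2, 750.0]            # 750 exceeds E4M3's native range
print(quantize_unscaled(acts))       # 750.0 saturates to 448.0
print(quantize_scaled(acts))         # scaling keeps every value recoverable
```

Transformer Engine performs this bookkeeping automatically, maintaining scale factors per layer so user code does not manage FP8 ranges by hand.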

Key result: FP8 training on Hopper produces higher tokens/second throughput than BF16, and models trained in FP8 can be deployed directly for FP8 inference without additional quantization steps.

DGX Cloud Benchmarking Recipes

NVIDIA publishes performance recipes in the NGC catalog (nvidia/teams/dgxc-benchmarking) covering:

  • FP8 vs BF16 precision tuning best practices
  • Optimal GPU counts for specific model families (e.g., Llama 3 70B across various cluster sizes)
  • NeMo framework configuration baselines

These recipes enable organizations to validate their own platform implementations against NVIDIA’s reference performance numbers without building benchmarks from scratch.
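A validation pass against a reference number can be as simple as a tolerance check. The function below is a minimal sketch; the field names and the 10% tolerance are illustrative assumptions, not part of the NGC recipes:

```python
# Hypothetical validation check: compare a cluster's measured throughput
# against a published reference value with a relative tolerance.
def within_reference(measured_tps: float, reference_tps: float,
                     tolerance: float = 0.10) -> bool:
    """True if measured tokens/s is within `tolerance` of the reference."""
    return measured_tps >= reference_tps * (1.0 - tolerance)

# Illustrative numbers: 95% of a published baseline passes a 10% tolerance;
# 85% does not.
print(within_reference(9500.0, 10000.0))   # passes
print(within_reference(8500.0, 10000.0))   # flags a regression
```

A check like this makes "validated against NVIDIA's reference numbers" an automatable gate rather than a manual comparison.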

Connections to NVIDIA Platform Stack