Performance Tuning Guide — Megatron-Bridge LLM Training
Cross-section page — Deployment and Scaling angle. See primary page for full summary, configuration settings, and parallelism selection guide.
Deployment and Scaling Angle
This guide is directly relevant to the deployment-and-scaling topic area: parallelism strategy and precision choices determine how effectively large models can be distributed across GPU clusters and, in turn, what training TCO looks like at scale.
MFU and TCO as Deployment Metrics
Model FLOPs Utilization (MFU) and Total Cost of Ownership (TCO) are the primary KPIs for large-scale LLM training deployments. The guide provides concrete trade-off data:
- FP8 vs. BF16: FP8 delivers a 1.2–1.5× training speedup over BF16 with proportional cost savings; models trained in FP8 can also be deployed directly for FP8 inference, compounding the savings across training and serving
- GPU-count scaling: near-linear scaling is achievable with well-tuned parallelism configurations, so strategic GPU-count increases cut time-to-train at minimal extra cost (Llama 3 70B: 115.4 days → 3.8 days, a 97% time reduction for a 2.6% cost increase; see the worked arithmetic after this list)
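To make the MFU arithmetic concrete, here is a minimal sketch using the standard ~6·N FLOPs-per-token approximation for a dense transformer; the throughput and peak-FLOPS figures are illustrative assumptions, not numbers from the guide. The second half checks that the Llama 3 70B data point above is internally consistent with near-linear scaling.

```python
def mfu(params: float, tokens_per_sec: float, num_gpus: int,
        peak_flops_per_gpu: float) -> float:
    """Model FLOPs Utilization: achieved model FLOPs per second divided
    by aggregate peak hardware FLOPs per second. Uses the common
    ~6 * params FLOPs-per-token approximation for a dense transformer
    (forward + backward pass)."""
    achieved_flops_per_sec = 6.0 * params * tokens_per_sec
    peak_flops_per_sec = peak_flops_per_gpu * num_gpus
    return achieved_flops_per_sec / peak_flops_per_sec

# Illustrative inputs only (not figures from the guide): a 70B dense
# model on 1,024 GPUs at 989 TFLOPS dense BF16 peak per GPU
# (H100-class), with an assumed cluster throughput of 1.0M tokens/s.
print(f"MFU ~ {mfu(70e9, 1.0e6, 1024, 989e12):.1%}")  # ~41.5%

# Consistency check on the Llama 3 70B data point: 115.4 -> 3.8 days is
# a ~30.4x speedup; cost = GPU count x days, so a 2.6% cost increase
# implies the GPU count grew ~31x at near-linear scaling efficiency.
speedup = 115.4 / 3.8
gpu_count_ratio = speedup * 1.026
print(f"speedup {speedup:.1f}x -> implied GPU-count ratio ~{gpu_count_ratio:.0f}x")
```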
Parallelism as a Scaling Architecture
The parallelism hierarchy is essentially a scaling architecture for LLM deployment:
- Data Parallelism: the baseline scale-out strategy — add GPUs proportionally
- Tensor + Context Parallelism: scale within a node boundary; keep these groups inside the NVLink domain to avoid inter-node communication bottlenecks
- Pipeline Parallelism: scale across nodes; accept pipeline-bubble overhead, and use virtual pipeline parallelism (VPP) to reduce the bubbles
- Expert Parallelism: scale MoE models across GPUs without a proportional increase in communication cost
Choosing the wrong combination results in communication-bottlenecked training where adding more GPUs degrades MFU rather than improving it.
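As a sketch of how these degrees compose, the following shows a parallelism layout whose factors must multiply to the total GPU count, plus a sanity check that the communication-heavy TP × CP group stays inside one NVLink domain. The class and field names are hypothetical illustrations, not Megatron-Bridge API.

```python
from dataclasses import dataclass

@dataclass
class ParallelLayout:
    # Hypothetical names; they mirror common Megatron-style options.
    tensor_parallel: int = 1     # TP: intra-layer sharding
    context_parallel: int = 1    # CP: sequence-dimension sharding
    pipeline_parallel: int = 1   # PP: layer stages across nodes
    data_parallel: int = 1       # DP: replica scale-out

    def world_size(self) -> int:
        # The four degrees multiply to the total GPU count.
        return (self.tensor_parallel * self.context_parallel
                * self.pipeline_parallel * self.data_parallel)

    def validate(self, num_gpus: int, gpus_per_nvlink_domain: int = 8) -> None:
        if self.world_size() != num_gpus:
            raise ValueError(
                f"layout needs {self.world_size()} GPUs, have {num_gpus}")
        # TP and CP are communication-heavy: keep them inside one
        # NVLink domain or expect inter-node all-reduce bottlenecks.
        if self.tensor_parallel * self.context_parallel > gpus_per_nvlink_domain:
            raise ValueError("TP x CP exceeds the NVLink domain")

layout = ParallelLayout(tensor_parallel=4, context_parallel=2,
                        pipeline_parallel=4, data_parallel=32)
layout.validate(num_gpus=1024)  # 4 * 2 * 4 * 32 = 1024; TP*CP = 8 fits a node
print(f"valid layout for {layout.world_size()} GPUs")
```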
NVL72 Considerations
For GB200 NVL72 systems, the guide provides explicit topology-aware guidance: the TP × DP product must divide evenly into the 72-GPU NVL72 domain to avoid cross-domain communication bottlenecks. This is a concrete example of hardware-aware deployment configuration.
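Taking that rule at face value, a pre-flight check might look like the sketch below. The function name and the interpretation of "divides evenly" as 72 % (TP × DP) == 0 are assumptions for illustration, not Megatron-Bridge API.

```python
def check_nvl72_layout(tp: int, dp: int, domain_size: int = 72) -> None:
    """Topology check for GB200 NVL72 under the guide's stated rule that
    the TP x DP product must divide evenly into the NVL72 domain.
    (Hypothetical helper; the divisibility interpretation is assumed.)"""
    group = tp * dp
    if domain_size % group != 0:
        raise ValueError(
            f"TP*DP = {group} does not evenly tile the {domain_size}-GPU "
            "NVLink domain; parallel groups would straddle domain boundaries")

for tp, dp in [(4, 18), (8, 8)]:
    try:
        check_nvl72_layout(tp, dp)
        print(f"TP={tp}, DP={dp}: OK")        # 4 * 18 = 72 tiles the domain
    except ValueError as err:
        print(f"TP={tp}, DP={dp}: {err}")     # 72 % 64 != 0 -> rejected
```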
Connections
- DGX Cloud Benchmarking — benchmarking suite that quantifies the MFU and TCO improvements described in this guide
- Performance Analysis — TensorRT-LLM — inference-side counterpart; FP8 models trained with this guide’s settings can be deployed directly into the FP8 inference paths profiled there