Performance Tuning Guide — Megatron-Bridge LLM Training
Cross-section page — Deployment and Scaling angle. See primary page for full summary, configuration settings, and parallelism selection guide.
Deployment and Scaling Angle
This guide is directly relevant to the deployment-and-scaling topic area: parallelism strategy and precision choices determine how effectively large models can be distributed across GPU clusters and, in turn, what training TCO looks like at scale.
MFU and TCO as Deployment Metrics
Model FLOPs Utilization (MFU) and Total Cost of Ownership (TCO) are the primary KPIs for large-scale LLM training deployments. The guide provides concrete trade-off data:
- FP8 vs. BF16: FP8 delivers a 1.2–1.5× training speedup over BF16 with proportional cost savings; models trained in FP8 can also be deployed directly for FP8 inference, compounding the savings across training and serving
- GPU-count scaling: near-linear scaling is achievable with well-tuned parallelism configurations, so strategic GPU-count increases cut time-to-train at minimal extra cost (Llama 3 70B: 115.4 days → 3.8 days, a 97% time reduction for a 2.6% cost increase; see the worked arithmetic after this list)
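To make the MFU arithmetic concrete, here is a minimal sketch using the standard ~6·N FLOPs-per-token approximation for a dense transformer; the throughput and peak-FLOPS figures are illustrative assumptions, not numbers from the guide. The second half checks that the Llama 3 70B data point above is internally consistent with near-linear scaling.

```python
def mfu(params: float, tokens_per_sec: float, num_gpus: int,
        peak_flops_per_gpu: float) -> float:
    """Model FLOPs Utilization: achieved model FLOPs per second divided
    by aggregate peak hardware FLOPs per second. Uses the common
    ~6 * params FLOPs-per-token approximation for a dense transformer
    (forward + backward pass)."""
    achieved_flops_per_sec = 6.0 * params * tokens_per_sec
    peak_flops_per_sec = peak_flops_per_gpu * num_gpus
    return achieved_flops_per_sec / peak_flops_per_sec

# Illustrative inputs only (not figures from the guide): a 70B dense
# model on 1,024 GPUs at 989 TFLOPS dense BF16 peak per GPU
# (H100-class), with an assumed cluster throughput of 1.0M tokens/s.
print(f"MFU ~ {mfu(70e9, 1.0e6, 1024, 989e12):.1%}")  # ~41.5%

# Consistency check on the Llama 3 70B data point: 115.4 -> 3.8 days is
# a ~30.4x speedup; cost = GPU count x days, so a 2.6% cost increase
# implies the GPU count grew ~31x at near-linear scaling efficiency.
speedup = 115.4 / 3.8
gpu_count_ratio = speedup * 1.026
print(f"speedup {speedup:.1f}x -> implied GPU-count ratio ~{gpu_count_ratio:.0f}x")
```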
Parallelism as a Scaling Architecture
The parallelism hierarchy is essentially a scaling architecture for LLM deployment:
- Data Parallelism: the baseline scale-out strategy — add GPUs proportionally
- Tensor + Context Parallelism: scale within a node boundary; keep these groups inside the NVLink domain to avoid inter-node communication bottlenecks
- Pipeline Parallelism: scale across nodes; accept pipeline-bubble overhead, and use virtual pipeline parallelism (VPP) to reduce the bubbles
- Expert Parallelism: scale MoE models across GPUs without a proportional increase in communication cost
Choosing the wrong combination results in communication-bottlenecked training where adding more GPUs degrades MFU rather than improving it.
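As a sketch of how these degrees compose, the following shows a parallelism layout whose factors must multiply to the total GPU count, plus a sanity check that the communication-heavy TP × CP group stays inside one NVLink domain. The class and field names are hypothetical illustrations, not Megatron-Bridge API.

```python
from dataclasses import dataclass

@dataclass
class ParallelLayout:
    # Hypothetical names; they mirror common Megatron-style options.
    tensor_parallel: int = 1     # TP: intra-layer sharding
    context_parallel: int = 1    # CP: sequence-dimension sharding
    pipeline_parallel: int = 1   # PP: layer stages across nodes
    data_parallel: int = 1       # DP: replica scale-out

    def world_size(self) -> int:
        # The four degrees multiply to the total GPU count.
        return (self.tensor_parallel * self.context_parallel
                * self.pipeline_parallel * self.data_parallel)

    def validate(self, num_gpus: int, gpus_per_nvlink_domain: int = 8) -> None:
        if self.world_size() != num_gpus:
            raise ValueError(
                f"layout needs {self.world_size()} GPUs, have {num_gpus}")
        # TP and CP are communication-heavy: keep them inside one
        # NVLink domain or expect inter-node all-reduce bottlenecks.
        if self.tensor_parallel * self.context_parallel > gpus_per_nvlink_domain:
            raise ValueError("TP x CP exceeds the NVLink domain")

layout = ParallelLayout(tensor_parallel=4, context_parallel=2,
                        pipeline_parallel=4, data_parallel=32)
layout.validate(num_gpus=1024)  # 4 * 2 * 4 * 32 = 1024; TP*CP = 8 fits a node
print(f"valid layout for {layout.world_size()} GPUs")
```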
NVL72 Considerations
For GB200 NVL72 systems, the guide provides explicit topology-aware guidance: the TP × DP product must divide evenly into the 72-GPU NVL72 domain to avoid cross-domain communication bottlenecks. This is a concrete example of hardware-aware deployment configuration.
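Taking that rule at face value, a pre-flight check might look like the sketch below. The function name and the interpretation of "divides evenly" as 72 % (TP × DP) == 0 are assumptions for illustration, not Megatron-Bridge API.

```python
def check_nvl72_layout(tp: int, dp: int, domain_size: int = 72) -> None:
    """Topology check for GB200 NVL72 under the guide's stated rule that
    the TP x DP product must divide evenly into the NVL72 domain.
    (Hypothetical helper; the divisibility interpretation is assumed.)"""
    group = tp * dp
    if domain_size % group != 0:
        raise ValueError(
            f"TP*DP = {group} does not evenly tile the {domain_size}-GPU "
            "NVLink domain; parallel groups would straddle domain boundaries")

for tp, dp in [(4, 18), (8, 8)]:
    try:
        check_nvl72_layout(tp, dp)
        print(f"TP={tp}, DP={dp}: OK")        # 4 * 18 = 72 tiles the domain
    except ValueError as err:
        print(f"TP={tp}, DP={dp}: {err}")     # 72 % 64 != 0 -> rejected
```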
Connections
- DGX Cloud Benchmarking — benchmarking suite that quantifies the MFU and TCO improvements described in this guide
- Performance Analysis — TensorRT-LLM — inference-side counterpart; FP8 models trained with this guide’s settings can be deployed directly into the FP8 inference paths profiled there