Performance Tuning Guide — Megatron-Bridge LLM Training
NVIDIA NeMo Megatron-Bridge documentation — docs.nvidia.com/nemo/megatron-bridge
Abstract
This guide covers performance tuning techniques for large language model training on NVIDIA GPUs using Megatron-Bridge, the NeMo framework’s integration with Megatron-Core. It focuses on achieving high Model FLOPS Utilisation (MFU) and favourable Total Cost of Ownership (TCO) by selecting the right combination of precision formats, parallelism strategies, and advanced optimisations. The guide treats parallelism as a hierarchy of choices: start with data parallelism, escalate to tensor and context parallelism when memory is constrained, add pipeline parallelism when the model cannot fit otherwise, and apply expert parallelism for MoE models. Performance is shaped by GPU count, model architecture, hidden size, sequence length, and micro-batch size — all of which interact with the available parallelism and precision options.
Key Concepts
- MFU (Model FLOPS Utilisation): ratio of observed compute throughput to theoretical peak FLOPS; the primary metric for training efficiency (see the worked sketch after this list)
- FP8 training: 8-bit floating-point precision for linear layers within Transformer blocks, delivering 1.2–1.5× speedup over BF16 by leveraging Hopper/Blackwell Transformer Engine; speedup is lower for small hidden sizes where linear layers are a smaller fraction of total compute
- Distributed Optimizer: shards master parameters and optimiser states across data-parallel ranks, reducing memory without increasing communication overhead vs. standard DDP
- Tensor Parallelism (TP): shards weight tensors across GPUs within a node (NVLink domain); increases memory capacity but adds all-reduce communication overhead per forward/backward pass
- Context Parallelism (CP): shards activations along the sequence dimension; effective when sequence length >> hidden size; generates lower communication volume than TP for the same tensor sizes
- Pipeline Parallelism (PP): assigns different Transformer layers to different pipeline stages; necessary when TP alone cannot fit the model; Virtual Pipeline Parallelism (VPP) reduces pipeline bubbles
- Expert Parallelism (EP): distributes MoE expert weights across GPUs; designed to remain within NVLink domain; combined with Expert Tensor Parallelism (ETP) for sparse MLP layers
- FSDP (Fully Sharded Data Parallelism): PyTorch-native alternative to TP+PP+DP; preferred for small models with large sequences and for GB200 systems where activation offload to host via chip-to-chip interconnect is available
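To make the MFU definition above concrete, here is a minimal sketch, assuming the common approximation of ~6 FLOPs per parameter per token for dense decoder-only training (forward plus backward); the function name and example numbers are illustrative, not a Megatron-Bridge API.

```python
# Illustrative MFU calculation, not a Megatron-Bridge API.
# Uses the common ~6 * n_params FLOPs-per-token approximation for dense
# decoder-only training (forward + backward); attention FLOPs, which add
# a sequence-length-dependent term, are ignored here.

def model_flops_utilisation(
    tokens_per_second: float,   # measured end-to-end training throughput
    n_params: float,            # trainable parameter count
    num_gpus: int,
    peak_flops_per_gpu: float,  # theoretical peak for the precision in use
) -> float:
    achieved_flops = 6.0 * n_params * tokens_per_second
    peak_flops = num_gpus * peak_flops_per_gpu
    return achieved_flops / peak_flops

# Hypothetical 70B-parameter run on 256 GPUs at 300k tokens/s:
mfu = model_flops_utilisation(
    tokens_per_second=3.0e5,
    n_params=70e9,
    num_gpus=256,
    peak_flops_per_gpu=989e12,  # dense BF16 peak of an H100 SXM
)
print(f"MFU: {mfu:.1%}")  # ~49.8%
```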
Parallelism Selection Guide
| Scenario | Recommended strategy |
|---|---|
| Model fits in GPU memory | Data Parallelism (DP) only — minimal communication overhead |
| Model exceeds GPU memory | Add Tensor Parallelism (TP), confined to NVLink domain |
| Sequence length >> hidden size | Add Context Parallelism (CP) to shard activations |
| TP insufficient for model fit | Add Pipeline Parallelism (PP) + Virtual PP |
| MoE model | Expert Parallelism (EP) within NVLink domain |
| Small model, large sequence | FSDP (hides AllGather under compute) |
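The table's escalation order can be expressed as a small validation helper. The sketch below is hypothetical (not a Megatron-Bridge function): it derives the data-parallel size left over after fixing TP, CP, and PP, and checks that TP stays inside a single NVLink domain, assuming 8 GPUs per domain as on a DGX H100 node.

```python
# Hypothetical helper, not part of Megatron-Bridge: derive the
# data-parallel size implied by a choice of TP/CP/PP degrees and check
# that TP is confined to one NVLink domain, per the guidance above.

def derive_data_parallel_size(
    world_size: int,
    tp: int = 1,   # tensor parallel degree
    cp: int = 1,   # context parallel degree
    pp: int = 1,   # pipeline parallel degree
    nvlink_domain_size: int = 8,  # GPUs per NVLink domain (e.g. DGX H100)
) -> int:
    model_parallel = tp * cp * pp
    if world_size % model_parallel != 0:
        raise ValueError("world_size must be divisible by tp * cp * pp")
    if tp > nvlink_domain_size:
        raise ValueError("TP should be confined to a single NVLink domain")
    return world_size // model_parallel

# Example: 512 GPUs with TP=2, CP=2, PP=4 leaves DP=32.
dp = derive_data_parallel_size(512, tp=2, cp=2, pp=4)
print(dp)  # 32
```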
Key Configuration Settings
| Setting | Purpose |
|---|---|
| `TransformerConfig.tensor_model_parallel_size` | TP degree |
| `TransformerConfig.context_parallel_size` | CP degree |
| `TransformerConfig.pipeline_model_parallel_size` | PP degree |
| `TransformerConfig.virtual_pipeline_model_parallel_size` | VPP degree (reduces pipeline bubbles) |
| `TransformerConfig.expert_model_parallel_size` | EP degree for MoE |
| `OptimizerConfig.use_distributed_optimizer` | Enable distributed optimiser (default: true) |
| `DistributedInitConfig.use_torch_fsdp2` | Enable PyTorch FSDP2 |
| `MixedPrecisionConfig.fp8_param` | Enable FP8 parameter storage |
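A minimal configuration sketch using the field names from the table. The `TransformerConfig` and `OptimizerConfig` import paths follow Megatron-Core conventions, but the exact paths, the surrounding model-shape arguments, and the chosen degrees are assumptions rather than verbatim Megatron-Bridge code; `MixedPrecisionConfig.fp8_param` and `DistributedInitConfig.use_torch_fsdp2` would be set analogously on their respective config objects.

```python
# Sketch of the settings from the table above; import paths and model
# arguments are illustrative assumptions, not verbatim Megatron-Bridge code.
from megatron.core.transformer import TransformerConfig
from megatron.core.optimizer import OptimizerConfig

transformer_config = TransformerConfig(
    # Model shape (illustrative values for a ~70B dense model):
    num_layers=80,
    hidden_size=8192,
    num_attention_heads=64,
    # Parallelism degrees from the table:
    tensor_model_parallel_size=2,
    context_parallel_size=2,
    pipeline_model_parallel_size=4,
    virtual_pipeline_model_parallel_size=5,  # 80 layers / (4 PP * 5 VPP) = 4 layers per chunk
    expert_model_parallel_size=1,            # >1 only for MoE models
)

optimizer_config = OptimizerConfig(
    use_distributed_optimizer=True,  # shard optimiser state across DP ranks
)
```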
Key Claims and Findings
- FP8 is limited to linear layers in the Transformer block; actual speedup depends on what fraction of total step time these layers consume — small models see less benefit
- TP size should be confined to the NVLink domain (intra-node); using TP=8 for Llama 3 70B causes low GPU utilisation
- Combining TP=2 + CP=2 can outperform TP=4 when sequence length > hidden size
- For NVL72 systems: ensure the TP × DP product divides evenly into the 72-GPU NVLink domain to avoid cross-domain communication bottlenecks
- VPP reduces pipeline bubbles but increases inter-stage communication; a smaller VPP size is better when global batches contain many micro-batches, as the bubble-fraction formula after this list makes precise
- Asymmetric pipeline stage allocation (`account_for_embedding_in_pipeline_split`) improves load balance for models with large vocabularies
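The trade-off in the VPP claim above can be made precise with the standard bubble-fraction analysis from the Megatron-LM interleaved-schedule work; the formula below is that general result, not something specific to Megatron-Bridge.

```latex
% Pipeline bubble fraction under a 1F1B schedule with interleaving:
%   p = pipeline_model_parallel_size (number of stages)
%   m = number of micro-batches per global batch per pipeline
%   v = virtual pipeline (interleaving) factor, VPP
\[
  \text{bubble fraction} = \frac{1}{v} \cdot \frac{p - 1}{m}
\]
% Example: with p = 8 and m = 32, the bubble is 7/32 (about 22%) at
% v = 1 and about 5.5% at v = 4. With m = 128, the v = 1 bubble is
% already about 5.5%, so raising v further mainly adds inter-stage
% communication -- hence a smaller VPP size for many-micro-batch batches.
```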
Terminology
- Pipeline bubble: idle time in pipeline parallelism during warm-up and flush phases where some stages have no work
- NVL72: NVIDIA GB200 NVL72 system (72 GPUs connected via NVLink); high-bandwidth domain requiring careful TP/DP alignment
- MXFP8: microscaled FP8 with per-block quantisation (1×32 block size); less efficient than full tensor-wise FP8 due to higher quantisation overhead (see the block-scaling sketch after this list)
- Host performance-bound: training bottlenecked by CPU/host operations rather than GPU compute; occurs with many small GPU kernels
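To illustrate the quantisation-overhead point in the MXFP8 entry, here is a toy NumPy sketch (emphatically not Transformer Engine's implementation) contrasting one tensor-wide scale with per-1×32-block scales; real MX scales are additionally restricted to power-of-two values, which this sketch omits.

```python
# Toy contrast between tensor-wise and 1x32 block-wise (MXFP8-style)
# scale computation; illustrative only, not Transformer Engine code.
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value in the e4m3 format

x = np.random.randn(4096, 4096).astype(np.float32)

# Tensor-wise FP8: a single scale factor for the whole tensor.
tensor_scale = FP8_E4M3_MAX / np.abs(x).max()

# MXFP8-style: one scale per 1x32 block along the last dimension,
# i.e. 4096 * (4096 / 32) = 524,288 scales to compute, store, and
# apply instead of one -- the quantisation overhead noted above.
# (Real MX scales are power-of-two e8m0 values; omitted here.)
blocks = x.reshape(x.shape[0], -1, 32)
block_scales = FP8_E4M3_MAX / np.abs(blocks).max(axis=-1, keepdims=True)
print(block_scales.size)  # 524288
```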
Connections to Existing Wiki Pages
- Performance Tuning Guide (Deployment and Scaling angle) — cross-section covering MFU, TCO, and scale-out considerations
- DGX Cloud Benchmarking — benchmarking suite that measures the framework-level performance improvements this guide describes
- Performance Analysis — TensorRT-LLM — inference-side counterpart covering Nsight profiling for deployed models