Performance Tuning Guide — Megatron-Bridge LLM Training

NVIDIA NeMo Megatron-Bridge documentation — docs.nvidia.com/nemo/megatron-bridge

Abstract

This guide covers performance tuning techniques for large language model training on NVIDIA GPUs using Megatron-Bridge, the NeMo framework’s integration with Megatron-Core. It focuses on achieving high Model FLOPS Utilisation (MFU) and favourable Total Cost of Ownership (TCO) by selecting the right combination of precision formats, parallelism strategies, and advanced optimisations. The guide treats parallelism as a hierarchy of choices: start with data parallelism, escalate to tensor and context parallelism when memory is constrained, add pipeline parallelism when the model cannot fit otherwise, and apply expert parallelism for MoE models. Performance is shaped by GPU count, model architecture, hidden size, sequence length, and micro-batch size — all of which interact with the available parallelism and precision options.

Key Concepts

  • MFU (Model FLOPS Utilisation): ratio of observed compute throughput to theoretical peak FLOPS; the primary metric for training efficiency
  • FP8 training: 8-bit floating-point precision for linear layers within Transformer blocks, delivering 1.2–1.5× speedup over BF16 via the Transformer Engine on Hopper/Blackwell GPUs; speedup is lower for small hidden sizes, where linear layers are a smaller fraction of total compute
  • Distributed Optimizer: shards master parameters and optimiser states across data-parallel ranks, reducing memory without increasing communication overhead vs. standard DDP
  • Tensor Parallelism (TP): shards weight tensors across GPUs within a node (NVLink domain); increases memory capacity but adds all-reduce communication overhead per forward/backward pass
  • Context Parallelism (CP): shards activations along the sequence dimension; effective when sequence length >> hidden size; generates lower communication volume than TP for the same tensor sizes
  • Pipeline Parallelism (PP): assigns different Transformer layers to different pipeline stages; necessary when TP alone cannot fit the model; Virtual Pipeline Parallelism (VPP) reduces pipeline bubbles
  • Expert Parallelism (EP): distributes MoE expert weights across GPUs; designed to remain within the NVLink domain; combined with Expert Tensor Parallelism (ETP) for sparse MLP layers
  • FSDP (Fully Sharded Data Parallelism): PyTorch-native alternative to TP+PP+DP; preferred for small models with large sequences and for GB200 systems where activation offload to host via chip-to-chip interconnect is available
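These parallelism degrees multiply together to tile the full GPU grid: world_size = TP × CP × PP × DP, with the data-parallel degree being whatever remains. A minimal arithmetic sketch (illustrative only; the real bookkeeping lives inside Megatron-Core's process-group initialisation):

```python
def data_parallel_size(world_size, tp, cp, pp):
    """Derive the data-parallel degree from the other parallelism degrees.

    Megatron-style parallelism partitions the GPU grid so that
    world_size = TP * CP * PP * DP; DP is whatever remains after the
    model-parallel degrees are fixed.
    """
    model_parallel = tp * cp * pp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} not divisible by TP*CP*PP={model_parallel}"
        )
    return world_size // model_parallel

# Example: 64 GPUs with TP=4, CP=2, PP=2 leaves DP=4.
print(data_parallel_size(64, tp=4, cp=2, pp=2))  # -> 4
```

Every extra model-parallel degree eats into DP, which is the cheapest form of scaling; this is why the selection guide below treats TP/CP/PP as escalations rather than defaults.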

Parallelism Selection Guide

  • Model fits in GPU memory: Data Parallelism (DP) only — minimal communication overhead
  • Model exceeds GPU memory: add Tensor Parallelism (TP), confined to the NVLink domain
  • Sequence length >> hidden size: add Context Parallelism (CP) to shard activations
  • TP insufficient for model fit: add Pipeline Parallelism (PP) + Virtual PP
  • MoE model: Expert Parallelism (EP) within the NVLink domain
  • Small model, large sequence: FSDP (hides AllGather under compute)
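The escalation order above can be expressed as a small decision helper. This is purely a reading aid mirroring the table, not a Megatron-Bridge API; the thresholds and flags are illustrative assumptions:

```python
def suggest_parallelism(fits_in_memory, seq_len, hidden_size,
                        tp_sufficient=True, is_moe=False):
    """Illustrative helper mirroring the selection guide above.

    Returns the parallelism strategies to layer on, in escalation
    order. Not a real API -- a sketch of the decision procedure.
    """
    if fits_in_memory and not is_moe:
        return ["DP"]                          # cheapest: pure data parallelism
    strategies = ["DP", "TP (within NVLink domain)"]
    if seq_len > hidden_size:
        strategies.append("CP")                # shard activations along sequence
    if not tp_sufficient:
        strategies.append("PP + VPP")          # split layers across stages
    if is_moe:
        strategies.append("EP (within NVLink domain)")
    return strategies

print(suggest_parallelism(False, seq_len=32768, hidden_size=8192))
```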

Key Configuration Settings

  • TransformerConfig.tensor_model_parallel_size: TP degree
  • TransformerConfig.context_parallel_size: CP degree
  • TransformerConfig.pipeline_model_parallel_size: PP degree
  • TransformerConfig.virtual_pipeline_model_parallel_size: VPP degree (reduces pipeline bubbles)
  • TransformerConfig.expert_model_parallel_size: EP degree for MoE
  • OptimizerConfig.use_distributed_optimizer: enable the distributed optimiser (default: true)
  • DistributedInitConfig.use_torch_fsdp2: enable PyTorch FSDP2
  • MixedPrecisionConfig.fp8_param: enable FP8 parameter storage
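A hedged sketch of how these fields compose. The field names come from the table above; the stand-in dataclass below only mimics the relevant slice of Megatron-Core's `TransformerConfig`, whose real constructor has many more parameters and may differ between versions:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TransformerConfigSketch:
    """Stand-in for the parallelism slice of TransformerConfig (illustrative)."""
    tensor_model_parallel_size: int = 1
    context_parallel_size: int = 1
    pipeline_model_parallel_size: int = 1
    virtual_pipeline_model_parallel_size: Optional[int] = None
    expert_model_parallel_size: int = 1

cfg = TransformerConfigSketch(
    tensor_model_parallel_size=4,            # TP within the NVLink domain
    context_parallel_size=2,                 # shard long-sequence activations
    pipeline_model_parallel_size=2,          # layers split across 2 stages
    virtual_pipeline_model_parallel_size=4,  # interleave to shrink bubbles
)
print(cfg)
```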

Key Claims and Findings

  • FP8 is limited to linear layers in the Transformer block; actual speedup depends on what fraction of total step time these layers consume — small models see less benefit
  • TP size should be confined to the NVLink domain (intra-node); using TP=8 for Llama 3 70B causes low GPU utilisation
  • Combining TP=2 + CP=2 can outperform TP=4 when sequence length > hidden size
  • For NVL72 systems: ensure TP × DP product divides evenly into the NVL72 configuration to avoid cross-domain communication bottlenecks
  • VPP reduces pipeline bubbles but increases inter-stage communication; a smaller VPP size is better when global batches contain many micro-batches
  • Asymmetric pipeline stage allocation (account_for_embedding_in_pipeline_split) improves load balance for models with large vocabularies
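The NVL72 alignment claim can be checked mechanically. A hedged sketch, where the 72-GPU domain size is the only figure taken from the source and the function name is illustrative:

```python
def crosses_nvlink_domain(tp, dp, domain_size=72):
    """Check whether TP*DP groups tile the NVLink domain evenly.

    If TP * DP does not divide the domain size, some parallel groups
    straddle a domain boundary and their collectives fall back to the
    slower inter-domain fabric -- the bottleneck the claim above warns
    about.
    """
    return domain_size % (tp * dp) != 0

print(crosses_nvlink_domain(tp=4, dp=9))  # 36 divides 72 -> False (aligned)
print(crosses_nvlink_domain(tp=8, dp=4))  # 32 does not divide 72 -> True
```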

Terminology

  • Pipeline bubble: idle time in pipeline parallelism during warm-up and flush phases where some stages have no work
  • NVL72: NVIDIA GB200 NVL72 system (72 GPUs connected via NVLink); high-bandwidth domain requiring careful TP/DP alignment
  • MXFP8: microscaled FP8 with per-block quantisation (1×32 block size); less efficient than full tensor-wise FP8 due to higher quantisation overhead
  • Host performance-bound: training bottlenecked by CPU/host operations rather than GPU compute; occurs with many small GPU kernels
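The pipeline-bubble/VPP trade-off noted earlier can be quantified with the standard interleaved-1F1B estimate from the Megatron literature, bubble/ideal = (p − 1) / (v · m) for p pipeline stages, m micro-batches, and VPP degree v. A small sketch:

```python
def bubble_fraction(pp, num_microbatches, vpp=1):
    """Pipeline bubble as a fraction of ideal compute time.

    Standard interleaved-1F1B estimate: (pp - 1) / (vpp * m).
    More micro-batches or a higher VPP degree shrink the bubble,
    which is why a small VPP degree suffices when global batches
    already contain many micro-batches.
    """
    return (pp - 1) / (vpp * num_microbatches)

print(bubble_fraction(pp=8, num_microbatches=16))         # -> 0.4375
print(bubble_fraction(pp=8, num_microbatches=16, vpp=4))  # -> 0.109375
```

The second call illustrates the claim above: VPP=4 cuts the bubble fourfold, but each interleaved chunk adds inter-stage communication, so the gain must outweigh that cost.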

Connections to Existing Wiki Pages