Performance Tuning Guide — Megatron-Bridge LLM Training
NVIDIA NeMo Megatron-Bridge documentation — docs.nvidia.com/nemo/megatron-bridge
Abstract
This guide covers performance tuning techniques for large language model training on NVIDIA GPUs using Megatron-Bridge, the NeMo framework’s integration with Megatron-Core. It focuses on achieving high Model FLOPS Utilisation (MFU) and favourable Total Cost of Ownership (TCO) by selecting the right combination of precision formats, parallelism strategies, and advanced optimisations. The guide treats parallelism as a hierarchy of choices: start with data parallelism, escalate to tensor and context parallelism when memory is constrained, add pipeline parallelism when the model cannot fit otherwise, and apply expert parallelism for MoE models. Performance is shaped by GPU count, model architecture, hidden size, sequence length, and micro-batch size — all of which interact with the available parallelism and precision options.
Key Concepts
- MFU (Model FLOPS Utilisation): ratio of observed compute throughput to theoretical peak FLOPS; the primary metric for training efficiency (see the worked sketch after this list)
- FP8 training: 8-bit floating-point precision for linear layers within Transformer blocks, delivering 1.2–1.5× speedup over BF16 by leveraging Hopper/Blackwell Transformer Engine; speedup is lower for small hidden sizes where linear layers are a smaller fraction of total compute
- Distributed Optimizer: shards master parameters and optimiser states across data-parallel ranks, reducing memory without increasing communication overhead vs. standard DDP
- Tensor Parallelism (TP): shards weight tensors across GPUs within a node (NVLink domain); increases memory capacity but adds all-reduce communication overhead per forward/backward pass
- Context Parallelism (CP): shards activations along the sequence dimension; effective when sequence length >> hidden size; generates lower communication volume than TP for the same tensor sizes
- Pipeline Parallelism (PP): assigns different Transformer layers to different pipeline stages; necessary when TP alone cannot fit the model; Virtual Pipeline Parallelism (VPP) reduces pipeline bubbles
- Expert Parallelism (EP): distributes MoE expert weights across GPUs; designed to remain within NVLink domain; combined with Expert Tensor Parallelism (ETP) for sparse MLP layers
- FSDP (Fully Sharded Data Parallelism): PyTorch-native alternative to TP+PP+DP; preferred for small models with large sequences and for GB200 systems where activation offload to host via chip-to-chip interconnect is available
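To make the MFU definition above concrete, here is a minimal sketch, assuming the common approximation of ~6 FLOPs per parameter per token for dense decoder-only training (forward plus backward); the function name and example numbers are illustrative, not a Megatron-Bridge API.

```python
# Illustrative MFU calculation, not a Megatron-Bridge API.
# Uses the common ~6 * n_params FLOPs-per-token approximation for dense
# decoder-only training (forward + backward); attention FLOPs, which add
# a sequence-length-dependent term, are ignored here.

def model_flops_utilisation(
    tokens_per_second: float,   # measured end-to-end training throughput
    n_params: float,            # trainable parameter count
    num_gpus: int,
    peak_flops_per_gpu: float,  # theoretical peak for the precision in use
) -> float:
    achieved_flops = 6.0 * n_params * tokens_per_second
    peak_flops = num_gpus * peak_flops_per_gpu
    return achieved_flops / peak_flops

# Hypothetical 70B-parameter run on 256 GPUs at 300k tokens/s:
mfu = model_flops_utilisation(
    tokens_per_second=3.0e5,
    n_params=70e9,
    num_gpus=256,
    peak_flops_per_gpu=989e12,  # dense BF16 peak of an H100 SXM
)
print(f"MFU: {mfu:.1%}")  # ~49.8%
```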
Parallelism Selection Guide
| Scenario | Recommended strategy |
|---|---|
| Model fits in GPU memory | Data Parallelism (DP) only — minimal communication overhead |
| Model exceeds GPU memory | Add Tensor Parallelism (TP), confined to NVLink domain |
| Sequence length >> hidden size | Add Context Parallelism (CP) to shard activations |
| TP insufficient for model fit | Add Pipeline Parallelism (PP) + Virtual PP |
| MoE model | Expert Parallelism (EP) within NVLink domain |
| Small model, large sequence | FSDP (hides AllGather under compute) |
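The table's escalation order can be expressed as a small validation helper. The sketch below is hypothetical (not a Megatron-Bridge function): it derives the data-parallel size left over after fixing TP, CP, and PP, and checks that TP stays inside a single NVLink domain, assuming 8 GPUs per domain as on a DGX H100 node.

```python
# Hypothetical helper, not part of Megatron-Bridge: derive the
# data-parallel size implied by a choice of TP/CP/PP degrees and check
# that TP is confined to one NVLink domain, per the guidance above.

def derive_data_parallel_size(
    world_size: int,
    tp: int = 1,   # tensor parallel degree
    cp: int = 1,   # context parallel degree
    pp: int = 1,   # pipeline parallel degree
    nvlink_domain_size: int = 8,  # GPUs per NVLink domain (e.g. DGX H100)
) -> int:
    model_parallel = tp * cp * pp
    if world_size % model_parallel != 0:
        raise ValueError("world_size must be divisible by tp * cp * pp")
    if tp > nvlink_domain_size:
        raise ValueError("TP should be confined to a single NVLink domain")
    return world_size // model_parallel

# Example: 512 GPUs with TP=2, CP=2, PP=4 leaves DP=32.
dp = derive_data_parallel_size(512, tp=2, cp=2, pp=4)
print(dp)  # 32
```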
Key Configuration Settings
| Setting | Purpose |
|---|---|
| `TransformerConfig.tensor_model_parallel_size` | TP degree |
| `TransformerConfig.context_parallel_size` | CP degree |
| `TransformerConfig.pipeline_model_parallel_size` | PP degree |
| `TransformerConfig.virtual_pipeline_model_parallel_size` | VPP degree (reduces pipeline bubbles) |
| `TransformerConfig.expert_model_parallel_size` | EP degree for MoE |
| `OptimizerConfig.use_distributed_optimizer` | Enable distributed optimiser (default: true) |
| `DistributedInitConfig.use_torch_fsdp2` | Enable PyTorch FSDP2 |
| `MixedPrecisionConfig.fp8_param` | Enable FP8 parameter storage |
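A minimal configuration sketch using the field names from the table. The `TransformerConfig` and `OptimizerConfig` import paths follow Megatron-Core conventions, but the exact paths, the surrounding model-shape arguments, and the chosen degrees are assumptions rather than verbatim Megatron-Bridge code; `MixedPrecisionConfig.fp8_param` and `DistributedInitConfig.use_torch_fsdp2` would be set analogously on their respective config objects.

```python
# Sketch of the settings from the table above; import paths and model
# arguments are illustrative assumptions, not verbatim Megatron-Bridge code.
from megatron.core.transformer import TransformerConfig
from megatron.core.optimizer import OptimizerConfig

transformer_config = TransformerConfig(
    # Model shape (illustrative values for a ~70B dense model):
    num_layers=80,
    hidden_size=8192,
    num_attention_heads=64,
    # Parallelism degrees from the table:
    tensor_model_parallel_size=2,
    context_parallel_size=2,
    pipeline_model_parallel_size=4,
    virtual_pipeline_model_parallel_size=5,  # 80 layers / (4 PP * 5 VPP) = 4 layers per chunk
    expert_model_parallel_size=1,            # >1 only for MoE models
)

optimizer_config = OptimizerConfig(
    use_distributed_optimizer=True,  # shard optimiser state across DP ranks
)
```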
Key Claims and Findings
- FP8 is limited to linear layers in the Transformer block; actual speedup depends on what fraction of total step time these layers consume — small models see less benefit
- TP size should be confined to the NVLink domain (intra-node); using TP=8 for Llama 3 70B causes low GPU utilisation
- Combining TP=2 + CP=2 can outperform TP=4 when sequence length > hidden size
- For NVL72 systems: ensure the TP × DP product divides evenly into the 72-GPU NVLink domain to avoid cross-domain communication bottlenecks
- VPP reduces pipeline bubbles but increases inter-stage communication; a smaller VPP size is better when global batches contain many micro-batches, as the bubble-fraction formula after this list makes precise
- Asymmetric pipeline stage allocation (`account_for_embedding_in_pipeline_split`) improves load balance for models with large vocabularies
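The trade-off in the VPP claim above can be made precise with the standard bubble-fraction analysis from the Megatron-LM interleaved-schedule work; the formula below is that general result, not something specific to Megatron-Bridge.

```latex
% Pipeline bubble fraction under a 1F1B schedule with interleaving:
%   p = pipeline_model_parallel_size (number of stages)
%   m = number of micro-batches per global batch per pipeline
%   v = virtual pipeline (interleaving) factor, VPP
\[
  \text{bubble fraction} = \frac{1}{v} \cdot \frac{p - 1}{m}
\]
% Example: with p = 8 and m = 32, the bubble is 7/32 (about 22%) at
% v = 1 and about 5.5% at v = 4. With m = 128, the v = 1 bubble is
% already about 5.5%, so raising v further mainly adds inter-stage
% communication -- hence a smaller VPP size for many-micro-batch batches.
```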
Terminology
- Pipeline bubble: idle time in pipeline parallelism during warm-up and flush phases where some stages have no work
- NVL72: NVIDIA GB200 NVL72 system (72 GPUs connected via NVLink); high-bandwidth domain requiring careful TP/DP alignment
- MXFP8: microscaled FP8 with per-block quantisation (1×32 block size); less efficient than full tensor-wise FP8 due to higher quantisation overhead (see the block-scaling sketch after this list)
- Host performance-bound: training bottlenecked by CPU/host operations rather than GPU compute; occurs with many small GPU kernels
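To illustrate the quantisation-overhead point in the MXFP8 entry, here is a toy NumPy sketch (emphatically not Transformer Engine's implementation) contrasting one tensor-wide scale with per-1×32-block scales; real MX scales are additionally restricted to power-of-two values, which this sketch omits.

```python
# Toy contrast between tensor-wise and 1x32 block-wise (MXFP8-style)
# scale computation; illustrative only, not Transformer Engine code.
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value in the e4m3 format

x = np.random.randn(4096, 4096).astype(np.float32)

# Tensor-wise FP8: a single scale factor for the whole tensor.
tensor_scale = FP8_E4M3_MAX / np.abs(x).max()

# MXFP8-style: one scale per 1x32 block along the last dimension,
# i.e. 4096 * (4096 / 32) = 524,288 scales to compute, store, and
# apply instead of one -- the quantisation overhead noted above.
# (Real MX scales are power-of-two e8m0 values; omitted here.)
blocks = x.reshape(x.shape[0], -1, 32)
block_scales = FP8_E4M3_MAX / np.abs(blocks).max(axis=-1, keepdims=True)
print(block_scales.size)  # 524288
```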
Connections to Existing Wiki Pages
- Performance Tuning Guide (Deployment and Scaling angle) — cross-section covering MFU, TCO, and scale-out considerations
- DGX Cloud Benchmarking — benchmarking suite that measures the framework-level performance improvements this guide describes
- Performance Analysis — TensorRT-LLM — inference-side counterpart covering Nsight profiling for deployed models