Section 4 of Generative AI LLM Exam Study Guide

Training, Customizing, and Running Inference on LLMs

Abstract

This section addresses the infrastructure challenges of training, customizing, and running inference on Large Language Models (LLMs), given their massive scale in parameters and tokens. It establishes the necessity of combining Model Parallelism with Data Parallelism to overcome GPU memory capacity constraints, detailing specific parallelization strategies such as Pipeline, Tensor, and Sequence parallelism. It then analyzes optimization techniques, including Fully Sharded Data Parallelism (FSDP) and Quantization Aware Training (QAT), that minimize communication overhead and preserve model accuracy in quantized representations.

Key Concepts

  • Efficient LLM Training Architecture: The fundamental architectural challenge involves managing models with billions of parameters trained on trillions of tokens, a scale that exceeds the memory capacity of any single device. The solution necessitates a hybrid approach combining Model Parallelism to handle parameter storage and Data Parallelism to handle dataset sharding, ensuring distributed computation remains feasible.
  • Model Parallelism Mechanism: This technique partitions the model parameters and optimizer states across multiple GPUs so that each individual device stores only a subset of the total model parameters. This distribution allows the training of models that cannot fit into the memory of a single accelerator.
  • Pipeline Parallelism (Inter-layer): This method performs a vertical split of the model, dividing layers across GPUs in a pipeline fashion where each device computes for its assigned chunk and passes intermediate activations to the next stage. A primary limitation is the introduction of bubble time, where devices remain idle waiting for data, leading to a waste of computational resources.
  • Tensor Parallelism (Intra-layer): Operating horizontally, this strategy splits operations across GPUs to parallelize computation within a specific operation, such as matrix-matrix multiplication. This technique requires additional communication overhead to synchronize partial results and ensure the correctness of the final computation output.
  • Sequence Parallelism: This concept expands upon tensor-level parallelism by identifying regions within a transformer layer that are independent along the sequence dimension. By splitting these layers, both the compute and the activation memory can be distributed across tensor parallel devices, resulting in a smaller memory footprint.
  • Selective Activation Recomputation: This memory optimization strategy improves upon standard checkpointing by recognizing that different activations require different numbers of operations to recompute. Instead of checkpointing entire layers, it is possible to recompute only the specific parts that consume significant memory but are not computationally expensive to regenerate.
  • Standard Data Parallelism: In this configuration, the dataset is split into several shards allocated to different devices, with each device holding a full copy of the model replica. After the back-propagation step, the gradients of the model are all-reduced so that the model parameters on different devices remain synchronized.
  • Fully Sharded Data Parallelism (FSDP): FSDP is a variant of data parallelism that uniformly shards both model parameters and training data across workers, where computation for each micro-batch is local to the GPU. This configuration minimizes communication bubbles by aggressively overlapping communication with computation through operation reordering and parameter prefetching.
  • Quantization Aware Training (QAT): QAT trains the model using operations that mimic the quantization process during the training phase itself. This allows models to learn how to perform well in quantized representations, leading to improved accuracy when compared to models subjected only to post-training quantization.
  • Communication-Computation Overlap: A major challenge in parallel training is the overhead introduced by multiple GPUs, specifically the time spent communicating data versus the time spent computing. Techniques like FSDP are designed to minimize these bubbles by reordering operations to ensure communication happens in parallel with computation.
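The bubble time introduced by pipeline parallelism can be made concrete with a small simulation. The sketch below (hypothetical helper names, not tied to any framework) builds a naive pipeline timetable for a given number of stages and micro-batches, marks idle steps as `None`, and computes the fraction of device time lost to bubbles.

```python
# Sketch: naive pipeline schedule for S stages and M micro-batches.
# Each cell records which micro-batch a stage processes at a time step;
# None marks "bubble time" where the stage sits idle waiting for data.

def pipeline_schedule(num_stages: int, num_microbatches: int):
    total_steps = num_stages + num_microbatches - 1
    table = []
    for stage in range(num_stages):
        row = []
        for t in range(total_steps):
            mb = t - stage  # micro-batch this stage would process at step t
            row.append(mb if 0 <= mb < num_microbatches else None)
        table.append(row)
    return table

def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    table = pipeline_schedule(num_stages, num_microbatches)
    idle = sum(row.count(None) for row in table)
    total = sum(len(row) for row in table)
    return idle / total
```

Running `bubble_fraction(4, 8)` versus `bubble_fraction(4, 32)` shows why more micro-batches are used in practice: the fill-and-drain bubbles are amortized over a longer steady-state phase, shrinking the idle fraction.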
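The communication overhead of tensor parallelism described above can be illustrated with a row-parallel matrix multiply. In this minimal sketch (plain Python lists stand in for device tensors; function names are illustrative), the shared inner dimension is split across two simulated devices, each computes a partial product, and an all-reduce-style summation restores the exact full result.

```python
# Sketch: row-parallel matrix multiply across simulated devices.
# The weight's rows (and the input's columns) are sharded; each device
# computes a partial product; summing the partials is the communication
# step needed for correctness.

def matmul(a, b):
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner))
             for j in range(cols)] for i in range(rows)]

def row_parallel_matmul(x, w, num_devices=2):
    inner = len(w)
    chunk = inner // num_devices  # assumes inner dim divides evenly
    partials = []
    for d in range(num_devices):
        lo, hi = d * chunk, (d + 1) * chunk
        x_shard = [row[lo:hi] for row in x]  # device d's input slice
        w_shard = w[lo:hi]                   # device d's weight rows
        partials.append(matmul(x_shard, w_shard))
    # Communication step: element-wise sum of partial results (all-reduce).
    return [[sum(p[i][j] for p in partials)
             for j in range(len(partials[0][0]))]
            for i in range(len(partials[0]))]
```

The summation at the end is exactly the "additional communication overhead" the concept refers to: without it, each device holds only an incomplete partial result.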
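The FSDP memory pattern can also be sketched in miniature. In this toy model (hypothetical names, plain lists in place of tensors, and no real communication library), each worker persistently stores only a shard of the flat parameter vector; the full vector is all-gathered just in time for compute, and gradients are reduce-scattered so each worker keeps only its own slice.

```python
# Sketch: FSDP-style sharding on simulated workers. Real FSDP additionally
# overlaps these collectives with computation via prefetching; this toy
# version only shows the memory layout.

def shard(params, num_workers):
    n = len(params) // num_workers  # assumes even divisibility
    return [params[i * n:(i + 1) * n] for i in range(num_workers)]

def all_gather(shards):
    # Materialize the full parameter vector just-in-time for compute.
    return [p for s in shards for p in s]

def reduce_scatter(grads_per_worker, num_workers):
    # Sum gradients across workers, then hand each worker only its slice.
    length = len(grads_per_worker[0])
    n = length // num_workers
    summed = [sum(g[i] for g in grads_per_worker) for i in range(length)]
    return [summed[i * n:(i + 1) * n] for i in range(num_workers)]

params = [1.0, 2.0, 3.0, 4.0]
shards = shard(params, num_workers=2)   # each worker persistently holds 2 values
full = all_gather(shards)               # full copy exists only during compute
grads = [[0.1, 0.2, 0.3, 0.4], [0.3, 0.2, 0.1, 0.0]]  # per-worker gradients
grad_shards = reduce_scatter(grads, 2)  # each worker keeps its own slice
```

The design point to notice: unlike standard data parallelism, no worker ever has to store the full parameters, gradients, and optimizer states simultaneously.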

Key Equations and Algorithms

  • Pipeline Parallelism Execution: The algorithm splits model layers vertically, requiring each device to compute its chunk and pass intermediate activations to the next stage. This sequential dependency introduces bubble time where some devices are engaged in computation while others wait for data.
  • Tensor Parallelism Execution: This procedure splits matrix-matrix multiplication operations horizontally across GPUs to parallelize intra-layer computation. Additional communication is required after the split operations to aggregate partial results and ensure mathematical correctness.
  • Standard Data Parallelism Execution: The algorithm shards the dataset while maintaining a full model replica on each device. The procedure concludes with an all-reduce operation on the model gradients to synchronize parameters across the distributed cluster.
  • FSDP Sharding Strategy: The process configures sharding strategies to match the physical interconnect topology of the cluster. It restricts the number of blocks allocated for inflight unsharded parameters to optimize memory usage while aggressively overlapping communication with computation.
  • QAT Forward and Backward Pass: In the forward pass, the algorithm quantizes weights and activations to low-precision representations to mimic inference conditions. In the backward pass, the algorithm computes gradients using full-precision weights and activations to maintain stability during learning.
  • Selective Activation Recomputation Procedure: This method identifies activations within transformer layers that consume high memory but low compute to regenerate. The procedure checkpoints these specific parts rather than full layers, allowing more activations to be saved for the backward pass without exceeding memory constraints.
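The standard data parallelism procedure above ends with an all-reduce on gradients. A minimal simulation (illustrative names; real systems use collective libraries such as NCCL) shows what that step computes: every worker ends up with the same averaged gradient, which keeps the model replicas synchronized.

```python
# Sketch: gradient all-reduce in standard data parallelism. Each
# simulated worker computed gradients on its own data shard; averaging
# and broadcasting the result keeps every model replica identical.

def all_reduce_mean(grads_per_worker):
    num_workers = len(grads_per_worker)
    length = len(grads_per_worker[0])
    averaged = [sum(g[i] for g in grads_per_worker) / num_workers
                for i in range(length)]
    # Every worker receives the same averaged gradient vector.
    return [list(averaged) for _ in range(num_workers)]
```

After this step, each replica applies the identical update, so parameters never drift apart across devices.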
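The QAT forward/backward split can be sketched for a single weight. The forward pass rounds values onto a low-precision grid, mimicking inference conditions; the backward pass uses the straight-through estimator, passing gradients through as if quantization were the identity. This is an illustrative scalar version, not a framework's actual fused implementation.

```python
# Sketch: "fake quantization" as used in QAT for one scalar value.
# Forward: quantize + clamp + dequantize (mimics int8 inference).
# Backward: straight-through estimator keeps full-precision gradients.

def fake_quantize(x: float, scale: float, num_bits: int = 8) -> float:
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    q = round(x / scale)             # snap to the nearest integer level
    q = max(qmin, min(qmax, q))      # clamp to the representable range
    return q * scale                 # dequantize back to float

def fake_quantize_grad(upstream_grad: float) -> float:
    # Straight-through estimator: treat rounding as identity in backward,
    # so learning stays stable despite the non-differentiable forward.
    return upstream_grad
```

Because the model sees quantization error during training, it learns weights that remain accurate on the low-precision grid, which is the source of QAT's advantage over post-training quantization.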
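The selection step in selective activation recomputation can be modeled as a toy greedy policy. This is an assumption-laden sketch, not the exact heuristic of any published system: given each activation's memory cost and recompute cost, it drops (i.e., marks for recomputation) the activations with the best memory-saved-per-FLOP ratio until the stored set fits a memory budget.

```python
# Sketch: a toy greedy planner for selective activation recomputation.
# Activations that are memory-heavy but cheap to regenerate are dropped
# first; everything dropped is recomputed during the backward pass.

def plan_recomputation(activations, memory_budget):
    """activations: list of (name, memory_mb, recompute_flops)."""
    stored = list(activations)
    # Best candidates to drop: high memory saved per FLOP of recompute.
    candidates = sorted(activations,
                        key=lambda a: a[1] / a[2], reverse=True)
    recompute = []
    used = sum(a[1] for a in stored)
    for act in candidates:
        if used <= memory_budget:
            break
        stored.remove(act)
        recompute.append(act[0])
        used -= act[1]
    return {"store": [a[0] for a in stored], "recompute": recompute}
```

With hypothetical per-activation costs, a large softmax activation (big memory, few FLOPs to regenerate) is dropped first, while a compute-heavy MLP output stays checkpointed, matching the intuition in the procedure above.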

Key Claims and Findings

  • Scaling massively large AI models with billions of parameters and trillions of tokens comes with huge memory capacity requirements that single GPUs cannot satisfy.
  • Parallelism techniques inherently add communication or computation overhead; the model-parallel configuration should therefore be tuned for maximum performance before scaling out training with data parallelism.
  • Pipeline parallelism introduces bubble time where devices wait for intermediate activations, leading to a waste of computational resources if not managed correctly.
  • Fully Sharded Data Parallelism (FSDP) minimizes communication bubbles by aggressively overlapping communication with computation through operation reordering and parameter prefetching.
  • Quantization Aware Training leads to improved accuracy compared to post-training quantization because models learn to perform well in quantized representations during the training process.
  • Sequence parallelism enables the distribution of activation memory across tensor parallel devices by splitting layers along the sequence dimension.
  • Selective activation recomputation is viable for cases where memory constraints force the recomputation of some, but not all, of the activations.

Terminology

  • LLM (Large Language Model): A type of AI model characterized by billions of parameters and trained on trillions of tokens, requiring significant memory capacity.
  • Bubble Time: A period of inefficiency in parallel training where computational resources are idle while waiting for data or synchronization from other devices.
  • Inter-layer Parallelism: A synonym for Pipeline parallelism, where the model is split vertically by layers across different devices.
  • Intra-layer Parallelism: A synonym for Tensor parallelism, where operations within a single layer are split horizontally across devices.
  • All-Reduce: A synchronization operation used in data parallel training where the gradients of the model are combined across devices to update parameters.
  • FSDP (Fully Sharded Data Parallelism): An optimization technique that shards model parameters and training data uniformly across data parallel workers to minimize memory usage.
  • Parameter Prefetching: A strategy used in FSDP where parameters are fetched in advance to overlap communication latency with computation time.
  • QAT (Quantization Aware Training): A training methodology that mimics the quantization process during training to improve model performance in low-precision representations.
  • Activation Memory: The memory storage required for intermediate results during the forward pass, which can be reduced via sequence parallelism or recomputation.
  • Optimizer States: Per-parameter values maintained by the optimizer (for example, momentum and variance estimates in Adam) that are partitioned across GPUs in model parallelism alongside the model parameters to reduce per-device storage requirements.