Section 5 of Generative AI LLM Exam Study Guide
Abstract
This section establishes the hierarchy of methods used to customize Large Language Models (LLMs), ranging from inference-time prompt manipulation to full parameter updates. The central technical argument posits a critical trade-off between the computational effort and dataset size required for customization versus the level of accuracy improvement desired for downstream tasks. Within the deck’s progression, this section is pivotal as it details the transition from freezing weights to actively training internal representations, culminating in advanced alignment techniques such as Direct Preference Optimisation (DPO) and Reinforcement Learning with Human Feedback (RLHF).
Key Concepts
- Customization Spectrum and Trade-offs: Customization techniques are categorized based on the extent to which they alter model parameters, creating a spectrum from prompt engineering to full fine-tuning. The selection of a specific method is determined by weighing the constraints on dataset size and training effort against the necessary level of downstream task accuracy. Techniques that do not alter parameters offer lower latency and no catastrophic forgetting but may limit performance gains compared to methods that update weights.
- Prompt Engineering versus Prompt Learning: While both methods operate with frozen original weights, prompt engineering relies on manipulating inference-time input via few-shot examples or system instructions without model modification. In contrast, prompt learning involves training virtual prompt embeddings through gradient descent to impart task-specific knowledge, effectively learning the “prompt” rather than the text. This distinction allows prompt learning to adapt models to new tasks without overwriting foundational pretraining knowledge.
- Catastrophic Forgetting Mitigation: A primary motivation for using prompt learning and Parameter-Efficient Fine-Tuning (PEFT) is the avoidance of catastrophic forgetting, which occurs when a model loses foundational knowledge after learning new behaviors. Because the original model parameters remain frozen in these methods, the foundational knowledge gained during pretraining is preserved. This ensures that the model retains its general capabilities while specializing for specific use cases.
- Prompt Tuning Mechanism: In prompt tuning, soft prompt embeddings are initialized as a 2D matrix of size total_virtual_tokens by hidden_size for each specific task. These task-specific matrices are optimized during training while the base LLM parameters remain fixed, ensuring that tasks do not share parameters during training or inference. Once tuned, these embeddings are used alongside the standard text embeddings to condition the model’s generation.
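A minimal numpy sketch of the mechanism above, using assumed toy dimensions (not from the source): the task-specific soft prompt matrix is prepended to the frozen text embeddings before the transformer consumes them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy dimensions for illustration only.
total_virtual_tokens = 4
hidden_size = 8
seq_len = 5

# Task-specific soft prompt: a trainable 2D matrix of shape
# (total_virtual_tokens, hidden_size); one such matrix per task.
soft_prompt = rng.normal(size=(total_virtual_tokens, hidden_size))

# Frozen text embeddings for an input sequence of 5 tokens.
text_embeddings = rng.normal(size=(seq_len, hidden_size))

# The soft prompt is prepended to the text embeddings; during training,
# gradients flow only into soft_prompt while the base LLM stays frozen.
model_input = np.concatenate([soft_prompt, text_embeddings], axis=0)

print(model_input.shape)  # (9, 8): virtual tokens + real tokens
```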
- P-Tuning Architecture: P-tuning utilizes a specialized encoder, typically an LSTM or MLP model, to predict task-specific virtual token embeddings rather than storing them directly as a matrix. The prompt encoder parameters are randomly initialized and updated at each training step, while all base LLM parameters remain frozen. The resulting virtual tokens are stored in a lookup table keyed by task name for inference, and the encoder model itself is discarded post-training.
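A hypothetical MLP prompt encoder can illustrate the P-tuning flow (the source also mentions LSTM variants; all sizes and the task name below are assumptions for the sketch):

```python
import numpy as np

rng = np.random.default_rng(0)

hidden_size = 8
total_virtual_tokens = 4

# Randomly initialized encoder weights, updated at each training step
# while all base LLM parameters remain frozen.
W1 = rng.normal(size=(hidden_size, 16))
W2 = rng.normal(size=(16, hidden_size))

def prompt_encoder(seed_embeddings):
    """Predict task-specific virtual token embeddings (two-layer MLP)."""
    h = np.tanh(seed_embeddings @ W1)
    return h @ W2

# Learnable seed inputs, one row per virtual token.
seeds = rng.normal(size=(total_virtual_tokens, hidden_size))
virtual_tokens = prompt_encoder(seeds)

# After training, the outputs are cached in a lookup table keyed by
# task name for inference, and the encoder itself is discarded.
lookup_table = {"sentiment-task": virtual_tokens}
print(lookup_table["sentiment-task"].shape)  # (4, 8)
```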
- Adapter Layer Mechanics: Adapter Learning modifies the transformer architecture by introducing small feed-forward layers between the core layers that are trained during fine-tuning. These modules generally employ a bottleneck architecture that projects the input state to a lower-dimensional space, applies a nonlinear activation function, and projects back to the original dimension. They are typically initialized such that their initial output is zero to ensure the model behaves unchanged before training begins.
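The bottleneck-plus-zero-initialization idea can be checked numerically. A minimal sketch (toy sizes assumed): with the up-projection initialized to zero, the adapter reduces to the identity, so the pretrained model's behavior is unchanged before training.

```python
import numpy as np

rng = np.random.default_rng(0)

hidden_size, bottleneck = 8, 2  # assumed toy sizes

# Adapter: down-project, nonlinearity, up-project, plus residual.
W_down = rng.normal(size=(hidden_size, bottleneck))
W_up = np.zeros((bottleneck, hidden_size))  # zero init => adapter output is 0

def adapter(x):
    # ReLU bottleneck with a residual connection around it.
    return x + np.maximum(x @ W_down, 0.0) @ W_up

x = rng.normal(size=(3, hidden_size))
# Before training, the adapter is an identity function.
print(np.allclose(adapter(x), x))  # True
```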
- IA3 Scaling Method: IA3 introduces fewer parameters than Adapters or LoRA by using learned scaling vectors to rescale hidden representations within the transformer layer. These vectors, denoted l_k, l_v, and l_ff, rescale the keys and values in the attention mechanism and the inner activations in the feed-forward network, respectively. This method significantly reduces trainable parameters while enabling mixed-task batches and the ability to merge updates with base weights for zero latency overhead at inference.
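The merge-with-base-weights property follows from the element-wise form of the rescaling. A small numpy sketch (random vector and toy sizes assumed) shows that rescaling the keys after the projection equals folding the vector into the frozen weight matrix, which is why inference incurs no extra latency:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # assumed hidden size

# A learned IA3 key-scaling vector (random here, for illustration).
l_k = rng.uniform(0.5, 1.5, size=d)

W_k = rng.normal(size=(d, d))  # frozen key projection
x = rng.normal(size=(5, d))    # hidden states for a 5-token sequence

# Rescaling the keys element-wise after the projection ...
scaled = (x @ W_k) * l_k
# ... is equivalent to scaling the columns of the frozen weight once.
merged_W_k = W_k * l_k
print(np.allclose(scaled, x @ merged_W_k))  # True
```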
- LoRA Decomposition Strategy: LoRA injects trainable low-rank matrices into transformer layers to approximate weight updates rather than updating the full pretrained weight matrix W_0. Specifically, it focuses on updating the query and value projection weight matrices in the multi-head attention sub-layer using low-rank decomposition. This approach reduces the memory footprint required for training while maintaining competitive performance compared to full fine-tuning.
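A minimal sketch of the decomposition (toy hidden size and rank assumed): the effective weight is the frozen matrix plus the product of two small trainable factors, and the trainable parameter count drops from d*d to 2*d*r.

```python
import numpy as np

rng = np.random.default_rng(0)

d, r = 64, 4  # assumed: hidden size 64, rank 4

W0 = rng.normal(size=(d, d))        # frozen pretrained projection (e.g. W_q)
A = rng.normal(size=(r, d)) * 0.01  # trainable low-rank factor
B = np.zeros((d, r))                # zero init => the update starts at 0

delta_W = B @ A                     # low-rank approximation of the update
W_effective = W0 + delta_W

# Trainable parameters: 2*d*r for B and A instead of d*d for a full update.
print(d * d, 2 * d * r)  # prints: 4096 512
```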
- Supervised Fine-Tuning (SFT): SFT involves updating all the model’s parameters on labeled data of inputs and outputs to teach domain-specific terms and instruction following. Often referred to as instruction tuning, it combines fine-tuning with prompting paradigms by blending natural language instructions across multiple NLP datasets. This process typically precedes reinforcement learning stages and substantially improves zero-shot performance on unseen tasks at inference time.
- Direct Preference Optimization (DPO): DPO simplifies the alignment process by training an actor model against a frozen reference model using human preference data without an explicit reward training phase. In full-parameter DPO, the actor is the reference model plus trainable weights, whereas LoRA-based DPO initializes the actor with reference model weights plus LoRA weights. This eliminates the need for complex Reinforcement Learning with Human Feedback pipelines while maintaining alignment with human preferences.
Key Equations and Algorithms
- Prompt Tuning Embedding Dimensions: The prompt tuning process defines a task-specific embedding matrix initialized with dimensions total_virtual_tokens × hidden_size. This mathematical representation ensures that each task has its own associated 2D embedding matrix that operates independently without sharing parameters with other tasks. The optimization of these embeddings occurs via gradient descent while the base model remains static.
- P-Tuning Encoder Output: The prompt encoder in P-tuning generates outputs defined as task-specific virtual token embeddings passed as a 1D vector into the LLM. The input to this encoder is the task name, and the output dimension corresponds to the hidden size of the model. This allows for dynamic generation of soft prompts based on the task identity without modifying the transformer weights.
- IA3 Rescaling Operations: The rescaling mechanism in IA3 applies the learned vectors l_k, l_v, and l_ff to specific components of the transformer architecture. These vectors rescale the keys and values in attention mechanisms and the inner activations in position-wise feed-forward networks. The mathematical operation effectively scales a hidden state representation x element-wise, producing l ⊙ x, before passing it to the next layer.
- LoRA Weight Update Approximation: LoRA approximates the weight update ΔW of a pretrained matrix W_0 using low-rank matrices B ∈ R^{d×r} and A ∈ R^{r×k} such that ΔW = BA, with rank r ≪ min(d, k). Specifically, this update is applied to the query and value projection weight matrices in the multi-head attention sub-layer. This decomposition significantly reduces the number of trainable parameters compared to updating W_0 directly.
- LoRA-based DPO Parameterization: In LoRA-based DPO, the parameters of the actor model are defined as the sum of the frozen reference model parameters θ_ref and trainable LoRA weights θ_LoRA. This is expressed as θ_actor = θ_ref + θ_LoRA, ensuring that only the LoRA weights are updated during the training process. This structure maintains the reference model for calculating the logprobs required for the loss function.
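A toy numpy sketch of this parameterization (sizes assumed): the actor's effective weights are the frozen reference weights plus a low-rank LoRA update, so only the two small factors receive gradients.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 2  # assumed toy sizes

theta_ref = rng.normal(size=(d, d))  # frozen reference (SFT) weights
B = rng.normal(size=(d, r)) * 0.01   # trainable LoRA factors
A = rng.normal(size=(r, d)) * 0.01

# LoRA-based DPO actor: reference weights plus the low-rank update.
theta_actor = theta_ref + B @ A

# theta_ref stays frozen, so it can still supply the reference
# logprobs needed by the DPO loss.
print(theta_actor.shape)  # (16, 16)
```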
- RLHF Three-Stage Pipeline: The Reinforcement Learning with Human Feedback algorithm is structured as a sequential three-stage process: SFT, Reward Modeling, and Policy Optimization. Stage 1 involves fine-tuning the SFT model with instructions. Stage 2 trains a Reward Model (RM) to predict human preferences. Stage 3 fine-tunes the policy using Proximal Policy Optimization (PPO) against the RM.
- DPO Loss Penalty: The DPO algorithm utilizes a KL-penalty loss calculated using logprobs derived from a frozen reference model. The reference model is initialized with the SFT model weights and remains frozen during the DPO training of the actor. This penalty term prevents the policy from deviating too far from the original language distribution while optimizing for preferences.
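The loss above can be sketched for a single preference pair, assuming the standard DPO form: a negative log-sigmoid of the difference between the actor-vs-reference log-probability margins for the chosen and rejected responses (beta and all logprob values below are illustrative).

```python
import math

def dpo_loss(logp_actor_chosen, logp_actor_rejected,
             logp_ref_chosen, logp_ref_rejected, beta=0.1):
    """DPO objective for one preference pair (standard formulation)."""
    chosen_margin = logp_actor_chosen - logp_ref_chosen
    rejected_margin = logp_actor_rejected - logp_ref_rejected
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)); the implicit KL penalty comes from the
    # margins being measured against the frozen reference model.
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# At initialization the actor equals the reference, so the loss is log(2).
init_loss = dpo_loss(-5.0, -5.0, -5.0, -5.0)
print(round(init_loss, 4))  # 0.6931
```

Once the actor assigns relatively more probability to the chosen response than the reference does, the loss falls below log(2).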
Key Claims and Findings
- Prompt engineering constitutes customization strictly at inference time by manipulating the prompt without altering model parameters, whereas parameter-efficient fine-tuning introduces trainable layers to the architecture.
- Catastrophic forgetting is avoided when using prompt learning and PEFT methods because the original LLM weights are kept frozen throughout the training process.
- Few-shot prompting increases inference latency because sample prompt and completion pairs are prepended to the prompt, consuming additional compute resources during generation.
- Instruction tuning successfully combines the strengths of fine-tuning and prompting to improve LLM zero-shot performance on unseen tasks evaluated during inference.
- IA3 reduces the number of trainable parameters significantly more than LoRA and Adapters because it only updates rescaling vectors rather than low-rank matrices or bottleneck layers.
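A back-of-the-envelope comparison makes this claim concrete. All sizes below are assumptions for illustration (hidden size d=4096, FFN inner dimension 4*d, LoRA rank r=8 on W_q and W_v, adapter bottleneck m=64, biases ignored); exact counts depend on the configuration.

```python
# Trainable parameters per transformer layer under assumed sizes.
d, r, m = 4096, 8, 64

ia3 = 2 * d + 4 * d        # l_k and l_v over d, plus l_ff over the 4*d inner dim
lora = 2 * (2 * d * r)     # B and A for each of W_q and W_v
adapter = 2 * (2 * d * m)  # down- and up-projections for two adapter modules

print(ia3, lora, adapter)  # prints: 24576 131072 1048576
```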
- The learned rescaling vectors in IA3 can be merged with the base weights, resulting in no architectural change and no additional latency during the inference phase.
- Direct Preference Optimisation simplifies the alignment process by removing the need for a separate reward model training stage inherent to the standard RLHF pipeline.
Terminology
- Catastrophic forgetting: A phenomenon where a Large Language Model learns new behavior during fine-tuning at the cost of foundational knowledge gained during pretraining, often mitigated by freezing weights.
- Virtual prompt embeddings: Optimized 2D matrices or 1D vectors used in prompt learning methods to condition the model on specific tasks without modifying the base architecture.
- Prompt tuning: A technique where soft prompt embeddings are initialized as a matrix and optimized via gradient descent for a specific task while freezing the base model.
- Parameter-Efficient Fine-Tuning (PEFT): A class of techniques that introduce a small number of trainable parameters or layers to an existing LLM architecture and train only those on use-case–specific data.
- Supervised Fine-Tuning (SFT): The process of updating a pretrained model’s parameters on a mixture of datasets expressed through natural language instructions to improve instruction following.
- Reward Model (RM): A model trained on datasets of prompts with multiple responses ranked by humans to predict human preference scores for Reinforcement Learning with Human Feedback.
- Proximal Policy Optimization (PPO): A reinforcement learning algorithm used in the third stage of RLHF to fine-tune the policy model against the reward model’s preferences.
- Direct Preference Optimisation (DPO): An alignment method that trains an actor model against a frozen reference model using preference data directly, bypassing explicit reward training.
- Steering labels: Categorizations used in SteerLM to map responses into labeled categories associated with human preferences at inference time, rather than solely reinforcing good behaviors.
- Low-Rank Decomposition: A mathematical approach used in LoRA where a weight matrix is updated via the product of two smaller matrices to approximate the full weight update.