Section 3 of Generative AI LLM Exam Study Guide
Abstract
This section details the methodological framework for evaluating and training generative language models, bridging statistical experimentation with deep learning architectures. It establishes the protocols for empirical validation, including A/B testing and cross-validation, while defining the core mechanisms of representation learning through tokenization, embeddings, and the Transformer architecture. Furthermore, it outlines the optimization strategies for large-scale models, such as regularization and decoding techniques, and addresses the scaling laws and safety implications inherent in generative AI systems.
Key Concepts
- A/B Testing in Machine Learning: This statistical approach compares a control model against a new variant to determine whether observed performance differences are statistically significant rather than due to random chance. The organization must define a comparison metric and split users 50%-50% (or another ratio such as 30%-70%) into groups that run simultaneously for a predetermined duration. A/B testing is ineffective for evaluating large changes, such as new products or wholly new user experiences, and it requires rigorous sample size calculation based on the variance and the expected difference.
- Zero-shot Learning (ZSL): ZSL is defined as a problem setup where the learner observes samples from classes at test time which were not observed during training. Methods rely on auxiliary information, such as textual descriptions or class-class similarity in a continuous space, to associate observed and non-observed classes. Generalized ZSL extends this by allowing samples from both new and known classes at test time, often employing gating modules to distinguish between them.
- Cross-Validation Methods: These methods evaluate how well a learner generalizes to independent unseen data, identifying underfitting or overfitting. Non-exhaustive methods include the Holdout Method, which suffers from high variance, and K-Fold Cross Validation, where the dataset is split into subsets and the error is averaged over trials. Exhaustive methods compute all possible splits, such as Leave-One-Out Cross Validation, which is computationally intensive but avoids variance issues.
- Tokenization and Normalization: Tokenization segments running text into words, subwords, or characters, utilizing top-down standards like Penn Treebank or bottom-up statistics like Byte-Pair Encoding (BPE). Normalization includes case folding, stemming, and lemmatization to convert raw text into a structured format; lemmatization yields lexicographically correct roots, whereas stemming is a naive stripping of affixes that may not preserve valid words.
- Word Embeddings and Similarity: Embeddings represent words as vectors in a multi-dimensional space where distance reflects semantic similarity. Prediction-based methods like Word2Vec (CBOW and Skip-gram) and frequency-based methods like TF-IDF are used to generate these vectors. Cosine similarity is the standard metric for computing the relationship between two vectors, defined by the normalized dot product.
- Transformer Architecture: Unlike recurrent networks, Transformers rely on self-attention to process sequential data non-sequentially, allowing for efficient parallelization. The architecture utilizes Query, Key, and Value matrices to determine relevance, with positional encodings added to word embeddings to retain sequence order information since the model lacks recurrence.
- Training and Decoding Strategies: Models are trained using self-supervision and teacher forcing, where the correct history is provided to predict the next word. During generation, decoding strategies range from greedy selection (deterministic argmax) to sampling methods like Top-k or Top-p (nucleus) sampling, which trade off quality for diversity by selecting words from a probability distribution.
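The decoding strategies above can be sketched concretely. Below is a minimal, self-contained illustration (function names and the toy distribution are illustrative, not from the source): greedy decoding takes the argmax, while top-k and top-p (nucleus) filtering zero out the tail of the next-word distribution before sampling.

```python
import numpy as np

def top_k_filter(probs, k):
    """Keep only the k most probable tokens; renormalize so they sum to 1."""
    out = np.zeros_like(probs)
    idx = np.argsort(probs)[-k:]          # indices of the k largest probabilities
    out[idx] = probs[idx]
    return out / out.sum()

def top_p_filter(probs, p):
    """Nucleus sampling: keep the smallest set of tokens whose mass reaches p."""
    order = np.argsort(probs)[::-1]       # tokens sorted by descending probability
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, p) + 1  # number of tokens in the nucleus
    out = np.zeros_like(probs)
    keep = order[:cutoff]
    out[keep] = probs[keep]
    return out / out.sum()

rng = np.random.default_rng(0)
probs = np.array([0.5, 0.3, 0.1, 0.05, 0.05])  # toy next-word distribution
greedy = int(np.argmax(probs))                 # deterministic: always token 0
sampled = rng.choice(len(probs), p=top_k_filter(probs, 2))  # stochastic: 0 or 1
```

Note the trade-off the section describes: greedy decoding always returns the same token, while the filtered sampling calls produce diverse but still high-probability continuations.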
Key Equations and Algorithms
- A/B Test Sample Size: $N = \frac{16\sigma^2}{\delta^2}$, where $\sigma$ is the sample standard deviation and $\delta$ is the expected difference between the control and treatment. This rule of thumb determines the necessary number of subjects per group to ensure statistical validity.
- Hypothesis Testing for A/B: The Null Hypothesis ($H_0$) assumes the observed difference is due to random chance, while the Alternative Hypothesis ($H_1$) assumes a real difference. A significance level of $\alpha = 0.05$ is standard; if the p-value is less than 0.05, $H_0$ is rejected in favor of $H_1$.
- Cosine Similarity: $\cos(\theta) = \frac{A \cdot B}{\|A\|\,\|B\|}$. This metric computes the cosine of the angle between two vectors to determine their contextual similarity, where a value of 1 implies overlapping vectors and 0 implies orthogonality.
- Perplexity: The perplexity of a test set is the geometric mean of the inverse test set probability computed by the language model: $PP(W) = P(w_1 w_2 \dots w_N)^{-1/N}$. It serves as an intrinsic evaluation metric of language model quality, calculated over the sequence of words.
- Attention Score: The relevance of an encoder state to a decoder state is computed by a score function, often dot-product attention: $\text{score}(h_i^d, h_j^e) = h_i^d \cdot h_j^e$. This score is then normalized via softmax to derive context vectors dynamically for each decoding step.
- Scaling Laws: The performance (loss) of a large language model scales as a power-law with model size (parameter count $N$), dataset size ($D$), and compute budget ($C$). Specifically, large models like GPT-3 utilize billions of parameters (e.g., 175 billion) trained on extensive web text corpora to minimize loss.
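The sample-size rule, cosine similarity, and perplexity formulas above translate directly into code. The following is a minimal sketch (function names are illustrative) using only the standard library:

```python
import math

def ab_sample_size(sigma, delta):
    """Rule-of-thumb subjects per group: N = 16 * sigma^2 / delta^2."""
    return math.ceil(16 * sigma**2 / delta**2)

def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (|a| * |b|); 1 = same direction, 0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def perplexity(token_probs):
    """Geometric mean of the inverse probabilities the model assigns
    to the test tokens: PP = exp(-(1/N) * sum(log p_i))."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)
```

For example, detecting an expected difference of 0.1 metric units with standard deviation 1.0 requires 1600 subjects per group, and a uniform model assigning probability 0.25 to each of four tokens has perplexity 4, matching the "effective vocabulary size" intuition.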
Key Claims and Findings
- Statistical Significance Requirement: A p-value below the significance level (0.05) indicates an underlying difference between the control and the variant, meaning the observed change is unlikely to be due to random chance.
- Limitations of Baseline Evaluation: Cross-validation provides a better estimate of generalization error than residual evaluation alone, which only indicates performance on data used for training.
- Transformer Efficiency: Gated recurrent architectures like LSTMs mitigate the vanishing gradient problems of simple recurrent networks (SRNs) by explicitly controlling information flow; Transformers avoid recurrence altogether, using self-attention to give every token a direct path to every other token.
- BERT Pre-training Objective: BERT utilizes a Masked Language Modeling (MLM) objective where 80% of tokens are replaced with [MASK], 10% with random tokens, and 10% remain unchanged, forcing the model to fuse left and right context bidirectionally.
- Contextual Representation Superiority: Contextual embeddings, which represent word instances in specific contexts, outperform static embeddings for tasks like Word Sense Disambiguation (WSD), where the correct sense is selected via nearest-neighbor algorithms.
- Generative Decoding Trade-offs: Sampling methods like Top-k and Temperature sampling are used to generate diverse text, whereas greedy decoding is deterministic but may lead to repetitive or low-diversity outputs.
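The 80/10/10 MLM corruption scheme described above can be sketched in a few lines. This is a toy illustration, not BERT's actual preprocessing code; the vocabulary, function name, and 15% selection rate are assumptions chosen to match the standard BERT recipe:

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "cat", "dog", "sat", "on", "mat"]  # toy vocabulary

def mlm_corrupt(tokens, mask_rate=0.15, seed=0):
    """Corrupt a token sequence for masked language modeling.

    Each token is selected with probability mask_rate; of the selected
    tokens, 80% become [MASK], 10% become a random vocabulary token,
    and 10% are left unchanged. The model must predict the original
    token at every selected position, forcing bidirectional context use.
    Returns the corrupted sequence and a dict of position -> target token.
    """
    rng = random.Random(seed)
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            targets[i] = tok                          # prediction target
            r = rng.random()
            if r < 0.8:
                corrupted.append(MASK)                # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted.append(rng.choice(VOCAB))   # 10%: random token
            else:
                corrupted.append(tok)                 # 10%: keep unchanged
        else:
            corrupted.append(tok)
    return corrupted, targets
```

The "keep unchanged" branch matters: because a selected token sometimes stays as-is, the model cannot assume that unmasked tokens are always correct and must build useful representations for every position.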
Terminology
- Lemma: The citation (dictionary) form of a word — for verbs, the infinitive (e.g., “be” for “are, is, am”) — which may have multiple meanings referred to as word senses.
- Word Sense Disambiguation (WSD): The task of determining which sense of a polysemous word is being used in a particular context, often mapped to discrete lists like WordNet.
- Masked Language Modeling (MLM): A pre-training objective where the model predicts the original tokens at masked positions in an input sequence, enabling bidirectional context understanding.
- Next Sentence Prediction (NSP): A pre-training task where the model predicts whether two presented sentences are actually adjacent in the training corpus or unrelated, using [SEP] and [CLS] tokens.
- RoPE (Rotary Position Embeddings): A technique that combines absolute and relative position embeddings using rotation matrices, allowing for extrapolation to longer sequences at inference time compared to sine-based absolute encoding.
- ALiBi (Attention with Linear Biases): A positional encoding technique that does not add embeddings to word vectors but instead biases query-key attention scores with a penalty proportional to the distance between tokens.
- Teacher Forcing: A training strategy where the model is always given the correct history sequence to predict the next word, rather than using its own previous predictions.
- Self-Attention: An attention mechanism that relates different positions of a single sequence to compute a representation, allowing each token to attend to all other tokens in the sequence.
- Cross-Attention: An attention mechanism used in encoder-decoder models where queries come from the decoder and keys/values come from the encoder output.
- Regularization: Any mechanism that reduces overfitting, including L1 (absolute weight penalty), L2 (squared weight penalty), Dropout (random unit removal), and Early Stopping.
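Several of the terms above — Query/Key/Value projections, self-attention, and ALiBi's distance penalty — fit into one small sketch. This is a minimal single-head illustration with assumed shapes and randomly initialized weights, not a production implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax: subtract the row max before exponentiating."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv, alibi_slope=None):
    """Scaled dot-product self-attention over one sequence.

    X: (seq_len, d_model). Every token's query scores every token's key,
    so each position attends to all others. With alibi_slope set, a
    linear penalty proportional to token distance is subtracted from the
    raw scores (ALiBi-style) instead of adding positional embeddings to X.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # (seq_len, seq_len) relevance
    if alibi_slope is not None:
        pos = np.arange(X.shape[0])
        dist = np.abs(pos[:, None] - pos[None, :])
        scores = scores - alibi_slope * dist     # bias scores, not embeddings
    weights = softmax(scores, axis=-1)           # each row sums to 1
    return weights @ V                           # context vector per token

rng = np.random.default_rng(0)
d = 4
X = rng.normal(size=(3, d))                      # 3 tokens, d_model = 4
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv, alibi_slope=0.5)
```

Cross-attention differs only in where the inputs come from: the queries would be computed from decoder states while the keys and values come from the encoder output, with the same score-softmax-weighted-sum mechanics.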