Section 8 of Generative AI LLM Exam Study Guide
Abstract
This section details Retrieval-Augmented Generation (RAG), a technique designed to enhance the accuracy and reliability of generative AI models by fetching facts from external sources rather than relying solely on parametric knowledge. It establishes the architectural pipeline comprising offline document ingestion and embedding, followed by online query retrieval and response generation. The discussion covers critical design trade-offs between fine-tuning and RAG, methods to improve retrieval accuracy without parameter updates, and specific strategies for optimizing latency and memory usage at scale. Finally, it outlines rigorous evaluation frameworks using metrics such as Recall@K and NDCG to ensure enterprise-grade system performance.
Key Concepts
- RAG Architecture: The system operates as a pipeline with two distinct phases. The offline phase handles document ingestion, preprocessing, and embedding generation to prepare the knowledge base. The online phase is the inference workflow: a user query is processed, relevant context is retrieved from the vector database, and the Large Language Model (LLM) generates a response leveraging that external information.
- Embedding Models and Search Types: Dense embeddings express the semantic structure of text, serving as the heart of every retrieval pipeline. Search can be symmetric, where queries and corpus entries are of similar length, or asymmetric, where a short query seeks a longer paragraph answer. NVIDIA Retrieval QA Embedding Models utilize fine-tuned versions of E5-Large-Unsupervised to handle these short-form query and long-form passage distinctions effectively.
- Document Transformation and Chunking: To ensure retrieved content is semantically relevant with minimal noise, documents must be transformed into smaller segments. Chunking methods include fixed-size splitting, sentence splitting using libraries like NLTK, and recursive chunking which divides text hierarchically using separators. Semantic chunking further utilizes embeddings to group sentences based on topic distance to prevent thematic discontinuity within a chunk.
- Vector Indexing Strategies: To handle large datasets efficiently, Approximate Nearest Neighbor (ANN) methods are employed within vector databases. Index types include IVF-Flat for datasets fitting in GPU memory and IVF-PQ for larger datasets requiring compression via product quantization. Graph-based methods like CAGRA are optimized for small-batch cases, balancing recall, speed, and cost.
- System Acceleration Techniques: Low latency is crucial for responsive chatbots and can be achieved through GPU acceleration and model optimization. NVIDIA NeMo Retriever accelerates indexing and retrieval, while TensorRT-LLM optimizes LLM deployment for inference efficiency. Deduplication and chunking can also be accelerated using NVIDIA GPUs to perform parallel data frame operations and min hashing.
- Memory Management in Deployment: Deploying RAG at scale requires hosting LLMs, embedding models, and vector databases simultaneously, placing heavy demands on GPU memory. The Key-Value (KV) cache, used to cache self-attention tensors, scales with batch size and context length; for instance, batching Llama-2-70B with a 4096 context size can require substantial memory for the cache alone. Model sharding across multiple GPUs is often necessary but adds latency and complexity.
- Retrieval Evaluation Metrics: Assessing the retriever’s effectiveness is critical for enterprise-grade applications. Metrics like Recall@K measure the percentage of relevant results retrieved within the top K items, while Normalized Discounted Cumulative Gain (NDCG) considers the rank order of results. BEIR and MTEB benchmarks are standard proxies for evaluating embedding models across various retrieval tasks.
- Orchestration and State Management: Frameworks like LangChain provide building blocks for setting up systems with multiple LLM components. This includes the concept of a Running State Chain, in which a dictionary maintains variables across the system, allowing branches to read from and update that shared state as they generate responses. This structure supports complex chains that operate in multi-pass fashion to accumulate useful information.
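The chunking strategies described above can be sketched in a few lines of plain Python. The function names, sizes, and separator list below are illustrative choices, not any specific library's API:

```python
def fixed_size_chunks(text, chunk_size=200, overlap=50):
    """Fixed-size splitting: slide a character window across the text,
    overlapping consecutive chunks so context is not cut mid-thought."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

def recursive_chunks(text, max_size=200, separators=("\n\n", "\n", ". ", " ")):
    """Recursive chunking: split on the coarsest separator first
    (paragraphs, then lines, then sentences), recursing to finer
    separators only when a piece is still too large."""
    if len(text) <= max_size:
        return [text]
    for sep in separators:
        if sep in text:
            chunks, current = [], ""
            for part in text.split(sep):
                candidate = (current + sep + part) if current else part
                if len(candidate) <= max_size:
                    current = candidate
                    continue
                if current:
                    chunks.append(current)
                if len(part) > max_size:
                    # This piece alone exceeds the limit: recurse with
                    # the finer separators further down the list.
                    chunks.extend(recursive_chunks(part, max_size, separators))
                    current = ""
                else:
                    current = part
            if current:
                chunks.append(current)
            return chunks
    # No separator found at all: fall back to a hard fixed-size split.
    return fixed_size_chunks(text, max_size, 0)
```

Semantic chunking would go one step further, merging adjacent sentences only while their embedding distance stays under a threshold.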
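The KV-cache memory pressure noted above can be estimated with back-of-the-envelope arithmetic. The sketch below uses the standard sizing formula (two cached tensors per layer, one each for keys and values, shaped [batch, heads, sequence, head dimension]); the 70B-class model dimensions, fp16 precision, and batch size are assumed purely for illustration:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_elem=2):
    """KV cache size: 2 tensors (K and V) per layer, each holding
    one head_dim vector per head, per token, per sequence in the batch."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

# Assumed 70B-class dimensions (80 layers, 64 heads of dimension 128,
# no grouped-query attention) at fp16, 4096-token context, batch of 8.
size = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128,
                      seq_len=4096, batch_size=8)
print(f"{size / 2**30:.0f} GiB")  # prints "80 GiB"
```

The linear scaling in both `seq_len` and `batch_size` is why large context windows and high-throughput batching compound each other's memory cost, and why techniques like grouped-query attention (which shrinks `n_kv_heads`) matter at deployment time.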
Key Equations and Algorithms
- LLM Latency Formula: Total latency = TTFT + (TPOT × number of output tokens). This expression defines the total response time, where TTFT is the Time to First Token and TPOT is the Time per Output Token, indicating that smaller models reduce both components.
- Recall@K Calculation: Recall@K = (relevant items in the top K results) / (total relevant items). This metric measures the percentage of relevant results found within the system’s output, ignoring the rank order; for example, retrieving two of three relevant chunks yields a recall of 0.66.
- NDCG Computation: NDCG = DCG / IDCG, where DCG = Σ rel_i / log2(i + 1) over result positions i, and IDCG is the DCG of the ideal ordering. This algorithm assigns relevance scores to retrieved items, applies a discount based on position to incentivize better ranking, and normalizes the score across queries with different numbers of relevant chunks.
- Approximate Nearest Neighbor Search: These algorithms organize vectors into clusters or graphs to locate the most relevant vectors without performing an exact search, trading off some recall for increased search speed.
- Cluster Formation: K-means splits data into K clusters based on centroid proximity, while DBSCAN forms clusters based on density to distinguish outliers, optimizing vector organization for search.
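Recall@K and NDCG as defined above translate directly into code. In this sketch the signatures are illustrative: `relevant` is the set of ground-truth relevant IDs, and `relevance` is assumed to be a mapping from document ID to a graded relevance score:

```python
import math

def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant set that appears in the top-k results
    (rank-agnostic: position within the top k does not matter)."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def ndcg_at_k(retrieved, relevance, k):
    """DCG over the top-k results, discounted by log2 of position,
    normalized by the ideal DCG for this query's relevance labels."""
    dcg = sum(relevance.get(doc, 0) / math.log2(i + 2)
              for i, doc in enumerate(retrieved[:k]))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# The worked example from the text: two of three relevant chunks retrieved.
print(recall_at_k(["d1", "d4", "d2"], ["d1", "d2", "d3"], k=3))  # 0.666...
```

Note the `i + 2` offset: Python enumerates from 0 while the discount formula positions start at 1, so position i uses log2(i + 1) in 1-based terms.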
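To make the IVF-style cluster search concrete, here is a toy sketch assuming squared Euclidean distance and a plain-Python Lloyd's k-means as the coarse quantizer. Production systems use optimized libraries (e.g., FAISS or RAPIDS cuVS) rather than code like this:

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(vecs):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vecs)
    return [sum(col) / n for col in zip(*vecs)]

def kmeans(vectors, k, iters=20, seed=0):
    """Toy Lloyd's k-means: assign each vector to its nearest centroid,
    recompute centroids, repeat. Returns centroids and their buckets."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda i: sq_dist(v, centroids[i]))
            buckets[nearest].append(v)
        centroids = [mean(b) if b else centroids[i]
                     for i, b in enumerate(buckets)]
    return centroids, buckets

def ivf_search(query, centroids, buckets, nprobe=1):
    """Approximate search: scan only the nprobe clusters whose centroids
    are closest to the query, instead of every vector in the index."""
    order = sorted(range(len(centroids)),
                   key=lambda i: sq_dist(query, centroids[i]))
    candidates = [v for i in order[:nprobe] for v in buckets[i]]
    return min(candidates, key=lambda v: sq_dist(query, v))
```

Raising `nprobe` recovers recall at the cost of speed; at `nprobe` equal to the cluster count the search degenerates to exact brute force, which is exactly the trade-off the ANN discussion above describes.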
Key Claims and Findings
- RAG serves as a sophisticated form of prompt engineering that enhances LLM prompts with external database information to mitigate hallucinations.
- Fine-tuning customizes a pretrained LLM by updating most parameters for specific domains, but RAG offers a resource-efficient alternative for accessing real-time data.
- Performance degrades significantly when relevant information appears in the middle of long contexts, known as the “Lost in the Middle” phenomenon.
- Batch size is directly proportional to KV cache size, creating large GPU memory requirements when deploying models like Llama-2-70B with large context windows.
- Bi-encoders produce individual sentence embeddings for efficient comparison, whereas cross-encoders score sentence pairs jointly and do not produce standalone embeddings.
- RAGAS provides a reference-free evaluation framework using LLMs under the hood to measure context precision, context recall, and faithfulness.
- Asymmetric semantic search is appropriate for scenarios where a short query requires a match against longer document paragraphs.
- Security in RAG applications is maintained through Role-Based Access Control (RBAC) and guardrails like NeMo Guardrails to manage inputs and outputs.
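The bi-encoder versus cross-encoder distinction above can be illustrated with deliberately toy scoring functions. Everything here is hypothetical: a bag-of-words "embedding" and a word-overlap score stand in for the transformer encoders a real system would use:

```python
def embed(text):
    """Bi-encoder style: each text maps to its OWN vector, independently.
    Toy bag-of-words over a tiny fixed vocabulary (illustration only)."""
    vocab = sorted(set("the quick brown fox dog lazy jumps over a".split()))
    words = text.lower().split()
    return [words.count(w) for w in vocab]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def bi_encoder_score(query, doc):
    # Document embeddings can be precomputed and indexed offline;
    # query time needs only one embedding plus a dot product.
    return dot(embed(query), embed(doc))

def cross_encoder_score(query, doc):
    # The PAIR is scored jointly; no standalone embedding exists, so
    # every (query, doc) pair must be scored fresh -- here Jaccard overlap.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q | d), 1)
```

This is why the usual pattern is bi-encoder retrieval over the whole corpus followed by cross-encoder re-ranking of only the top candidates: the joint scorer is more accurate but cannot be indexed.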
Terminology
- TTFT (Time to First Token): The initial latency measurement representing the time between a user query and the generation of the first output token by the LLM.
- TPOT (Time per Output Token): The latency measurement for generating each subsequent token after the first one, contributing to total response duration.
- KV Cache (Key-Value Cache): Memory used to cache self-attention tensors during processing to avoid redundant computation, with size scaling based on batch size and context length.
- Bi-Encoder: An embedding model architecture that processes sentences independently to produce vector embeddings suitable for fast semantic search and clustering.
- Cross-Encoder: An architecture that simultaneously processes sentence pairs to output a similarity score, typically used for re-ranking search results rather than initial retrieval.
- NDCG (Normalized Discounted Cumulative Gain): A rank-aware metric that evaluates retrieval quality by considering both the relevance score of items and their position in the result list.
- RAGAS (Retrieval-Augmented Generation Assessment): A suite of tools and metrics for evaluating RAG systems, including context precision and answer relevance, often implemented as an LLM-as-a-Judge.
- ANN (Approximate Nearest Neighbor): A search method used in vector databases to find vectors that are similar to a query by relaxing the requirement for exact matches to improve speed.
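As a worked example of the TTFT/TPOT decomposition defined above, a minimal helper (argument names are assumed; some definitions fold the first token into TTFT and multiply TPOT by the token count minus one):

```python
def total_latency_ms(ttft_ms, tpot_ms, n_output_tokens):
    """Total response time: time to first token, plus per-token
    generation time for the output tokens."""
    return ttft_ms + tpot_ms * n_output_tokens

# E.g., 200 ms to first token, 50 ms per token, 10-token answer:
print(total_latency_ms(200, 50, 10))  # prints 700
```

The decomposition makes the tuning levers explicit: prompt length and model size dominate TTFT (the prefill phase), while per-token decode cost and answer length dominate the TPOT term.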