Chapter 3 of Document Overview

Abstract

This chapter establishes the fundamental architectural dichotomy between encoder and decoder components within the transformer framework, defining their distinct probabilistic outputs and contextual processing mechanisms. It delineates the formal input and output schemas characterizing modern large language models, specifically quantifying context windows and generation limits. The central technical argument posits that contemporary systems, such as GPT-4 and Claude, operate primarily as decoder-only models, necessitating specific agent design strategies to mitigate inherent architectural constraints. Understanding these limitations is critical for designing systems that function reliably within the bounded capacity of autoregressive generation.

Key Concepts

  • Encoder Architecture: Designed to generate useful encodings of all text by factoring in full context. It employs bi-directional embedding to consider both past and future information relative to a specific token. The output consists of rich contextual representations suited to understanding tasks such as classification. Models such as BERT exemplify this encoder-only approach focused on analysis (a minimal embedding-extraction sketch follows this list).
  • Decoder Architecture: Functions to predict the next word given the context offered so far through autoregressive forecasting. It generates text one token at a time, strictly adhering to a unidirectional view in which only previous tokens are visible. The output is a probability distribution over the next token, making it ideal for generation tasks. Examples include the GPT series, which are decoder-only models used for text creation (see the decoding-loop sketch after this list).
  • Contextual Embeddings: Represent the output of the encoder process, defined mathematically as $P(\text{encoding} \mid x_1, \ldots, x_n)$. These embeddings factor in the surrounding text to create a rich representation of meaning. They allow the model to understand and encode semantic information from the entire sequence simultaneously. This contrasts with the sequential nature of decoder representations.
  • Autoregressive Forecasting: The core process driving the decoder, calculating the next-token probability $P(x_{t+1} \mid x_1, \ldots, x_t)$. It relies solely on the context offered so far, preventing access to future information during generation. This sequential dependency defines the generation flow of modern decoder-only models and ensures that the text continues logically from the preceding tokens.
  • Input Schema Specifications: Dictate the expected format for LLM inputs, allowing for natural language text that can be freeform or structured. Modern systems support long input contexts extending up to 100K+ tokens. The input must conform to training formats, such as instruction templates or chat formats. Special tokens (for example, sequence-start and sequence-end markers or role delimiters in chat templates) are often used to delineate sections.
  • Output Schema Characteristics: Define the characteristics of generated text, which should be natural language and task-appropriate. Ideally, the output represents a perfect completion aligned with instructions. Practically, it serves as a best-effort continuation trained for quality. Lengths are typically short, constrained to less than 8K tokens.
  • Decoder Limitations Framework: Identifies six critical constraints that are fundamental architectural features rather than bugs. These include struggles with out-of-domain formats, super-long inputs, and super-long outputs. Understanding these constraints is crucial for designing robust agent systems that work around these inherent boundaries. They drive specific architectural decisions in downstream applications.
  • Modern LLM Predominance: Refers to the observation that current state-of-the-art models like GPT-4 and Claude are primarily decoder-only. This shift impacts how text is generated, relying heavily on autoregressive methods. While encoder architecture is less common now, it remains important for understanding the conceptual foundation. It also retains utility for specific tasks such as classification.
  • Agent System Design: Requires working around decoder limitations through careful design of perception, reasoning, and action patterns. Since limitations like output length are not bugs, agents must implement iterative strategies. Solutions include generating content in chunks or summarizing long context to prevent performance degradation. This ensures the system remains consistent within the context window.
  • Training Format Conformity: States that models achieve best results with input patterns familiar during training. Deviations from these patterns can lead to suboptimal performance. Standard templates are recommended to maintain alignment with the model’s learned distributions. This adherence ensures the model processes the input as intended without format mismatch errors.
  • In-Context Self-Fulfilling: Describes the tendency of generated output to stay consistent with whatever is already in context. Without monitoring, however, long generations may still drift and contradict earlier context. This characteristic implies that consistency checks are necessary during long generation tasks, as the model risks drifting from the initial instructions over time.
  • Super-Long Context Handling: Addresses performance degradation that occurs with very long inputs exceeding standard limits. The prescribed solution involves summarizing or chunking content to fit within efficient processing bounds. This mitigates the loss of quality associated with processing excessive token volumes in a single pass. It is a necessary preprocessing step for complex document analysis.
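The two bullets on encoder architecture and contextual embeddings can be made concrete with a short sketch. The snippet below is a minimal illustration rather than anything prescribed by this chapter; it uses Hugging Face transformers with BERT (the encoder-only model named above) to pull one bi-directionally conditioned vector per token:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Encoder-only model: every token attends to the full sequence.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
encoder.eval()

batch = tokenizer("The bank raised interest rates.", return_tensors="pt")
with torch.no_grad():
    output = encoder(**batch)

# Shape (1, seq_len, 768): one contextual embedding per token. Because
# attention is bi-directional, the vector for "bank" reflects the
# financial sense implied by the rest of the sentence.
embeddings = output.last_hidden_state
print(embeddings.shape)
```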
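The decoding loop below shows the autoregressive side in the same spirit: a hedged sketch using GPT-2 (a decoder-only model from the GPT series mentioned above), greedily picking one token at a time from $P(x_{t+1} \mid x_1, \ldots, x_t)$:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
decoder = AutoModelForCausalLM.from_pretrained("gpt2")
decoder.eval()

ids = tokenizer("The decoder predicts", return_tensors="pt").input_ids
for _ in range(20):
    with torch.no_grad():
        logits = decoder(ids).logits  # (1, seq_len, vocab_size)
    # Greedy pick from the next-token distribution; only past tokens are
    # visible, which is the unidirectional constraint in action.
    next_id = logits[0, -1].argmax().view(1, 1)
    ids = torch.cat([ids, next_id], dim=1)
print(tokenizer.decode(ids[0]))
```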

Key Equations and Algorithms

  • Encoder Probability Expression: $P(\text{encoding} \mid x_1, \ldots, x_n)$. This expression represents the probability of an encoding given the full sequence of text from index $1$ to $n$. It mathematically defines the bi-directional nature of the encoder, allowing it to condition on the entire context. This facilitates dense semantic understanding across the input sequence.
  • Decoder Probability Expression: $P(x_{t+1} \mid x_1, \ldots, x_t)$. This formula calculates the probability of the next token conditioned on all previous tokens from index $1$ to $t$. It encapsulates the autoregressive nature of the decoder, where future tokens are unavailable during the current step. This drives the sequential generation process token by token.
  • Autoregressive Generation Procedure: A process where text is generated one token at a time based on previous context. The algorithm iteratively predicts the next-token distribution and samples from it. Because each step consumes the output of the prior step, generation is inherently sequential: producing $n$ tokens requires $n$ forward passes. It enforces the unidirectional constraint of the decoder architecture.
  • Input Formatting Algorithm: A procedure to ensure input conforms to expected training formats like instruction templates. It involves wrapping user input with the model's special tokens (for example, the role delimiters of a chat template). The algorithm also checks context length to ensure it does not exceed the 100K+ token limit. This ensures the model processes the data without format errors (a formatting sketch appears after this list).
  • Iterative Output Generation: A strategy to address the limitation of super-long outputs, typically capped at less than 8K tokens. The procedure involves generating text in chunks while maintaining context continuity; each chunk is produced sequentially to build the final response. This mitigates the architectural limit on single-pass output length (see the iterative-generation sketch after this list).
  • Context Chunking Solution: A method to handle super-long context, where performance degrades with very long inputs. The algorithm summarizes or divides content into smaller segments before processing, preventing the accuracy losses caused by excessive input length. It allows the system to analyze documents longer than the native context window (a map-reduce style sketch follows this list).
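A minimal sketch of the input formatting step, assuming an illustrative 100K-token budget and placeholder role markers; real models ship their own templates (e.g., tokenizer.apply_chat_template in Hugging Face transformers):

```python
from transformers import AutoTokenizer

MAX_CONTEXT_TOKENS = 100_000  # illustrative budget, per this chapter

def format_instruction(system: str, user: str, tokenizer) -> str:
    # The <|system|>/<|user|>/<|assistant|> markers are placeholders,
    # not any particular model's actual special tokens.
    prompt = f"<|system|>\n{system}\n<|user|>\n{user}\n<|assistant|>\n"
    n_tokens = len(tokenizer.encode(prompt))
    if n_tokens > MAX_CONTEXT_TOKENS:
        raise ValueError(f"prompt is {n_tokens} tokens, over the limit")
    return prompt

tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(format_instruction("You are a helpful assistant.",
                         "Summarize chapter 3.", tokenizer))
```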
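Iterative output generation can be sketched as a loop that repeatedly feeds the prompt plus everything generated so far back to the model. Here generate_fn and the stop marker are hypothetical stand-ins for a single length-bounded LLM call:

```python
from typing import Callable

def generate_long(prompt: str, generate_fn: Callable[[str], str],
                  max_rounds: int = 8, stop_marker: str = "[DONE]") -> str:
    # Each round yields at most one pass worth of output (~8K tokens);
    # appending it to the prompt preserves context continuity.
    output = ""
    for _ in range(max_rounds):
        piece = generate_fn(prompt + output)
        output += piece
        if stop_marker in piece:  # the model signals completion
            break
    return output
```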
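Context chunking admits a similarly small sketch in a map-reduce style, with summarize_fn standing in for one bounded summarization call:

```python
def summarize_document(text: str, tokenizer, summarize_fn,
                       chunk_tokens: int = 4_000) -> str:
    # Map: split the document into token-bounded chunks and summarize each.
    ids = tokenizer.encode(text)
    chunks = [tokenizer.decode(ids[i:i + chunk_tokens])
              for i in range(0, len(ids), chunk_tokens)]
    partials = [summarize_fn(chunk) for chunk in chunks]
    # Reduce: condense the partial summaries in a single final pass.
    return summarize_fn("\n".join(partials))
```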

Key Claims and Findings

  • Modern LLMs like GPT-4 and Claude operate primarily as decoder-only models.
  • Decoder limitations are fundamental architectural constraints rather than software bugs.
  • Practical output limits for generated text are typically less than 8K tokens.
  • Input context support extends up to 100K+ tokens for modern models.
  • Encoder models create rich contextual representations using bi-directional embedding.
  • Agent systems must design perception, reasoning, and action patterns to work around limitations.
  • Performance degrades significantly when processing super-long input contexts.
  • Models struggle with out-of-domain formats not seen during the training phase.
  • Best results are achieved when inputs conform to standard training format templates.
  • Decoders cannot see future tokens, restricting them to unidirectional context analysis.

Terminology

  • Encoder: A transformer component that generates encodings factoring in full context using bi-directional embedding.
  • Decoder: A transformer component that predicts the next word given context offered so far via autoregressive forecasting.
  • Bi-directional: A property allowing the model to consider both past and future context during embedding creation.
  • Autoregressive: A generation process where text is produced one token at a time based on previous tokens.
  • Contextual Embeddings: Rich representations created by the encoder that factor in the full sequence context.
  • Input Schema: The expected format for inputs, involving natural language text and special tokens.
  • Output Schema: The characteristics of generated text, including length limits and task appropriateness.
  • Special Tokens: Reserved symbols (such as sequence delimiters and chat role markers) used to structure chat or instruction formats.
  • In-Context Self-Fulfilling: The tendency of output to stay consistent with the provided context.
  • Out-of-Domain Formats: Input structures that the model struggles with because they were not seen in training.
  • Super-Long Context: Input data length that exceeds optimal performance thresholds, causing degradation.
  • Agent Design: The architectural planning of systems to navigate LLM limitations through iterative patterns.