KV Caching Explained: Optimizing Transformer Inference Efficiency

Hugging Face blog post by Not Lain, published 2025-01-30. Introductory-level treatment of KV caching mechanics.

Abstract

KV caching is a technique that eliminates redundant attention computation during autoregressive text generation. In a standard transformer without a cache, generating each new token requires recomputing the key and value projections for every preceding token, so the total work grows quadratically with sequence length. KV caching avoids this by storing the key and value tensors the first time they are computed: each subsequent decode step computes Q, K, and V only for the single new token and reads the K and V of earlier tokens from the cache. The result is substantially faster generation, particularly for long sequences, at the cost of increased memory usage. A benchmark on a T4 GPU showed a roughly 5× speedup over standard inference for a 300-token generation task.
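
To make the redundancy concrete, the sketch below counts how many per-token K/V projections a single attention layer computes with and without a cache. The function names and the prompt length of 10 tokens are illustrative assumptions, not taken from the post; only the 300-token generation length mirrors the benchmark.

```python
def kv_projections_without_cache(prompt_len: int, new_tokens: int) -> int:
    """Every step re-runs the whole sequence so far, recomputing K/V for each token."""
    if new_tokens <= 0:
        return 0
    return sum(prompt_len + step for step in range(new_tokens))


def kv_projections_with_cache(prompt_len: int, new_tokens: int) -> int:
    """Prefill computes K/V once for the prompt; later steps add one token each."""
    if new_tokens <= 0:
        return 0
    return prompt_len + (new_tokens - 1)


# Hypothetical workload: 10-token prompt, 300 generated tokens.
without_cache = kv_projections_without_cache(10, 300)
with_cache = kv_projections_with_cache(10, 300)
print(without_cache, with_cache)
```

These counts cover only the K/V projections of one layer, so their ratio should not be read as a prediction of wall-clock speedup; the measured figure also depends on everything else the model does per step.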

Key Concepts

Autoregressive generation: tokens are produced one at a time; each step conditions on all previous tokens
Key (K) and Value (V) tensors: intermediate representations computed per-token in the attention mechanism; immutable once computed (previous tokens never change their K/V under causal masking)
Query (Q) tensor: computed only for the current new token at each decode step
KV cache: a data structure that accumulates K and V tensors as tokens are generated, growing by one entry per decode step per layer
Cache hit vs. miss: terminology for whether a needed K/V entry is present; a miss forces recomputation (see NVFP4 KV Cache for cache eviction dynamics)

Key Equation

The scaled dot-product attention formula from Attention Is All You Need:

$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_{k}}}\right) V$

With KV caching, on decode step $t$, Q is $[1, 1, d_{k}]$ (current token only), while K and V are $[1, t, d_{k}]$: the first $t-1$ rows are read from the cache and the row for the current token is appended. Without the cache, K and V have the same $[1, t, d_{k}]$ shape, but all $t$ rows are recomputed from scratch at every step.
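
The shapes above can be checked with a small single-head, single-layer sketch. It uses NumPy rather than PyTorch to stay lightweight, and the random inputs, the projection matrices `w_q`, `w_k`, `w_v`, and the sizes `t = 5`, `d_k = 8` are all illustrative assumptions. It compares the last row of full causal attention against attention computed from one new query and cached K/V.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    shifted = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def attention(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    """softmax(Q K^T / sqrt(d_k)) V for arrays shaped [batch, seq, d_k]."""
    d_k = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)
    return softmax(scores) @ v


rng = np.random.default_rng(0)
t, d_k = 5, 8
x = rng.normal(size=(1, t, d_k))  # random stand-ins for t token embeddings
w_q, w_k, w_v = (rng.normal(size=(d_k, d_k)) for _ in range(3))

# Without a cache: project all t tokens, apply a causal mask, keep the last row.
q_all, k_all, v_all = x @ w_q, x @ w_k, x @ w_v
scores = q_all @ k_all.transpose(0, 2, 1) / np.sqrt(d_k)
future = np.triu(np.ones((t, t), dtype=bool), k=1)  # True above the diagonal
scores = np.where(future, -np.inf, scores)
full_last = (softmax(scores) @ v_all)[:, -1:, :]

# With a cache: K/V for the first t-1 tokens are reused, only the new token is projected.
k_cache, v_cache = k_all[:, :-1, :], v_all[:, :-1, :]
x_new = x[:, -1:, :]
q_new = x_new @ w_q                                   # [1, 1, d_k]
k_t = np.concatenate([k_cache, x_new @ w_k], axis=1)  # [1, t, d_k]
v_t = np.concatenate([v_cache, x_new @ w_v], axis=1)  # [1, t, d_k]
cached_last = attention(q_new, k_t, v_t)

print(q_new.shape, k_t.shape, np.allclose(full_last, cached_last))
```

No mask is needed in the cached path because the single query belongs to the newest token, which is allowed to attend to every cached position.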

Key Claims and Findings

Speedup is real and measurable: ~5.21× faster than standard inference on T4 GPU for 300-token generation
Memory grows linearly with sequence length: per sequence, cache size = 2 × layers × heads × seq_len × head_dim × dtype_bytes, where the factor of 2 covers K and V; this is the primary constraint for long contexts
KV caching is applied per-layer: each transformer block has its own attention heads and therefore its own K/V cache
Only K and V are cached, not Q: Q changes at every decode step (it depends on the current token); K and V for past tokens never change due to causal masking
HuggingFace transformers enables it by default via use_cache=True; multiple cache backends are available via cache_implementation
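
The cache-size formula in the list above can be evaluated directly. The sketch below is a plain transcription of it; the function name and the model dimensions (32 layers, 32 heads, head dimension 128, 2-byte values, 4096 tokens) are hypothetical values chosen for illustration, not figures from the post.

```python
def kv_cache_bytes(layers: int, heads: int, seq_len: int,
                   head_dim: int, dtype_bytes: int) -> int:
    """Bytes held by the KV cache for one sequence; the leading 2 covers K and V."""
    return 2 * layers * heads * seq_len * head_dim * dtype_bytes


# Hypothetical model configuration with 2-byte (16-bit) cache entries.
example = kv_cache_bytes(layers=32, heads=32, seq_len=4096,
                         head_dim=128, dtype_bytes=2)
print(example / 2**30)  # 2.0 (GiB)

# Doubling the sequence length doubles the cache, as the linear formula implies.
doubled = kv_cache_bytes(layers=32, heads=32, seq_len=8192,
                         head_dim=128, dtype_bytes=2)
print(doubled / example)  # 2.0
```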

Terminology

Term	Definition
KV cache	Storage structure accumulating key/value attention tensors across decode steps
Cache eviction	Discarding older K/V entries when memory is exhausted (impacts hit rate)
Prefill phase	Initial forward pass over the full input prompt; populates the cache
Decode phase	Token-by-token generation; fetches from cache and appends new K/V entries
`use_cache`	HuggingFace parameter (default `True`) that enables KV caching in generation

Connections to Existing Wiki Pages

Prerequisite article on transformer tensor shapes: Mastering Tensor Dimensions in Transformers — explains the Q, K, V shape flow this article builds on
Advanced KV cache quantisation: NVFP4 KV Cache — reduces KV cache memory to 4-bit, directly addressing the memory constraint identified here
Source of the attention formula: Attention Is All You Need

Practical Implementation (PyTorch pseudocode)

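import torch

class KVCache:
    """Accumulates key/value tensors along the sequence dimension (dim=1)."""

    def __init__(self):
        self.cache = {"key": None, "value": None}

    def update(self, key, value):
        # key, value: [batch, new_tokens, d_k]
        if self.cache["key"] is None:
            # First call (prefill): store the tensors as-is.
            self.cache["key"] = key
            self.cache["value"] = value
        else:
            # Decode step: append the new entries after the cached ones.
            self.cache["key"] = torch.cat([self.cache["key"], key], dim=1)
            self.cache["value"] = torch.cat([self.cache["value"], value], dim=1)
        # Added for convenience: return the full K and V for use in attention.
        return self.cache["key"], self.cache["value"]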

HuggingFace usage:

output = model.generate(tokens, max_new_tokens=300, use_cache=True)
