KV Caching Explained: Optimizing Transformer Inference Efficiency

Hugging Face blog post by Not Lain, published 2025-01-30. Introductory-level treatment of KV caching mechanics.

Abstract

KV caching is a technique that eliminates redundant attention computation during autoregressive text generation. Without a cache, generating each new token requires recomputing the key and value projections for every preceding token, so the total work over a generation grows quadratically with sequence length. KV caching avoids this by storing the key and value tensors when they are first computed, so each subsequent decode step computes Q, K, and V only for the single new token and fetches the rest from the cache. The result is substantially faster generation, particularly for long sequences, at the cost of increased memory usage. A benchmark on a T4 GPU showed a roughly 5× speedup over standard inference for a 300-token generation task.

Key Concepts

  • Autoregressive generation: tokens are produced one at a time; each step conditions on all previous tokens
  • Key (K) and Value (V) tensors: intermediate representations computed per-token in the attention mechanism; immutable once computed (previous tokens never change their K/V under causal masking)
  • Query (Q) tensor: computed only for the current new token at each decode step
  • KV cache: a data structure that accumulates K and V tensors as tokens are generated, growing by one entry per decode step per layer
  • Cache hit vs. miss: terminology for whether a needed K/V entry is present; a miss forces recomputation (see NVFP4 KV Cache for cache eviction dynamics)
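
To make the redundancy concrete, the following is a minimal counting sketch rather than a measurement. It assumes a single layer and a simplified cost model in which the only work counted is the number of per-token K/V projections; the function name and the prompt length of 10 are hypothetical, chosen purely for illustration.

```python
def kv_projections_computed(prompt_len: int, new_tokens: int, use_cache: bool) -> int:
    """Count per-token K/V projections for one layer (simplified cost model)."""
    total = 0
    for step in range(new_tokens):
        # Tokens the model must attend over at this step:
        # the prompt plus everything generated so far.
        context_len = prompt_len + step
        if use_cache and step > 0:
            total += 1            # only the newest token is projected
        else:
            total += context_len  # prefill, or full recomputation without a cache
    return total


# Hypothetical setup: 10-token prompt, 300 generated tokens.
without_cache = kv_projections_computed(prompt_len=10, new_tokens=300, use_cache=False)
with_cache = kv_projections_computed(prompt_len=10, new_tokens=300, use_cache=True)
print(without_cache, with_cache)  # 47850 309
```

The count grows quadratically without the cache and linearly with it. It ignores the attention scores themselves and every other cost in the forward pass, so it should not be read as a prediction of wall-clock speedup.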

Key Equation

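The scaled dot-product attention formula from Attention Is All You Need:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

With KV caching, on decode step $t$: Q is $1 \times d_k$ (current token only), while K and V are $t \times d_k$ (retrieved from cache). Without the cache, K and V would each be $t \times d_k$ and recomputed from scratch each step.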
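
The claim that cached and recomputed K/V give the same attention output can be checked with a small dependency-free sketch. This is an illustration under stated assumptions, not the blog's code: it uses a single layer and a single head, random toy embeddings, and hypothetical helper names (`matvec`, `attend`, `rand_matrix`).

```python
import math
import random


def matvec(x, W):
    """Project row vector x (length d_in) with a d_in x d_out weight matrix."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def attend(q, K, V):
    """Compute softmax(q K^T / sqrt(d_k)) V for a single query vector q."""
    d_k = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return [sum(w / z * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]


def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
d = 4
W_q, W_k, W_v = rand_matrix(d, d), rand_matrix(d, d), rand_matrix(d, d)
tokens = rand_matrix(5, d)  # five toy token embeddings

K_cache, V_cache = [], []
max_diff = 0.0
for t, x in enumerate(tokens, start=1):
    q = matvec(x, W_q)  # Q is computed for the current token only

    # With a cache: project only the new token and append its K/V.
    K_cache.append(matvec(x, W_k))
    V_cache.append(matvec(x, W_v))
    out_cached = attend(q, K_cache, V_cache)

    # Without a cache: recompute K and V for all t tokens seen so far.
    K_full = [matvec(tok, W_k) for tok in tokens[:t]]
    V_full = [matvec(tok, W_v) for tok in tokens[:t]]
    out_full = attend(q, K_full, V_full)

    diff = max(abs(a - b) for a, b in zip(out_cached, out_full))
    max_diff = max(max_diff, diff)

print(len(K_cache), max_diff)
```

Both paths produce the same output at every step, while the cached path performs one K/V projection per step instead of `t`. In a multi-layer model the same reasoning applies at each layer, since causal masking keeps the earlier tokens' hidden states unchanged.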

Key Claims and Findings

  • Speedup is real and measurable: ~5.21× faster than standard inference on T4 GPU for 300-token generation
  • Memory grows linearly with sequence length: cache size = 2 × layers × heads × seq_len × head_dim × dtype_bytes; this is the primary constraint for long contexts
  • KV caching is applied per-layer: each transformer block has its own attention heads and therefore its own K/V cache
  • Only K and V are cached, not Q: Q changes at every decode step (it depends on the current token); K and V for past tokens never change due to causal masking
  • HuggingFace transformers enables it by default via use_cache=True; multiple cache backends are available via cache_implementation
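
The memory formula above can be turned into a quick estimate. The sketch below is a worked example under assumptions: the model dimensions (32 layers, 32 heads, head dimension 128, 2-byte dtype) are hypothetical values chosen for round numbers, and the `batch_size` factor is an addition that is not part of the formula as stated.

```python
def kv_cache_bytes(layers, heads, seq_len, head_dim, dtype_bytes, batch_size=1):
    """Estimate KV cache size in bytes.

    The leading 2 accounts for storing both K and V.
    batch_size is an assumed extra factor, not part of the source formula.
    """
    return 2 * layers * heads * seq_len * head_dim * dtype_bytes * batch_size


# Hypothetical configuration: 32 layers, 32 heads, head_dim 128, 16-bit values.
size = kv_cache_bytes(layers=32, heads=32, seq_len=4096, head_dim=128, dtype_bytes=2)
print(size / 1024**3)  # 2.0 (GiB)
```

Doubling `seq_len` doubles the result, which is the linear growth that makes the cache the main memory constraint for long contexts.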

Terminology

| Term | Definition |
|---|---|
| KV cache | Storage structure accumulating key/value attention tensors across decode steps |
| Cache eviction | Discarding older K/V entries when memory is exhausted (impacts hit rate) |
| Prefill phase | Initial forward pass over the full input prompt; populates the cache |
| Decode phase | Token-by-token generation; fetches from cache and appends new K/V entries |
| use_cache | HuggingFace parameter (default True) that enables KV caching in generation |

Connections to Existing Wiki Pages

Practical Implementation (PyTorch pseudocode)

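import torch

class KVCache:
    """Minimal per-layer cache that grows along the sequence dimension."""

    def __init__(self):
        # Empty until the first update() call (the prefill step)
        self.cache = {"key": None, "value": None}

    def update(self, key, value):
        # Assumes key/value are shaped (batch, seq_len, dim), so dim=1 is the sequence axis
        if self.cache["key"] is None:
            self.cache["key"] = key
            self.cache["value"] = value
        else:
            # Append the new token's K/V to the entries already stored
            self.cache["key"] = torch.cat([self.cache["key"], key], dim=1)
            self.cache["value"] = torch.cat([self.cache["value"], value], dim=1)
        # Return the full K/V so the attention step can use them directly
        return self.cache["key"], self.cache["value"]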

HuggingFace usage:

output = model.generate(tokens, max_new_tokens=300, use_cache=True)