KV Caching Explained: Optimizing Transformer Inference Efficiency
Hugging Face blog post by Not Lain, published 2025-01-30. Introductory-level treatment of KV caching mechanics.
Abstract
KV caching is a technique that eliminates redundant attention computation during autoregressive text generation. In a standard transformer, generating each new token requires recomputing key and value projections for every preceding token — work that grows quadratically with sequence length. KV caching breaks this by storing those intermediate key and value tensors after they are first computed, so each subsequent decode step only needs to compute Q, K, V for the single new token and fetch the rest from the cache. The result is dramatically faster generation, particularly for long sequences, at the cost of increased memory usage. A benchmark on a T4 GPU demonstrated roughly 5× speedup over standard inference for a 300-token generation task.
Key Concepts
- Autoregressive generation: tokens are produced one at a time; each step conditions on all previous tokens
- Key (K) and Value (V) tensors: intermediate representations computed per-token in the attention mechanism; immutable once computed (previous tokens never change their K/V under causal masking)
- Query (Q) tensor: computed only for the current new token at each decode step
- KV cache: a data structure that accumulates K and V tensors as tokens are generated, growing by one entry per decode step per layer
- Cache hit vs. miss: terminology for whether a needed K/V entry is present; a miss forces recomputation (see NVFP4 KV Cache for cache eviction dynamics)
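The immutability point above can be checked numerically. The following is a minimal sketch, assuming a single attention head with random weights and using NumPy rather than PyTorch to stay dependency-light; the function name `causal_self_attention` and all sizes are illustrative, not taken from the blog post. It shows that appending a token does not change the attention outputs of earlier positions under causal masking, which is why their K and V can be stored once and reused.

```python
import numpy as np

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head causal self-attention over x of shape (seq_len, d_model)."""
    queries, keys, values = x @ w_q, x @ w_k, x @ w_v
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    # Mask out future positions: row i may only attend to columns <= i.
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values

rng = np.random.default_rng(0)
d_model = 8  # illustrative size
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

prefix = rng.normal(size=(5, d_model))                          # 5 tokens so far
extended = np.vstack([prefix, rng.normal(size=(1, d_model))])   # one token appended

out_prefix = causal_self_attention(prefix, w_q, w_k, w_v)
out_extended = causal_self_attention(extended, w_q, w_k, w_v)

# The first 5 rows are unaffected by the appended token, so anything derived
# from them in later layers (including their K and V) is safe to cache.
prefix_unchanged = np.allclose(out_prefix, out_extended[:5])
```

Without the mask, the earlier rows would attend to the new token and their outputs would shift, so the cache would become stale after every step.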
Key Equation
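The scaled dot-product attention formula from Attention Is All You Need:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$
With KV caching, on decode step $t$: Q is $1 \times d_k$ (current token only), while K and V are $t \times d_k$ (retrieved from cache). Without the cache, K and V would each be $t \times d_k$ and recomputed from scratch each step.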
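To make the shapes concrete, here is a small sketch of a decode loop for one attention head. It assumes random projection weights and NumPy arrays (not the blog's PyTorch code), and the names `k_cache`, `v_cache`, and the sizes are illustrative. At each step only the new token is projected; its key and value are appended to the cache, and the result is compared against recomputing K and V for every token seen so far.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    return softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v

rng = np.random.default_rng(0)
d_model, d_k, t = 16, 8, 6  # illustrative sizes
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
x = rng.normal(size=(t, d_model))  # embeddings of tokens 1..t

k_cache = np.empty((0, d_k))
v_cache = np.empty((0, d_k))
all_match = True

for step in range(t):
    x_new = x[step:step + 1]                      # (1, d_model): current token only
    q_new = x_new @ w_q                           # (1, d_k)
    k_cache = np.vstack([k_cache, x_new @ w_k])   # grows to (step + 1, d_k)
    v_cache = np.vstack([v_cache, x_new @ w_v])
    cached_out = attention(q_new, k_cache, v_cache)

    # Reference: recompute K and V for every token seen so far.
    seen = x[:step + 1]
    recomputed_out = attention(q_new, seen @ w_k, seen @ w_v)
    all_match = all_match and np.allclose(cached_out, recomputed_out)
```

The cached path performs one row of projection work per step, while the reference path repeats the projection for all earlier tokens every time.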
Key Claims and Findings
- Speedup is real and measurable: ~5.21× faster than standard inference on T4 GPU for 300-token generation
- Memory grows linearly with sequence length: cache size = `2 × layers × heads × seq_len × head_dim × dtype_bytes`; this is the primary constraint for long contexts
- KV caching is applied per-layer: each transformer block has its own attention heads and therefore its own K/V cache
- Only K and V are cached, not Q: Q changes at every decode step (it depends on the current token); K and V for past tokens never change due to causal masking
- HuggingFace `transformers` enables it by default via `use_cache=True`; multiple cache backends are available via `cache_implementation`
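The cache-size formula in the list above is easy to evaluate directly. The sketch below uses a hypothetical model configuration (32 layers, 32 heads, head dimension 128, 2-byte fp16 values) chosen only for round numbers; it is not a figure from the blog post. The `batch_size` parameter is an added assumption, since the formula as stated covers a single sequence.

```python
def kv_cache_bytes(layers, heads, seq_len, head_dim, dtype_bytes, batch_size=1):
    """Bytes needed to hold the K and V tensors for every layer."""
    # The factor 2 accounts for storing one tensor for keys and one for values.
    return 2 * batch_size * layers * heads * seq_len * head_dim * dtype_bytes

# Hypothetical configuration, for illustration only.
size = kv_cache_bytes(layers=32, heads=32, seq_len=4096, head_dim=128, dtype_bytes=2)
size_gib = size / 2**30

# Doubling the sequence length doubles the cache: growth is linear in seq_len.
size_doubled = kv_cache_bytes(layers=32, heads=32, seq_len=8192, head_dim=128, dtype_bytes=2)
```

Under these assumed numbers a single 4096-token sequence needs 2 GiB of cache, which illustrates why memory, not compute, becomes the limit for long contexts.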
Terminology
| Term | Definition |
|---|---|
| KV cache | Storage structure accumulating key/value attention tensors across decode steps |
| Cache eviction | Discarding older K/V entries when memory is exhausted (impacts hit rate) |
| Prefill phase | Initial forward pass over the full input prompt; populates the cache |
| Decode phase | Token-by-token generation; fetches from cache and appends new K/V entries |
| `use_cache` | HuggingFace parameter (default True) that enables KV caching in generation |
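The prefill, decode, and eviction terms can be tied together with a toy example. This is a sketch under stated assumptions: the class `SlidingWindowKVCache` is hypothetical, it uses strings as stand-ins for tensors, and dropping the oldest entry is just one possible eviction policy, not one the blog post prescribes.

```python
from collections import deque

class SlidingWindowKVCache:
    """Toy per-layer cache holding at most max_entries (key, value) pairs."""

    def __init__(self, max_entries):
        # A deque with maxlen drops its oldest item when a new one is appended.
        self.entries = deque(maxlen=max_entries)
        self.evicted = 0

    def append(self, key, value):
        if len(self.entries) == self.entries.maxlen:
            self.evicted += 1  # the oldest entry is about to be discarded
        self.entries.append((key, value))

    def prefill(self, keys, values):
        # Prefill phase: store K/V for every prompt token in one pass.
        for key, value in zip(keys, values):
            self.append(key, value)

cache = SlidingWindowKVCache(max_entries=4)

# Prefill: a 3-token prompt populates the cache.
cache.prefill(keys=["k0", "k1", "k2"], values=["v0", "v1", "v2"])
len_after_prefill = len(cache.entries)

# Decode: each generated token appends exactly one new entry.
for step in range(3, 6):
    cache.append(f"k{step}", f"v{step}")

cached_keys = [key for key, _ in cache.entries]
```

After three decode steps the cache holds only the four most recent entries; the two oldest were evicted, so attention over them would require recomputation or would simply be lost.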
Connections to Existing Wiki Pages
- Prerequisite article on transformer tensor shapes: Mastering Tensor Dimensions in Transformers — explains the Q, K, V shape flow this article builds on
- Advanced KV cache quantisation: NVFP4 KV Cache — reduces KV cache memory to 4-bit, directly addressing the memory constraint identified here
- Source of the attention formula: Attention Is All You Need
Practical Implementation (PyTorch pseudocode)
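```python
import torch


class KVCache:
    """Minimal per-layer cache for key and value tensors."""

    def __init__(self):
        self.cache = {"key": None, "value": None}

    def update(self, key, value):
        # Assumes tensors shaped (batch, seq_len, ...), so dim=1 is the sequence axis.
        if self.cache["key"] is None:
            # First call (prefill): store the tensors as-is.
            self.cache["key"] = key
            self.cache["value"] = value
        else:
            # Later calls (decode): append the new token's K/V along the sequence axis.
            self.cache["key"] = torch.cat([self.cache["key"], key], dim=1)
            self.cache["value"] = torch.cat([self.cache["value"], value], dim=1)
```

HuggingFace usage:

```python
# Assumes `model` is a loaded causal LM and `tokens` holds the prompt's input IDs.
output = model.generate(tokens, max_new_tokens=300, use_cache=True)
```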