Optimizing Inference for Long Context and Large Batch Sizes with NVFP4 KV Cache

NVIDIA Developer Blog by Eduardo Alvarez, published 2025-12-08. Production-grade treatment of KV cache quantisation on Blackwell GPUs.

Abstract

This article introduces NVFP4 KV cache quantisation — a method to compress KV cache from 16-bit (or 8-bit FP8) down to 4-bit, implemented via the NVIDIA TensorRT Model Optimizer on Blackwell GPUs. By halving the KV cache memory footprint relative to FP8, NVFP4 effectively doubles the available context budget, enabling larger batch sizes, longer sequences, and higher cache-hit rates. Accuracy loss is under 1% vs. FP16/BF16 baselines across code-generation, knowledge, and long-context benchmarks, and NVFP4 outperforms the competing MXFP4 format by ~5% accuracy due to more granular block scaling. The TTFT (time-to-first-token) latency gain can reach up to 3× over FP8 KV cache at high cache memory budgets, driven by higher hit rates and fewer recomputation stalls during prefill. This piece connects to the fundamentals of KV caching explained in KV Caching Explained.

Key Concepts

  • NVFP4: NVIDIA’s proprietary 4-bit floating-point format; uses E4M3 FP8 scaling factors at fine-grained block granularity; distinct from MXFP4 (Microscaling FP4) which uses coarser block scaling
  • KV cache quantisation: reducing precision of stored K/V tensors to cut HBM footprint; dequantisation back to FP8 is required before each attention computation
  • Prefill phase: model ingests the full input prompt in parallel; produces K/V for all input tokens and stores them; dominated by large MMA operations — highly compute-bound
  • Decode phase: generates one token at a time; fetches K/V from cache for all prior tokens; memory-bandwidth-bound — smaller cache directly reduces bandwidth pressure
  • Cache-hit rate: fraction of K/V lookups satisfied from the cache without recomputation; higher hit rate → lower TTFT and higher throughput
  • Cache eviction: when the fixed KV cache memory pool fills, older entries are evicted; evicted entries cause cache misses and forced recomputation on future requests
  • HBM budget: total High-Bandwidth Memory on the GPU must be shared between model weights, KV cache, and activations; compressing KV cache frees headroom for weights or larger batches
  • Wide Expert Parallelism (Wide-EP): NVIDIA technique for MoE models that distributes experts across GPUs; NVFP4 KV cache stacks with Wide-EP to improve utilisation across large MoE deployments
  • NVIDIA Dynamo: inference routing framework that integrates KV-aware request scheduling and KV cache offloading

Key Claims and Findings

  • 50% memory reduction vs. FP8 KV cache: NVFP4 stores 4 bits per K/V element vs. 8 bits for FP8, halving cache footprint
  • 2× context budget: for a given HBM allocation, NVFP4 can hold twice as many K/V tokens as FP8
  • Up to 3× TTFT improvement: at high cache-memory budgets, NVFP4’s higher hit rate means the prefill stage rarely needs to recompute evicted K/V tensors — benchmark: Qwen3-Coder-480B-A35B
  • ~20% higher cache-hit rate vs. FP8: larger effective cache (2× context) retains more previously processed spans, reducing misses
  • <1% accuracy loss vs. BF16/FP8: benchmarks LiveCodeBench, MMLU-PRO, MBPP, Ruler 64K — near parity at 64K context length (Ruler 64K: FP16 95.6%, NVFP4 94.6%)
  • NVFP4 vs. MXFP4: ~5% better MMLU accuracy (Llama 3.3 70B) due to more granular block scaling and higher-precision E4M3 FP8 scaling factors
  • Dequantisation required before MMA: K/V tensors are stored as NVFP4 and dequantised to FP8 before the attention matrix multiplication; new K/V are quantised back to NVFP4 before appending to the cache
  • Plateau effect: as KV cache memory grows, FP8 and NVFP4 hit-rate curves converge; benefit is largest in the memory-constrained regime

Terminology

TermDefinition
NVFP4NVIDIA 4-bit FP format with E4M3 FP8 block scaling; higher accuracy than MXFP4 at same bit-width
MXFP4Microscaling FP4; coarser block scaling → ~5% lower accuracy vs. NVFP4 on MMLU
PTQPost-Training Quantisation; no retraining required
QATQuantisation-Aware Training; fine-tunes to recover accuracy after quantisation
TTFTTime to First Token; key prefill latency metric for user-perceived responsiveness
MMAMatrix Multiply-Accumulate; the core compute operation in attention
HBMHigh-Bandwidth Memory; on-package DRAM on datacenter GPUs (H100, B100, B200)
Wide-EPWide Expert Parallelism; distributes MoE experts across many GPUs on NVL72 racks
TensorRT Model OptimizerNVIDIA tool (modelopt) for PTQ/QAT workflows; provides mtq.quantize() API

Implementation (TensorRT Model Optimizer)

import modelopt.torch.quantization as mtq
 
# FP8 weights/activations + NVFP4 KV cache
quant_cfg = mtq.FP8_DEFAULT_CFG
quant_cfg["quant_cfg"].update(mtq.NVFP4_KV_CFG["quant_cfg"])
 
def forward_loop(model):
    for data in calib_set:
        model(data)
 
model = mtq.quantize(model, quant_cfg, forward_loop)
# Optional: continue with QAT for accuracy recovery
train(model, train_loader, optimizer, scheduler, ...)

To also compress weights to NVFP4 (for 4-bit math), replace quant_cfg with mtq.NVFP4_DEFAULT_CFG.

Connections to Existing Wiki Pages