Optimizing Inference for Long Context and Large Batch Sizes with NVFP4 KV Cache

NVIDIA Developer Blog by Eduardo Alvarez, published 2025-12-08. Production-grade treatment of KV cache quantisation on Blackwell GPUs.

Abstract

This article introduces NVFP4 KV cache quantisation, a method that compresses the KV cache from 16-bit (or 8-bit FP8) down to 4-bit precision and is implemented via the NVIDIA TensorRT Model Optimizer on Blackwell GPUs. By halving the KV cache memory footprint relative to FP8, NVFP4 doubles the context budget for a given memory allocation, enabling larger batch sizes, longer sequences, and higher cache-hit rates. Accuracy loss is under 1% vs. FP16/BF16 baselines across code-generation, knowledge, and long-context benchmarks, and NVFP4 scores ~5% higher than the competing MXFP4 format on MMLU (Llama 3.3 70B), owing to more granular block scaling. TTFT (time-to-first-token) latency improves by up to 3× over FP8 KV cache at high cache memory budgets, driven by higher hit rates and fewer recomputation stalls during prefill. The underlying caching mechanics are covered in KV Caching Explained.

Key Concepts

NVFP4: NVIDIA’s proprietary 4-bit floating-point format; uses E4M3 FP8 scaling factors at fine-grained block granularity; distinct from MXFP4 (Microscaling FP4) which uses coarser block scaling
KV cache quantisation: reducing precision of stored K/V tensors to cut HBM footprint; dequantisation back to FP8 is required before each attention computation
Prefill phase: model ingests the full input prompt in parallel; produces K/V for all input tokens and stores them; dominated by large MMA operations — highly compute-bound
Decode phase: generates one token at a time; fetches K/V from cache for all prior tokens; memory-bandwidth-bound — smaller cache directly reduces bandwidth pressure
Cache-hit rate: fraction of K/V lookups satisfied from the cache without recomputation; higher hit rate → lower TTFT and higher throughput
Cache eviction: when the fixed KV cache memory pool fills, older entries are evicted; evicted entries cause cache misses and forced recomputation on future requests
HBM budget: total High-Bandwidth Memory on the GPU must be shared between model weights, KV cache, and activations; compressing KV cache frees headroom for weights or larger batches
Wide Expert Parallelism (Wide-EP): NVIDIA technique for MoE models that distributes experts across GPUs; NVFP4 KV cache stacks with Wide-EP to improve utilisation across large MoE deployments
NVIDIA Dynamo: inference routing framework that integrates KV-aware request scheduling and KV cache offloading
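
The link between cache size, eviction, and hit rate described above can be illustrated with a toy simulation. This is a minimal sketch, assuming an LRU eviction policy and a synthetic, skewed request trace; neither is taken from the article, and `hit_rate`, the trace, and the capacities are hypothetical names and values chosen for illustration.

```python
from collections import OrderedDict
import random


def hit_rate(capacity_blocks, requests):
    """Replay block lookups against an LRU cache and return the hit fraction."""
    cache = OrderedDict()
    hits = 0
    for block_id in requests:
        if block_id in cache:
            hits += 1
            cache.move_to_end(block_id)      # mark as most recently used
        else:
            cache[block_id] = True           # miss: K/V is recomputed, then stored
            if len(cache) > capacity_blocks:
                cache.popitem(last=False)    # evict the least recently used block
    return hits / len(requests)


rng = random.Random(0)
# Hypothetical workload: 2,000 lookups over at most 200 distinct prompt blocks,
# skewed so that a few blocks (for example shared system prompts) recur often.
requests = [min(int(rng.expovariate(1 / 40)), 199) for _ in range(2000)]

for fp8_capacity in (10, 40, 160):
    nvfp4_capacity = 2 * fp8_capacity        # same memory, half the bytes per block
    print(fp8_capacity,
          round(hit_rate(fp8_capacity, requests), 3),
          round(hit_rate(nvfp4_capacity, requests), 3))
```

Because LRU never loses hits when capacity grows, the doubled cache always does at least as well on the same trace, and the two columns meet once both capacities cover the whole working set.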

Key Claims and Findings

50% memory reduction vs. FP8 KV cache: NVFP4 stores 4 bits per K/V element vs. 8 bits for FP8, halving cache footprint
2× context budget: for a given HBM allocation, NVFP4 can hold twice as many K/V tokens as FP8
Up to 3× TTFT improvement: at high cache-memory budgets, NVFP4’s higher hit rate means the prefill stage rarely needs to recompute evicted K/V tensors — benchmark: Qwen3-Coder-480B-A35B
~20% higher cache-hit rate vs. FP8: larger effective cache (2× context) retains more previously processed spans, reducing misses
<1% accuracy loss vs. BF16/FP8: benchmarks LiveCodeBench, MMLU-PRO, MBPP, Ruler 64K — near parity at 64K context length (Ruler 64K: FP16 95.6%, NVFP4 94.6%)
NVFP4 vs. MXFP4: ~5% better MMLU accuracy (Llama 3.3 70B) due to more granular block scaling and higher-precision E4M3 FP8 scaling factors
Dequantisation required before MMA: K/V tensors are stored as NVFP4 and dequantised to FP8 before the attention matrix multiplication; newly computed K/V are quantised to NVFP4 before being appended to the cache
Plateau effect: as KV cache memory grows, FP8 and NVFP4 hit-rate curves converge; benefit is largest in the memory-constrained regime
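
The 50% memory and 2× context claims above follow from simple arithmetic on bits per stored element. The sketch below works through it for a hypothetical model shape (80 layers, 8 KV heads, head dimension 128) and a hypothetical 40 GiB cache budget; none of these numbers come from the article. It also ignores the per-block scaling factors, as the 50% figure does, so the real ratio may be slightly below 2×.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bits_per_element):
    """Bytes of K and V stored for one token across all layers."""
    elements = 2 * num_layers * num_kv_heads * head_dim   # 2 = one K and one V
    return elements * bits_per_element / 8


def tokens_that_fit(budget_gib, **model_shape):
    """Number of tokens whose K/V fit in a cache budget given in GiB."""
    return int(budget_gib * 2**30 // kv_bytes_per_token(**model_shape))


# Hypothetical model shape, chosen only for illustration.
shape = dict(num_layers=80, num_kv_heads=8, head_dim=128)

for name, bits in (("FP16", 16), ("FP8", 8), ("NVFP4", 4)):
    per_token = kv_bytes_per_token(bits_per_element=bits, **shape)
    capacity = tokens_that_fit(40, bits_per_element=bits, **shape)
    print(f"{name}: {per_token / 1024:.0f} KiB per token, {capacity} tokens in 40 GiB")
```

Under these assumptions each halving of the element width halves the bytes per token and doubles the number of tokens that fit in the same budget.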

Terminology

Term	Definition
NVFP4	NVIDIA 4-bit FP format with E4M3 FP8 block scaling; higher accuracy than MXFP4 at same bit-width
MXFP4	Microscaling FP4; coarser block scaling → ~5% lower accuracy vs. NVFP4 on MMLU
PTQ	Post-Training Quantisation; no retraining required
QAT	Quantisation-Aware Training; fine-tunes to recover accuracy after quantisation
TTFT	Time to First Token; key prefill latency metric for user-perceived responsiveness
MMA	Matrix Multiply-Accumulate; the core compute operation in attention
HBM	High-Bandwidth Memory; on-package DRAM on datacenter GPUs (H100, B100, B200)
Wide-EP	Wide Expert Parallelism; distributes MoE experts across many GPUs on NVL72 racks
TensorRT Model Optimizer	NVIDIA tool (`modelopt`) for PTQ/QAT workflows; provides `mtq.quantize()` API
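
The NVFP4 and MXFP4 rows above differ in how block scales are chosen, which the following toy comparison tries to make concrete. It is a simplified sketch under several assumptions that this summary does not state: 16-element blocks for the NVFP4-like case, 32-element blocks with a power-of-two scale for the MXFP4-like case, and an exact floating-point scale standing in for the E4M3 FP8 scale. The function names are hypothetical, and ties in rounding simply go to the smaller magnitude.

```python
import math

# Magnitudes representable by a 4-bit E2M1 float (sign handled separately).
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
FP4_MAX = FP4_GRID[-1]


def round_to_fp4(x):
    """Round x to the nearest signed value on the FP4 grid."""
    nearest = min(FP4_GRID, key=lambda g: abs(abs(x) - g))
    return math.copysign(nearest, x)


def fake_quantise(values, block_size, power_of_two_scale):
    """Quantise block by block and return the dequantised values."""
    out = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block)
        scale = amax / FP4_MAX if amax > 0 else 1.0       # map block max to 6.0
        if power_of_two_scale:
            scale = 2.0 ** math.ceil(math.log2(scale))    # round scale up to 2^k
        out.extend(round_to_fp4(v / scale) * scale for v in block)
    return out


def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# 31 small values plus one outlier: the outlier sets the scale of its block.
values = [0.25] * 31 + [12.0]
fine = fake_quantise(values, block_size=16, power_of_two_scale=False)
coarse = fake_quantise(values, block_size=32, power_of_two_scale=True)
print("16-element blocks:", mean_squared_error(values, fine))
print("32-element blocks:", mean_squared_error(values, coarse))
```

With the smaller block, the outlier only flattens the 15 values that share its block; with the larger block, all 31 small values are rounded to zero.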

Implementation (TensorRT Model Optimizer)

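import copy

import modelopt.torch.quantization as mtq

# FP8 weights/activations + NVFP4 KV cache.
# Deep-copy the preset so the update below does not mutate the module-level default.
quant_cfg = copy.deepcopy(mtq.FP8_DEFAULT_CFG)
quant_cfg["quant_cfg"].update(mtq.NVFP4_KV_CFG["quant_cfg"])

def forward_loop(model):
    # Calibration pass: run representative batches through the model so the
    # quantizers can collect statistics. calib_set is the user's own data.
    for data in calib_set:
        model(data)

# model is the user's already-loaded PyTorch model
model = mtq.quantize(model, quant_cfg, forward_loop)
# Optional: continue with QAT for accuracy recovery
# (train and its arguments stand for the user's own training loop)
train(model, train_loader, optimizer, scheduler, ...)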

To also compress weights to NVFP4 (for 4-bit math), use mtq.NVFP4_DEFAULT_CFG as the base config in place of mtq.FP8_DEFAULT_CFG.
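
The snippet above layers the KV-cache entries onto a base config with `dict.update`. The sketch below uses plain dictionaries as stand-ins for the library presets to show why copying the base first keeps the shared default intact. The dictionaries, their keys, and `merge_kv_cfg` are hypothetical and do not reflect the real contents of the modelopt configs.

```python
import copy

# Hypothetical stand-ins for the library's preset configs; the real presets
# contain different keys and values.
BASE_CFG = {"quant_cfg": {"example_weight_rule": {"bits": 8}}, "algorithm": "max"}
KV_CFG = {"quant_cfg": {"example_kv_rule": {"bits": 4}}}


def merge_kv_cfg(base_cfg, kv_cfg):
    """Return a new config with kv_cfg's entries layered onto base_cfg."""
    merged = copy.deepcopy(base_cfg)              # leave the shared preset untouched
    merged["quant_cfg"].update(kv_cfg["quant_cfg"])
    return merged


merged = merge_kv_cfg(BASE_CFG, KV_CFG)
print(sorted(merged["quant_cfg"]))     # both rules are present in the merged copy
print(sorted(BASE_CFG["quant_cfg"]))   # the base preset still has only its own rule
```

Swapping the base for a different preset then only changes the first argument, and repeated calls in the same process do not accumulate entries in the shared default.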

Connections to Existing Wiki Pages

KV caching fundamentals: KV Caching Explained — foundational mechanics this article optimises with quantisation
Transformer tensor shapes and K/V structure: Mastering Tensor Dimensions in Transformers — explains why K/V tensors have the shapes they do
TensorRT-LLM performance analysis: Performance Analysis — TensorRT LLM
Triton + TensorRT-LLM at scale: Scaling LLMs with NVIDIA Triton and TensorRT-LLM Using Kubernetes
Original attention architecture: Attention Is All You Need
