Low-Latency LLM Inference

May 27, 2026

Techniques and architectures for minimising latency in LLM inference: prefill/decode disaggregation, speculative decoding, continuous batching, KV-cache management, paged attention, memory-bandwidth optimisation, and the serving frameworks (vLLM, TensorRT-LLM, Triton Inference Server) that tie these together in production.
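As a rough illustration of why KV-cache management and memory bandwidth feature in the list above, the cache footprint can be estimated with simple arithmetic: one key tensor and one value tensor per layer, each holding one head_dim-sized vector per KV head per token. The sketch below is only an estimate under stated assumptions. The function name and the model shape (32 layers, 8 KV heads, head dimension 128) are hypothetical and do not come from any of the ingested material, and the formula ignores framework overhead such as paging metadata or quantisation scale factors.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem):
    """Estimate KV-cache size in bytes.

    The leading factor of 2 accounts for storing both keys and values.
    Overheads (block tables, scale factors for quantised formats) are ignored.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


if __name__ == "__main__":
    # Hypothetical model shape, chosen only for illustration.
    size = kv_cache_bytes(
        num_layers=32,
        num_kv_heads=8,
        head_dim=128,
        seq_len=4096,
        batch_size=1,
        bytes_per_elem=2,  # 16-bit elements
    )
    print(f"{size / 2**20:.0f} MiB")  # 512 MiB
```

Because the estimate is linear in sequence length and batch size, long contexts and large batches grow the cache quickly, which is where paging and lower-precision cache formats become relevant.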

Ingested Material

Mastering Tensor Dimensions in Transformers
KV Caching Explained: Optimizing Transformer Inference Efficiency
Optimizing Inference for Long Context and Large Batch Sizes with NVFP4 KV Cache

3 items under this folder.

May 27, 2026 - KV Caching Explained: Optimizing Transformer Inference Efficiency
May 27, 2026 - Mastering Tensor Dimensions in Transformers
May 27, 2026 - Optimizing Inference for Long Context and Large Batch Sizes with NVFP4 KV Cache
