Low-Latency LLM Inference
Techniques and architectures for minimising latency in LLM inference: prefill/decode disaggregation, speculative decoding, continuous batching, KV-cache management, paged attention, memory-bandwidth optimisation, and the serving frameworks (vLLM, TensorRT-LLM, Triton Inference Server) that tie these together in production.
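
To make the KV-cache management and paged attention items concrete, here is a minimal Python sketch of the underlying bookkeeping. It is an illustration under stated assumptions, not code from vLLM, TensorRT-LLM, or Triton Inference Server. The names BlockAllocator, Sequence, append_token, and kv_bytes_per_token, and the block size of 16 tokens, are hypothetical choices made for this example. The idea shown is that each sequence holds a block table mapping its logical token positions to fixed-size physical cache blocks, so memory is claimed one block at a time instead of being reserved up front for the maximum sequence length.

```python
from dataclasses import dataclass, field
from typing import List

BLOCK_SIZE = 16  # tokens per KV block; an assumed value for illustration


def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies (keys plus values, all layers)."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


@dataclass
class BlockAllocator:
    """Hands out physical block ids from a fixed pool (hypothetical helper)."""
    num_blocks: int
    free: List[int] = field(init=False)

    def __post_init__(self) -> None:
        self.free = list(range(self.num_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks")
        return self.free.pop()

    def release(self, blocks: List[int]) -> None:
        # Return a finished sequence's blocks to the pool for reuse.
        self.free.extend(blocks)


@dataclass
class Sequence:
    """Per-request state: logical token count and its block table."""
    block_table: List[int] = field(default_factory=list)
    num_tokens: int = 0


def append_token(seq: Sequence, alloc: BlockAllocator) -> None:
    # A new physical block is needed only when all current blocks are full.
    if seq.num_tokens % BLOCK_SIZE == 0:
        seq.block_table.append(alloc.allocate())
    seq.num_tokens += 1


if __name__ == "__main__":
    alloc = BlockAllocator(num_blocks=4)
    seq = Sequence()
    for _ in range(17):  # 17 tokens span two 16-token blocks
        append_token(seq, alloc)
    print(len(seq.block_table), len(alloc.free))
    alloc.release(seq.block_table)
```

In this sketch a sequence wastes at most one partially filled block, which is the property that lets a scheduler pack more concurrent requests into the same cache pool. The per-token size helper gives a rough way to estimate how many blocks a given memory budget can hold.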