Personal Wiki

Low-Latency LLM Inference

May 27, 2026

Techniques and architectures for minimising latency in LLM inference: prefill/decode disaggregation, speculative decoding, continuous batching, KV-cache management, paged attention, memory-bandwidth optimisation, and the serving frameworks (vLLM, TensorRT-LLM, Triton Inference Server) that tie these together in production.

Ingested Material

  • Mastering Tensor Dimensions in Transformers
  • KV Caching Explained: Optimizing Transformer Inference Efficiency
  • Optimizing Inference for Long Context and Large Batch Sizes with NVFP4 KV Cache

3 items under this folder.

  • May 27, 2026

    KV Caching Explained: Optimizing Transformer Inference Efficiency

    • kv-cache
    • inference-optimization
    • transformer
    • self-attention
    • inference-latency
    • context-window
    • llm
  • May 27, 2026

    Mastering Tensor Dimensions in Transformers

    • transformer
    • self-attention
    • positional-encoding
    • neural-network
    • deep-learning
    • encoder-decoder
    • kv-cache
  • May 27, 2026

    Optimizing Inference for Long Context and Large Batch Sizes with NVFP4 KV Cache

    • kv-cache
    • quantization
    • inference-latency
    • time-to-first-token
    • serving-throughput
    • gpu-memory-bandwidth
    • tensorrt-llm
    • llm
    • inference-optimization
    • mixture-of-experts
