Chapter 2 of Training Compute-Optimal Large Language Models

The 2021–2022 LLM Landscape

The paper situates itself against five large dense transformers, most trained on roughly 300B tokens:

Model          Size    Training tokens
LaMDA          137B    168B
GPT-3          175B    300B
Jurassic-1     178B    300B
Gopher         280B    300B
MT-NLG 530B    530B    270B

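For a sense of scale, the training compute of each of these dense models can be estimated with the usual C ≈ 6ND approximation (N parameters, D training tokens), which the paper also uses in its FLOP accounting. A minimal sketch, with figures taken from the table above:

```python
# Rough training compute via C ≈ 6 * N * D FLOPs (N = parameters, D = tokens).
# Figures copied from the table above; 6ND is only an approximation.
models = {
    "LaMDA":       (137e9, 168e9),
    "GPT-3":       (175e9, 300e9),
    "Jurassic-1":  (178e9, 300e9),
    "Gopher":      (280e9, 300e9),
    "MT-NLG 530B": (530e9, 270e9),
}

for name, (params, tokens) in models.items():
    flops = 6 * params * tokens
    print(f"{name:<12} ≈ {flops:.1e} training FLOPs")
```
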
MoE variants (Switch Transformer ~1.7T parameters, GLaM ~1.2T) use conditional computation to reach much larger effective parameter counts at a lower compute cost per token. This paper focuses exclusively on dense transformers.

Scaling Behavior — Kaplan et al. 2020

The foundational prior work. Two key differences from Chinchilla:

  1. Fixed training schedule: Kaplan et al. used a ~130B-token cosine schedule for every model, regardless of how long it was actually trained. This confounds the analysis: intermediate losses from runs shorter than the schedule overestimate the loss a correctly scheduled model would reach, which systematically undervalues training tokens relative to parameters (see the sketch after this list).

  2. Smaller model range: The majority of Kaplan et al.’s runs used models <100M parameters — a range where the FLOP–loss frontier may behave differently. Chinchilla uses models from 70M to 16B, with the majority above 500M.

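To make the confound concrete, the following minimal sketch (step counts are illustrative, not Kaplan et al.'s actual settings) compares the learning rate a run ends on when the cosine decay length matches the run's length versus when it is fixed for a much longer horizon; the barely-decayed learning rate in the second case is why shorter runs look worse than they should.

```python
import math

def cosine_lr(step, total_steps, lr_max=1.0, lr_min=0.1):
    """Cosine decay from lr_max to lr_min over total_steps (warmup omitted for brevity)."""
    progress = min(step / total_steps, 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

# A run that actually stops after 50k steps, evaluated under two schedules:
actual_steps = 50_000
matched = cosine_lr(actual_steps, total_steps=actual_steps)  # decay length matches the run
too_long = cosine_lr(actual_steps, total_steps=250_000)      # decay length fixed for a 5x longer run

print(f"final LR, matched schedule:  {matched:.2f}")   # 0.10 -> fully decayed
print(f"final LR, too-long schedule: {too_long:.2f}")  # ~0.91 -> barely decayed, worse final loss
```
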
Hyperparameter Estimation

Yang et al. (2021) studied optimal learning rates and batch sizes for transformers. McCandlish et al. (2018) found weak dependence between optimal batch size and model size. Levine et al. (2020) studied optimal depth-to-width ratio. Chinchilla adopts slightly shallower models than Levine et al. recommend for better wall-clock performance on TPU hardware.

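For intuition on the batch-size result, McCandlish et al. estimate the critical batch size with the "simple" gradient noise scale, tr(Σ)/|G|² (Σ the per-example gradient covariance, G the true gradient). A toy NumPy sketch, with made-up data standing in for real per-example gradients:

```python
import numpy as np

def simple_noise_scale(per_example_grads):
    """McCandlish et al.'s 'simple' gradient noise scale, tr(Sigma) / |G|^2.

    per_example_grads: (num_examples, num_params) array of per-example gradients.
    Batches much larger than this value give diminishing returns per step.
    """
    g_mean = per_example_grads.mean(axis=0)                 # estimate of the true gradient G
    tr_sigma = per_example_grads.var(axis=0, ddof=1).sum()  # trace of the per-example covariance
    return tr_sigma / float(g_mean @ g_mean)

# Toy data: per-example gradients scattered around a shared direction.
rng = np.random.default_rng(0)
true_grad = rng.normal(size=1_000)
grads = true_grad + 5.0 * rng.normal(size=(256, 1_000))
print(f"estimated critical batch size ≈ {simple_noise_scale(grads):.0f}")
```
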
Retrieval Augmentation

Borgeaud et al. (2021) RETRO augments transformers with retrieval from trillion-token databases — effectively increasing the data seen during training. This supports the paper’s thesis that data scale matters more than previously assumed.