Chapter 2 of Training Compute-Optimal Large Language Models
Related Work
The 2021–2022 LLM Landscape
The paper situates itself against five large dense transformers, each trained on at most ~300B tokens:
| Model | Size | Training tokens |
|---|---|---|
| LaMDA | 137B | 168B |
| GPT-3 | 175B | 300B |
| Jurassic-1 | 178B | 300B |
| Gopher | 280B | 300B |
| MT-NLG 530B | 530B | 270B |
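To get a rough sense of the compute these runs represent, the standard C ≈ 6ND rule of thumb (from Kaplan et al.; N = parameters, D = training tokens) can be applied to the table. The FLOP figures below are estimates from this approximation, not numbers reported in the paper:

```python
# Approximate training compute with the C ≈ 6 * N * D rule of thumb,
# using the parameter and token counts from the table above.
models = {
    "LaMDA":       (137e9, 168e9),
    "GPT-3":       (175e9, 300e9),
    "Jurassic-1":  (178e9, 300e9),
    "Gopher":      (280e9, 300e9),
    "MT-NLG 530B": (530e9, 270e9),
}

for name, (n_params, n_tokens) in models.items():
    flops = 6 * n_params * n_tokens
    print(f"{name}: ~{flops:.2e} FLOPs")  # e.g. Gopher: ~5.04e+23 FLOPs
```

Note that despite a ~4x spread in parameter count, the token counts are nearly constant, which is exactly the pattern the paper argues against.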
MoE variants (Switch Transformer ~1.7T, GLaM ~1.2T) use conditional computation for larger effective capacity at lower inference cost. This paper focuses exclusively on dense transformers.
Scaling Behavior — Kaplan et al. 2020
Kaplan et al. (2020) is the foundational prior work. Two key differences from Chinchilla:
- Fixed training schedule: Kaplan et al. used a 130B-token cosine schedule for all models regardless of actual training length. This creates a confound — intermediate loss estimates for shorter training runs overestimate the loss of a correctly-scheduled model. The result is a systematic undervaluation of training tokens relative to parameters.
- Smaller model range: The majority of Kaplan et al.'s runs used models <100M parameters — a range where the FLOP–loss frontier may behave differently. Chinchilla uses models from 70M to 16B, with the majority above 500M.
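The schedule confound is easy to see numerically. A minimal sketch (the lr_max/lr_min values and the 50B-token stopping point are illustrative assumptions, not values from either paper):

```python
import math

def cosine_lr(step, total_steps, lr_max=1.0, lr_min=0.1):
    """Cosine decay from lr_max to lr_min over total_steps."""
    progress = min(step / total_steps, 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

# Suppose a run actually stops at 50B tokens (units: billions of tokens).
stop = 50
# A schedule set for the true horizon has fully decayed by then:
print(cosine_lr(stop, total_steps=50))   # 0.1 — reached lr_min
# A fixed 130B-token schedule, as in Kaplan et al., has not:
print(cosine_lr(stop, total_steps=130))  # ~0.71 — LR still high, loss inflated
```

Because the under-decayed run reports a worse loss than a properly scheduled model of the same size would achieve, shorter (more token-efficient) runs look systematically worse, biasing the fitted laws toward parameters over tokens.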
Hyperparameter Estimation
Yang et al. (2021) studied optimal learning rates and batch sizes for transformers. McCandlish et al. (2018) found that optimal batch size depends primarily on the current loss and only weakly on model size. Levine et al. (2020) studied the optimal depth-to-width ratio. Chinchilla adopts slightly shallower models than Levine et al. recommend, for better wall-clock performance on TPU hardware.
Retrieval Augmentation
Borgeaud et al. (2021) propose RETRO, which augments transformers with retrieval from a trillion-token database — effectively increasing the data available during training. This supports the paper's thesis that data scale matters more than previously assumed.