Chapter 2 of Training Compute-Optimal Large Language Models
Related Work
The 2021–2022 LLM Landscape
The paper situates itself against five large dense transformers, each trained on at most ~300B tokens:
| Model | Size | Training tokens |
|---|---|---|
| LaMDA | 137B | 168B |
| GPT-3 | 175B | 300B |
| Jurassic-1 | 178B | 300B |
| Gopher | 280B | 300B |
| MT-NLG 530B | 530B | 270B |
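To get a rough sense of the compute these runs represent, the standard C ≈ 6ND rule of thumb (from Kaplan et al.; N = parameters, D = training tokens) can be applied to the table. The FLOP figures below are estimates from this approximation, not numbers reported in the paper:

```python
# Approximate training compute with the C ≈ 6 * N * D rule of thumb,
# using the parameter and token counts from the table above.
models = {
    "LaMDA":       (137e9, 168e9),
    "GPT-3":       (175e9, 300e9),
    "Jurassic-1":  (178e9, 300e9),
    "Gopher":      (280e9, 300e9),
    "MT-NLG 530B": (530e9, 270e9),
}

for name, (n_params, n_tokens) in models.items():
    flops = 6 * n_params * n_tokens
    print(f"{name}: ~{flops:.2e} FLOPs")  # e.g. Gopher: ~5.04e+23 FLOPs
```

Note that despite a ~4x spread in parameter count, the token counts are nearly constant, which is exactly the pattern the paper argues against.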
MoE variants (Switch Transformer ~1.7T, GLaM ~1.2T) use conditional computation for larger effective capacity at lower inference cost. This paper focuses exclusively on dense transformers.
Scaling Behavior — Kaplan et al. 2020
Kaplan et al. (2020) is the foundational prior work. Two key differences from Chinchilla:
- Fixed training schedule: Kaplan et al. used a 130B-token cosine schedule for all models regardless of actual training length. This creates a confound — intermediate loss estimates for shorter training runs overestimate the loss of a correctly-scheduled model. The result is a systematic undervaluation of training tokens relative to parameters.
- Smaller model range: The majority of Kaplan et al.'s runs used models <100M parameters — a range where the FLOP–loss frontier may behave differently. Chinchilla uses models from 70M to 16B, with the majority above 500M.
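The schedule confound is easy to see numerically. A minimal sketch (the lr_max/lr_min values and the 50B-token stopping point are illustrative assumptions, not values from either paper):

```python
import math

def cosine_lr(step, total_steps, lr_max=1.0, lr_min=0.1):
    """Cosine decay from lr_max to lr_min over total_steps."""
    progress = min(step / total_steps, 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

# Suppose a run actually stops at 50B tokens (units: billions of tokens).
stop = 50
# A schedule set for the true horizon has fully decayed by then:
print(cosine_lr(stop, total_steps=50))   # 0.1 — reached lr_min
# A fixed 130B-token schedule, as in Kaplan et al., has not:
print(cosine_lr(stop, total_steps=130))  # ~0.71 — LR still high, loss inflated
```

Because the under-decayed run reports a worse loss than a properly scheduled model of the same size would achieve, shorter (more token-efficient) runs look systematically worse, biasing the fitted laws toward parameters over tokens.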
Hyperparameter Estimation
Yang et al. (2021) studied optimal learning rates and batch sizes for transformers. McCandlish et al. (2018) found that optimal batch size depends primarily on the current loss and only weakly on model size. Levine et al. (2020) studied the optimal depth-to-width ratio. Chinchilla adopts slightly shallower models than Levine et al. recommend, for better wall-clock performance on TPU hardware.
Retrieval Augmentation
Borgeaud et al. (2021) propose RETRO, which augments transformers with retrieval from a trillion-token database — effectively increasing the data available during training. This supports the paper's thesis that data scale matters more than previously assumed.