Abstract
This paper (Hoffmann et al., DeepMind, NeurIPS 2022) establishes the Chinchilla scaling law: for compute-optimal training of transformer language models, model size N and the number of training tokens D should scale in equal proportion with compute budget C.
N_opt ∝ C^0.5 and D_opt ∝ C^0.5
This overturns the Kaplan et al. (2020) conclusion that model size should scale ~3× faster than data (N_opt ∝ C^0.73, D_opt ∝ C^0.27). The correction has immediate practical consequences: virtually all large language models of 2020–2022 (GPT-3, Gopher, Jurassic-1, MT-NLG) were undertrained relative to their compute budgets, having been made too large and trained on too little data.
The paper validates the prediction by training Chinchilla (70B parameters, 1.4T tokens) at the same compute cost as Gopher (280B, 300B tokens). Chinchilla uniformly and significantly outperforms Gopher and all larger contemporaries.
Fig. 1 — All three estimation approaches agree: for the Gopher compute budget, the optimal model is ~70B parameters (not 280B). Chinchilla validates this prediction empirically.
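A minimal numeric sketch of the rule, in Python: the exponents are the paper's (≈ 0.5 each), but the proportionality constant G below is back-solved from the Chinchilla configuration (70B parameters, 1.4T tokens) rather than taken from the paper's fit, so treat the outputs as illustrative.

```python
def compute_optimal(C, G=0.22, a=0.5, b=0.5):
    """Compute-optimal allocation for a training budget of C FLOPs, using
    N_opt = G * (C/6)**a and D_opt = (1/G) * (C/6)**b together with C ≈ 6*N*D.
    G is back-solved from the Chinchilla point (70B, 1.4T), illustrative only."""
    budget = C / 6
    return G * budget**a, (1 / G) * budget**b

# Approximately Chinchilla's training budget under C ≈ 6ND (6 * 70e9 * 1.4e12 ≈ 5.9e23),
# which the paper treats as the same budget used for Gopher:
N_opt, D_opt = compute_optimal(5.9e23)
print(f"N_opt ≈ {N_opt / 1e9:.0f}B parameters, D_opt ≈ {D_opt / 1e12:.1f}T tokens")
# -> roughly 70B parameters and 1.4T tokens, i.e. the Chinchilla configuration,
#    not Gopher's split of 280B parameters on 300B tokens
```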
Key Concepts
- Chinchilla Scaling Law: N_opt ∝ C^0.5, D_opt ∝ C^0.5 — model parameters and training tokens should grow at equal rates as compute increases
- Compute-Optimal Training: training a model of size N on exactly D_opt(C) tokens for a given compute budget C, as opposed to over-training a larger model on less data
- IsoFLOP Profile: a set of training runs at fixed FLOP budget with varying model sizes — the loss valley identifies N_opt for that budget
- Parametric Loss Decomposition: L(N, D) = E + A/N^α + B/D^β, where E is irreducible loss, A/N^α is capacity-limited loss, B/D^β is data-limited loss (see the sketch after this list)
- Training Curve Envelope: the lower envelope of final losses across all (N, D) runs at each FLOP count, used to extract the efficient frontier
- MassiveText: the training corpus for Chinchilla and Gopher (MassiveWeb, Books, C4, News, GitHub, Wikipedia; 4.7 TB total)
- AdamW: the optimizer used for Chinchilla instead of Adam — provides better LM loss and downstream performance, visible primarily after 80% of the cosine cycle
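The parametric loss decomposition above can be written out directly; the sketch below uses constants that are roughly the fitted values reported in the paper (E ≈ 1.69, A ≈ 406.4, B ≈ 410.7, α ≈ 0.34, β ≈ 0.28), included for illustration rather than as authoritative figures.

```python
def parametric_loss(N, D, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """L(N, D) = E + A / N**alpha + B / D**beta.

    E is the irreducible loss of the data distribution, A/N**alpha the penalty
    for finite model capacity, and B/D**beta the penalty for finite data.
    Defaults are roughly the paper's fitted constants (illustrative only)."""
    return E + A / N**alpha + B / D**beta

# A Chinchilla-scale point: 70B parameters, 1.4T tokens.
print(parametric_loss(70e9, 1.4e12))  # ≈ 1.94; most of it is the irreducible term E
```

At a fixed FLOP budget the two power-law terms pull against each other: growing N shrinks A/N^α but, because C ≈ 6ND, forces D down and grows B/D^β; the compute-optimal point balances the two.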
Key Equations and Algorithms
- FLOP approximation: FLOPs(N, D) ≈ 6ND; the forward pass costs ≈ 2ND FLOPs (a multiply and an add per parameter per token) and the backward pass roughly twice the forward pass, giving the overall factor of 6
- Parametric loss: L̂(N, D) = E + A/N^α + B/D^β, fit via L-BFGS with Huber loss (δ = 10^-3) over 400+ runs
- Efficient frontier: minimizing L̂ subject to FLOPs ≈ 6ND yields N_opt = G × (C/6)^a and D_opt = G^-1 × (C/6)^b, with a = β/(α+β), b = α/(α+β) (both ≈ 0.5) and G a constant determined by the fitted A, B, α, β (see the fitting sketch after this list)
- Power-law exponents: Approach 1: (a=0.50, b=0.50); Approach 2: (a=0.49, b=0.51); Approach 3: (a=0.46, b=0.54)
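A sketch of the fitting step, assuming the final losses of the runs are available as arrays N, D, L. It follows the paper's recipe in spirit, L-BFGS on a Huber loss (δ = 10^-3) over log-losses, but uses a single SciPy call with one illustrative initial guess instead of the paper's grid of initialisations.

```python
import numpy as np
from scipy.optimize import minimize

def fit_parametric_loss(N, D, L, delta=1e-3):
    """Fit L(N, D) = E + A/N**alpha + B/D**beta to observed final losses.

    Optimises (log A, log B, log E, alpha, beta) with L-BFGS on a Huber loss
    of the residual between predicted and observed log-loss (delta = 1e-3).
    Single initial guess here; the paper sweeps a grid of initialisations."""
    logN, logD, logL = np.log(N), np.log(D), np.log(L)

    def huber(r):
        a = np.abs(r)
        return np.where(a <= delta, 0.5 * r**2, delta * (a - 0.5 * delta))

    def objective(theta):
        logA, logB, logE, alpha, beta = theta
        # log of (A/N^alpha + B/D^beta + E), computed stably with log-sum-exp
        pred = np.logaddexp(np.logaddexp(logA - alpha * logN, logB - beta * logD), logE)
        return huber(pred - logL).sum()

    res = minimize(objective, x0=[5.0, 5.0, 0.5, 0.3, 0.3], method="L-BFGS-B")
    logA, logB, logE, alpha, beta = res.x
    # The compute-optimal exponents follow directly from the fitted alpha, beta:
    a, b = beta / (alpha + beta), alpha / (alpha + beta)
    return dict(A=np.exp(logA), B=np.exp(logB), E=np.exp(logE),
                alpha=alpha, beta=beta, a=a, b=b)
```

Plugging the paper's fitted exponents (α ≈ 0.34, β ≈ 0.28) into a = β/(α+β) and b = α/(α+β) gives a ≈ 0.46, b ≈ 0.54, which is Approach 3 in the list above.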
Key Claims and Findings
- Current LLMs are undertrained: GPT-3, Gopher, and MT-NLG were all trained on ~300B tokens — far less than compute-optimal. A 175B model needs ~3.7T tokens; Gopher (280B) needed ~5.9T tokens for compute-optimal training (see the sketch after this list).
- Chinchilla outperforms Gopher uniformly: 67.6% vs. 60.0% MMLU (+7.6%), 65.1% vs. 54.4% BIG-bench (+10.7%), better on all reading comprehension, QA, and common sense benchmarks — at 4× smaller model size.
- Three independent approaches agree: Training curve envelope, IsoFLOP profiles, and parametric loss fitting all converge on equal N–D scaling, despite different experimental designs and fitting methodologies.
- Kaplan et al. was wrong due to methodology: Kaplan et al. used a fixed number of training tokens and a fixed learning-rate (cosine) schedule for all model sizes, which overestimates the loss of models trained on fewer tokens and so undervalues training data; matching the cosine cycle length to the actual number of training steps reverses the conclusion.
- TruthfulQA improves substantially: +14.1% zero-shot accuracy (Chinchilla 43.6% vs. Gopher 29.5%) — factual grounding improves with more training data, not just more parameters.
- Toxicity is parameter-independent: No meaningful difference in toxic output between Chinchilla and Gopher; toxicity in unconditional generation does not scale with language modelling quality.
- Data quality will be the bottleneck: Further scaling requires not just more data, but responsibly collected, high-quality data.
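A back-of-the-envelope check of the undertraining claim (the sketch referenced from the first bullet above). Under equal N–D scaling, compute-optimal tokens grow in proportion to parameters, and the Chinchilla point (70B, 1.4T) implies roughly 20 tokens per parameter; the code below uses this informal ratio rather than the paper's fitted frontier, so the reproduced numbers are approximate.

```python
# Rule of thumb implied by the Chinchilla point: D_opt / N_opt ≈ 1.4e12 / 70e9 ≈ 20
# tokens per parameter (an informal corollary of equal N–D scaling, not the
# paper's fitted frontier).
TOKENS_PER_PARAM = 1.4e12 / 70e9  # = 20.0

for name, n_params, tokens_trained in [
    ("GPT-3",  175e9, 300e9),
    ("Gopher", 280e9, 300e9),
]:
    d_opt = TOKENS_PER_PARAM * n_params
    print(f"{name}: trained on {tokens_trained / 1e12:.1f}T tokens, "
          f"compute-optimal would be ≈ {d_opt / 1e12:.1f}T")
# GPT-3: 0.3T vs ≈ 3.5T; Gopher: 0.3T vs ≈ 5.6T — in line with the paper's
# ~3.7T and ~5.9T estimates from the fitted frontier.
```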
Terminology
- Chinchilla: The 70B-parameter compute-optimal model trained by DeepMind. Named after the animal. Trained on MassiveText for 1.4T tokens with the same FLOPs as Gopher.
- Gopher: DeepMind’s 280B-parameter LLM (Rae et al. 2021), trained on 300B tokens. The direct baseline and compute-budget comparator for Chinchilla.
- FLOPs: Floating-point operations. Approximated here as 6ND for a dense transformer with N parameters trained on D tokens.
- IsoFLOP: Shorthand for “iso-floating-point-operations” — a family of training runs at fixed total compute with varying model size.
- MassiveText: The training corpus assembled by DeepMind, containing MassiveWeb (45%), Books (30%), C4 (10%), News (10%), GitHub (4%), Wikipedia (1%).
- Compute-optimal frontier: The set of (N, D, C) triples where further improvement requires increasing the compute budget rather than adjusting N or D.
Chapter Summaries
- Ch. 1 — Introduction
- Ch. 2 — Related Work
- Ch. 3 — Estimating the Optimal Parameter/Token Allocation
- Ch. 4 — Chinchilla Training and Results
- Ch. 5 — Discussion and Conclusion
- Ch. 6 — Appendices (A–J)
Connections to Existing Wiki Pages
- Attention Is All You Need — the transformer architecture on which Chinchilla, Gopher, GPT-3, and all models in this paper are based
- DeepSeek-R1 — a later LLM whose training decisions were informed by Chinchilla scaling insights