Chapter 1 of Training Compute-Optimal Large Language Models
Introduction
The central question of this paper: given a fixed compute budget C (FLOPs), how should one trade off model size N and the number of training tokens D?
Prior work (Kaplan et al., 2020) concluded that model size should scale roughly as N_opt ∝ C^0.73 while data scales as D_opt ∝ C^0.27: the exponent on model size is almost three times the exponent on data, so nearly all additional compute should go into a larger model rather than more tokens. This paper challenges that conclusion with a larger and more carefully controlled set of experiments.
Core Finding
By training over 400 models (70M–16B parameters, 5B–500B tokens), with learning rate schedules correctly calibrated per training run, Hoffmann et al. find that:
N_opt ∝ C^0.5 and D_opt ∝ C^0.5
Model size and training tokens should scale in equal proportion with compute budget. For every doubling of model size, training tokens should also double.
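A quick numerical contrast of the two prescriptions, as a small Python sketch. The exponents come from the two papers quoted above; absolute constants are omitted since only the growth rates matter here.

```python
# Compare how model size (N) and training tokens (D) should grow when the
# compute budget is multiplied by k, under the two sets of exponents above.

def growth(k, exp_n, exp_d):
    """Multipliers for N_opt and D_opt when compute grows by a factor k."""
    return k ** exp_n, k ** exp_d

k = 100  # e.g. 100x more compute

kap_n, kap_d = growth(k, 0.73, 0.27)  # Kaplan et al. (2020)
chi_n, chi_d = growth(k, 0.50, 0.50)  # Hoffmann et al. (2022)

print(f"Kaplan:     N x{kap_n:.1f}, D x{kap_d:.1f}")   # N x28.8, D x3.5
print(f"Chinchilla: N x{chi_n:.1f}, D x{chi_d:.1f}")   # N x10.0, D x10.0
```

Under the Kaplan exponents, 100× more compute means a ~29× larger model trained on only ~3.5× more tokens; under the Chinchilla exponents, both grow 10×.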
Why Kaplan et al. Was Wrong
The key methodological difference: Kaplan et al. used a cosine learning rate schedule with a fixed length, calibrated for 130B tokens, for every model, regardless of how many tokens each model was actually trained on. A schedule that is too long leaves the learning rate insufficiently decayed, so the loss of models trained on fewer than 130B tokens is systematically overestimated. That makes shorter training runs look less effective than they really are and biases the fitted optimum toward growing model size faster than data.
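To see why the schedule length matters, here is a minimal sketch of a cosine learning-rate decay. The 130B-token horizon is from the paper's description of Kaplan et al.; the peak and minimum learning rates and the 30B-token short run are illustrative assumptions, not values from either paper.

```python
import math

def cosine_lr(tokens_seen, schedule_tokens, peak_lr=3e-4, min_lr=3e-5):
    """Cosine decay from peak_lr down to min_lr over schedule_tokens."""
    frac = min(tokens_seen / schedule_tokens, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * frac))

short_run = 30e9  # a hypothetical run that only sees 30B tokens

# Schedule set for 130B tokens but stopped at 30B: the LR has barely decayed.
print(cosine_lr(short_run, schedule_tokens=130e9))  # ~2.7e-4, still near peak

# Schedule matched to the actual 30B-token run: the LR has fully decayed.
print(cosine_lr(short_run, schedule_tokens=30e9))   # 3e-5, i.e. min_lr
```

A run stopped mid-schedule never reaches the final low-learning-rate phase, so its measured loss overstates what a properly scheduled run of the same length would achieve.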
Practical Implication
At the time of publication, all major LLMs (each trained on roughly 300B tokens) were substantially undertrained for their size:
| Model | Actual training (params, tokens) | Compute-optimal for the same FLOPs |
|---|---|---|
| GPT-3 | 175B params, 300B tokens | ~45B params, 900B tokens |
| Gopher | 280B params, 300B tokens | ~67B params, 1.5T tokens |
| MT-NLG | 530B params, 270B tokens | Even more undertrained |
The paper validates this by training Chinchilla (70B parameters, 1.4T tokens) on the same compute budget as Gopher and showing that it outperforms Gopher, GPT-3, and the other larger contemporary models across a wide range of evaluations.
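A back-of-the-envelope check of that trade-off, as a Python sketch using the FLOPs ≈ 6ND approximation from the Problem Formulation below. The small gap between the two totals is an artifact of that approximation; the paper treats the two budgets as equal.

```python
def approx_flops(n_params, n_tokens):
    """Rough training compute using the FLOPs ~ 6 * N * D approximation."""
    return 6 * n_params * n_tokens

models = {
    "Gopher":     dict(params=280e9, tokens=300e9),
    "Chinchilla": dict(params=70e9,  tokens=1.4e12),
}

for name, m in models.items():
    flops = approx_flops(m["params"], m["tokens"])
    tokens_per_param = m["tokens"] / m["params"]
    print(f"{name:10s}: ~{flops:.1e} FLOPs, "
          f"~{tokens_per_param:.0f} tokens per parameter")
# Gopher    : ~5.0e+23 FLOPs, ~1 tokens per parameter
# Chinchilla: ~5.9e+23 FLOPs, ~20 tokens per parameter
```

Roughly the same budget, but Chinchilla spends it on about 20 tokens per parameter rather than about one.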
Problem Formulation
The training loss L(N, D) is minimized subject to the FLOP constraint:
FLOPs(N, D) ≈ 6ND
(The factor of 6 breaks down as 2 FLOPs per multiply-accumulate in the forward pass, times 3 because the backward pass costs roughly twice the forward pass, giving about 6 FLOPs per parameter per token.)
The functions N_opt(C) and D_opt(C) describe the optimal allocation of a compute budget C:

N_opt(C), D_opt(C) = argmin over (N, D) with FLOPs(N, D) = C of L(N, D)
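A minimal sketch of that constrained search. The power-law loss surface below uses illustrative stand-in constants, not the paper's fitted values; it only shows the mechanics of minimizing L(N, D) along the FLOPs(N, D) = C constraint.

```python
import numpy as np

C = 5.8e23  # example compute budget in FLOPs, roughly Gopher/Chinchilla scale

def toy_loss(n_params, n_tokens):
    """Stand-in loss surface L(N, D): an irreducible term plus power-law
    penalties for having too few parameters or too few training tokens.
    The constants here are illustrative, not the paper's fit."""
    return 1.7 + 400.0 / n_params**0.34 + 410.0 / n_tokens**0.28

# Sweep model sizes; the constraint C = 6 * N * D fixes D for each N.
n_grid = np.logspace(9, 12, 2000)   # 1B to 1T parameters
d_grid = C / (6.0 * n_grid)         # tokens implied by the budget
losses = toy_loss(n_grid, d_grid)

best = np.argmin(losses)
print(f"N_opt ~ {n_grid[best]:.2e} params, D_opt ~ {d_grid[best]:.2e} tokens")
```

Repeating the search over a range of budgets C traces out N_opt(C) and D_opt(C); with the loss surface the paper actually fits, both grow roughly as C^0.5, as stated in the core finding above.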