Chapter 1 of Training Compute-Optimal Large Language Models
Introduction
The central question of this paper: given a fixed compute budget C (FLOPs), how should one trade off model size N and the number of training tokens D?
Prior work (Kaplan et al., 2020) concluded that model size should scale roughly as N_opt ∝ C^0.73 while data scales as D_opt ∝ C^0.27: the exponent on model size is almost three times the exponent on data, so nearly all additional compute should go into a larger model rather than more tokens. This paper challenges that conclusion with a larger and more carefully controlled set of experiments.
Core Finding
By training over 400 models (70M–16B parameters, 5B–500B tokens), with learning rate schedules correctly calibrated per training run, Hoffmann et al. find that:
N_opt ∝ C^0.5 and D_opt ∝ C^0.5
Model size and training tokens should scale in equal proportion with compute budget. For every doubling of model size, training tokens should also double.
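A quick numerical contrast of the two prescriptions, as a small Python sketch. The exponents come from the two papers quoted above; absolute constants are omitted since only the growth rates matter here.

```python
# Compare how model size (N) and training tokens (D) should grow when the
# compute budget is multiplied by k, under the two sets of exponents above.

def growth(k, exp_n, exp_d):
    """Multipliers for N_opt and D_opt when compute grows by a factor k."""
    return k ** exp_n, k ** exp_d

k = 100  # e.g. 100x more compute

kap_n, kap_d = growth(k, 0.73, 0.27)  # Kaplan et al. (2020)
chi_n, chi_d = growth(k, 0.50, 0.50)  # Hoffmann et al. (2022)

print(f"Kaplan:     N x{kap_n:.1f}, D x{kap_d:.1f}")   # N x28.8, D x3.5
print(f"Chinchilla: N x{chi_n:.1f}, D x{chi_d:.1f}")   # N x10.0, D x10.0
```

Under the Kaplan exponents, 100× more compute means a ~29× larger model trained on only ~3.5× more tokens; under the Chinchilla exponents, both grow 10×.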
Why Kaplan et al. Was Wrong
The key methodological difference: Kaplan et al. used a cosine learning rate schedule with a fixed length, calibrated for 130B tokens, for every model, regardless of how many tokens each model was actually trained on. A schedule that is too long leaves the learning rate insufficiently decayed, so the loss of models trained on fewer than 130B tokens is systematically overestimated. That makes shorter training runs look less effective than they really are and biases the fitted optimum toward growing model size faster than data.
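To see why the schedule length matters, here is a minimal sketch of a cosine learning-rate decay. The 130B-token horizon is from the paper's description of Kaplan et al.; the peak and minimum learning rates and the 30B-token short run are illustrative assumptions, not values from either paper.

```python
import math

def cosine_lr(tokens_seen, schedule_tokens, peak_lr=3e-4, min_lr=3e-5):
    """Cosine decay from peak_lr down to min_lr over schedule_tokens."""
    frac = min(tokens_seen / schedule_tokens, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * frac))

short_run = 30e9  # a hypothetical run that only sees 30B tokens

# Schedule set for 130B tokens but stopped at 30B: the LR has barely decayed.
print(cosine_lr(short_run, schedule_tokens=130e9))  # ~2.7e-4, still near peak

# Schedule matched to the actual 30B-token run: the LR has fully decayed.
print(cosine_lr(short_run, schedule_tokens=30e9))   # 3e-5, i.e. min_lr
```

A run stopped mid-schedule never reaches the final low-learning-rate phase, so its measured loss overstates what a properly scheduled run of the same length would achieve.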
Practical Implication
At the time of publication, all major LLMs (each trained on roughly 300B tokens) were substantially undertrained for their size:
| Model | Actual training (params, tokens) | Compute-optimal for the same FLOPs |
|---|---|---|
| GPT-3 | 175B params, 300B tokens | ~45B params, 900B tokens |
| Gopher | 280B params, 300B tokens | ~67B params, 1.5T tokens |
| MT-NLG | 530B params, 270B tokens | Even more undertrained |
The paper validates this by training Chinchilla (70B parameters, 1.4T tokens) on the same compute budget as Gopher and showing that it outperforms Gopher, GPT-3, and the other larger contemporary models across a wide range of evaluations.
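A back-of-the-envelope check of that trade-off, as a Python sketch using the FLOPs ≈ 6ND approximation from the Problem Formulation below. The small gap between the two totals is an artifact of that approximation; the paper treats the two budgets as equal.

```python
def approx_flops(n_params, n_tokens):
    """Rough training compute using the FLOPs ~ 6 * N * D approximation."""
    return 6 * n_params * n_tokens

models = {
    "Gopher":     dict(params=280e9, tokens=300e9),
    "Chinchilla": dict(params=70e9,  tokens=1.4e12),
}

for name, m in models.items():
    flops = approx_flops(m["params"], m["tokens"])
    tokens_per_param = m["tokens"] / m["params"]
    print(f"{name:10s}: ~{flops:.1e} FLOPs, "
          f"~{tokens_per_param:.0f} tokens per parameter")
# Gopher    : ~5.0e+23 FLOPs, ~1 tokens per parameter
# Chinchilla: ~5.9e+23 FLOPs, ~20 tokens per parameter
```

Roughly the same budget, but Chinchilla spends it on about 20 tokens per parameter rather than about one.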
Problem Formulation
The training loss L(N, D) is minimized subject to the FLOP constraint:
FLOPs(N, D) ≈ 6ND
(The factor of 6 breaks down as 2 FLOPs per multiply-accumulate in the forward pass, times 3 because the backward pass costs roughly twice the forward pass, giving about 6 FLOPs per parameter per token.)
The functions N_opt(C) and D_opt(C) describe the optimal allocation of a compute budget C:

N_opt(C), D_opt(C) = argmin over (N, D) with FLOPs(N, D) = C of L(N, D)
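A minimal sketch of that constrained search. The power-law loss surface below uses illustrative stand-in constants, not the paper's fitted values; it only shows the mechanics of minimizing L(N, D) along the FLOPs(N, D) = C constraint.

```python
import numpy as np

C = 5.8e23  # example compute budget in FLOPs, roughly Gopher/Chinchilla scale

def toy_loss(n_params, n_tokens):
    """Stand-in loss surface L(N, D): an irreducible term plus power-law
    penalties for having too few parameters or too few training tokens.
    The constants here are illustrative, not the paper's fit."""
    return 1.7 + 400.0 / n_params**0.34 + 410.0 / n_tokens**0.28

# Sweep model sizes; the constraint C = 6 * N * D fixes D for each N.
n_grid = np.logspace(9, 12, 2000)   # 1B to 1T parameters
d_grid = C / (6.0 * n_grid)         # tokens implied by the budget
losses = toy_loss(n_grid, d_grid)

best = np.argmin(losses)
print(f"N_opt ~ {n_grid[best]:.2e} params, D_opt ~ {d_grid[best]:.2e} tokens")
```

Repeating the search over a range of budgets C traces out N_opt(C) and D_opt(C); with the loss surface the paper actually fits, both grow roughly as C^0.5, as stated in the core finding above.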