Chapter 4 of Training Compute-Optimal Large Language Models

Chinchilla — Training and Results

Architecture and Training Configuration

For the Gopher compute budget of 5.76×10^23 FLOPs, the compute-optimal model size is predicted to be 40–70B parameters. Chinchilla is trained at the top of that range: 70B parameters on 1.4T tokens.
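
A quick sanity check of the budget, assuming the standard C ≈ 6·N·D approximation (6 FLOPs per parameter per training token; the approximation itself is not restated in this section):

    # Python: back-of-the-envelope check that 70B parameters trained on
    # 1.4T tokens lands on the Gopher compute budget, assuming C = 6*N*D.
    n_params = 70e9        # Chinchilla parameter count
    n_tokens = 1.4e12      # Chinchilla training tokens

    flops = 6 * n_params * n_tokens
    print(f"{flops:.2e}")  # 5.88e+23, close to the 5.76e23 budget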

                     Gopher 280B     Chinchilla 70B
Layers               80              80
Attention heads      128             64
d_model              16,384          8,192
Feed-forward         4 × d_model     4 × d_model
Max learning rate    4×10^-5         1×10^-4
Batch size           3M→6M tokens    1.5M→3M tokens
Training tokens      300B            1.4T
Compute (FLOPs)      5.76×10^23      5.76×10^23

Same compute budget. 4× smaller model. 4× more data.

Key changes from Gopher:

  • AdamW (Loshchilov & Hutter, 2019) instead of Adam: better language-modelling loss and downstream performance, especially after fine-tuning. The advantage only becomes visible after roughly 80% of the cosine cycle (see the optimizer sketch after this list).
  • Modified SentencePiece tokenizer without NFKC normalization: 94.15% vocabulary overlap with Gopher's tokenizer; improves the representation of mathematics and chemistry.
  • Adjusted MassiveText distribution: slightly different subset weights to account for the 1.4T training tokens (MassiveWeb: 45%, Books: 30%, C4: 10%, News: 10%, GitHub: 4%, Wikipedia: 1%); a sampling sketch follows this list.
  • Mixed precision: bfloat16 forward and backward passes, float32 optimizer state (via ZeRO); see the training-step sketch below.
  • Hardware: TPUv3/TPUv4, trained with JAX + Haiku.
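
A minimal sketch of the optimizer change, using the optax library (the section names JAX + Haiku but shows no training code; the warmup length, total step count, weight-decay value, and end learning rate below are hypothetical):

    import optax

    PEAK_LR = 1e-4           # Chinchilla max learning rate (table above)
    TOTAL_STEPS = 100_000    # hypothetical: cosine cycle ≈ full training run

    # Cosine decay after a linear warmup; the reported AdamW advantage
    # only appears in roughly the last 20% of this cycle.
    schedule = optax.warmup_cosine_decay_schedule(
        init_value=0.0,
        peak_value=PEAK_LR,
        warmup_steps=1_000,       # hypothetical warmup
        decay_steps=TOTAL_STEPS,
        end_value=0.1 * PEAK_LR,  # hypothetical: decay to 10x below peak
    )

    # AdamW decouples weight decay from the adaptive update, rather than
    # folding an L2 term into the gradient as Adam-with-L2 does.
    optimizer = optax.adamw(learning_rate=schedule, weight_decay=0.1)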
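
The adjusted data mixture amounts to weighted sampling over MassiveText subsets. An illustrative sketch only; the real data pipeline is not shown in this section:

    import random

    # Adjusted MassiveText subset weights (from the bullet above).
    MASSIVETEXT_WEIGHTS = {
        "MassiveWeb": 0.45,
        "Books":      0.30,
        "C4":         0.10,
        "News":       0.10,
        "GitHub":     0.04,
        "Wikipedia":  0.01,
    }

    def sample_subset(rng: random.Random) -> str:
        """Pick the subset the next training document is drawn from."""
        subsets, weights = zip(*MASSIVETEXT_WEIGHTS.items())
        return rng.choices(subsets, weights=weights, k=1)[0]

    rng = random.Random(0)
    counts = {s: 0 for s in MASSIVETEXT_WEIGHTS}
    for _ in range(10_000):
        counts[sample_subset(rng)] += 1
    print(counts)  # empirical proportions approach the weights above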
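
And a sketch of the mixed-precision training step in plain JAX: parameters are cast to bfloat16 for the forward/backward pass, while the update is applied to a float32 master copy (held in the ZeRO-sharded optimizer state in the real setup). The model, loss, and optimizer here are placeholders, not Chinchilla's:

    import jax
    import jax.numpy as jnp

    def loss_fn(params_bf16, batch):
        # Forward pass runs in bfloat16 (placeholder linear model).
        preds = batch["x"] @ params_bf16["w"]
        return jnp.mean((preds - batch["y"]) ** 2)

    @jax.jit
    def train_step(params_f32, batch):
        # Cast the float32 master weights down for compute...
        params_bf16 = jax.tree_util.tree_map(
            lambda p: p.astype(jnp.bfloat16), params_f32)
        grads = jax.grad(loss_fn)(params_bf16, batch)
        # ...then apply the update in float32 so small updates are not
        # rounded away (plain SGD stands in for AdamW here).
        return jax.tree_util.tree_map(
            lambda p, g: p - 1e-4 * g.astype(jnp.float32), params_f32, grads)

    params = {"w": jnp.zeros((8, 1), jnp.float32)}
    batch = {"x": jnp.ones((4, 8), jnp.bfloat16),
             "y": jnp.zeros((4, 1), jnp.bfloat16)}
    params = train_step(params, batch)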

Results

Language Modelling (The Pile)

Chinchilla outperforms Gopher on all subsets of The Pile. WikiText-103 perplexity: 7.16 (Chinchilla) vs. 7.75 (Gopher).

Fig. 5 — Bits-per-byte improvement of Chinchilla over Gopher across all Pile subsets. Chinchilla is better on every subset.

MMLU (57-task exam benchmark)

Model                   5-shot accuracy
GPT-3 (175B)            43.9%
Gopher (280B)           60.0%
Chinchilla (70B)        67.6%
Human expert average    89.8%

Chinchilla exceeds the expert forecast of 63.4% accuracy for June 2023. It outperforms Gopher on 51 of 57 tasks, matches it on 2, and underperforms on 4.

Fig. 6 — MMLU per-task comparison: Chinchilla vs. Gopher. Chinchilla outperforms on 51/57 tasks.

BIG-bench (62 tasks)

Chinchilla reaches 65.1% average accuracy vs. 54.4% for Gopher, a gain of 10.7 percentage points. It underperforms Gopher on only 4 of 62 tasks.

Fig. 7 — BIG-bench per-task comparison. Chinchilla outperforms on 58/62 tasks.

Reading Comprehension

Task                   Chinchilla    Gopher    GPT-3
LAMBADA (zero-shot)    77.4%         74.5%     76.2%
RACE-h (few-shot)      82.3%         71.6%     46.8%
RACE-m (few-shot)      86.8%         75.1%     58.1%

Common Sense (zero-shot)

Chinchilla outperforms Gopher and GPT-3 on all 5 benchmarks (HellaSwag, Winogrande, SIQA, BoolQ, PIQA) and outperforms MT-NLG 530B on 4/5.

TruthfulQA 10-shot: 66.7% vs. 43.7% (Gopher) — factual accuracy improves substantially with more training data, not just more parameters.

Closed-Book Question Answering

  • Natural Questions 64-shot: 35.5% vs. 28.2% (Gopher) — new closed-book SOTA at time of publication
  • TriviaQA (filtered set): 64.6% vs. 57.2% (Gopher), within 7.9 percentage points of the open-book SOTA (FiD + Distillation)

Bias and Toxicity

Gender bias (Winogender): Chinchilla resolves pronoun coreference correctly more often than Gopher across all pronoun types (+3.2% male, +8.3% female, +9.2% neutral), with a larger gain on female "gotcha" examples (+10%). Bias is still present: the improvements are uneven, suggesting that better language modelling alone does not eliminate stereotype-driven errors.

Toxicity: Mean PerspectiveAPI score 0.087 (Chinchilla) vs. 0.081 (Gopher) — negligible difference. Toxicity in unconditional generation is largely independent of language modelling quality.