Chapter 5 of Training Compute-Optimal Large Language Models
Discussion and Conclusion
Core Thesis
The predominant 2021–2022 LLM scaling strategy — increase model size while keeping training data constant at ~300B tokens — is suboptimal. For a fixed compute budget, a smaller model trained on far more data will outperform a larger model trained on less. Three independent approaches converge on this prediction, and Chinchilla validates it at scale.
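As a rough illustration of the tradeoff, the sketch below allocates a FLOP budget under the paper's equal-scaling rule N_opt ∝ C^0.5 together with the standard approximation C ≈ 6ND. The proportionality constant is calibrated to the paper's prediction of roughly 67B parameters at Gopher's budget of 5.76e23 FLOPs; treat the constant and the exact figures as illustrative, not a fit to the paper's data.

```python
import math

# Rough illustration of the paper's equal-scaling rule. Assumes
# N_opt ∝ C^0.5 and the standard approximation C ≈ 6·N·D.
GOPHER_BUDGET = 5.76e23   # FLOPs (Gopher's training budget, per the paper)
N_AT_BUDGET = 67e9        # params predicted compute-optimal at that budget

def optimal_allocation(compute_flops):
    """Split a FLOP budget into (params, tokens) under N ∝ D ∝ C^0.5."""
    k = N_AT_BUDGET / math.sqrt(GOPHER_BUDGET)  # calibration constant
    n_opt = k * math.sqrt(compute_flops)        # optimal parameter count
    d_opt = compute_flops / (6 * n_opt)         # tokens implied by C ≈ 6·N·D
    return n_opt, d_opt

n, d = optimal_allocation(GOPHER_BUDGET)
print(f"params ≈ {n / 1e9:.0f}B, tokens ≈ {d / 1e12:.2f}T")
# params ≈ 67B, tokens ≈ 1.43T (close to Chinchilla's 70B / 1.4T)
```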
Limitations
- Only two comparable large-scale runs: Chinchilla and Gopher are the only models at the same compute budget with differing N/D tradeoffs. No intermediate validation points at large scale.
- Power-law assumption: The efficient frontier is assumed to follow a power law relating compute, model size, and tokens. The observed concavity in N_opt at high compute (Appendix E) suggests this may be optimistic: the true optimal model may be even smaller than predicted. (The parametric form behind this assumption is restated after this list.)
- Single-epoch regime only: All training runs use less than one epoch of data. How the tradeoff changes when data must be repeated (multi-epoch) is unknown.
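For reference, the parametric loss behind the fitted frontier (the paper's Approach 3) is restated below; minimizing it under the constraint C ≈ 6ND yields the power-law frontier whose validity the concavity concern calls into question at the largest budgets.

```latex
% Parametric loss fit and the compute-optimal frontier it implies
% under C \approx 6ND, as derived in the paper's Approach 3:
\hat{L}(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
\quad\Longrightarrow\quad
N_{\mathrm{opt}}(C) = G\left(\frac{C}{6}\right)^{a},\qquad
D_{\mathrm{opt}}(C) = G^{-1}\left(\frac{C}{6}\right)^{b},
% with
G = \left(\frac{\alpha A}{\beta B}\right)^{\frac{1}{\alpha+\beta}},\qquad
a = \frac{\beta}{\alpha+\beta},\qquad
b = \frac{\alpha}{\alpha+\beta}.
```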
Implications for Dataset Scaling
The findings call for a shift in research emphasis from model architecture to data collection:
- Compute-optimal models at large scale require far more tokens than currently used (trillions, not hundreds of billions)
- Larger datasets must be high-quality — scaling low-quality data likely does not yield proportional performance gains
- Train-test overlap must be carefully audited at trillion-token scale, both for LM loss and downstream benchmarks (see the audit sketch after this list)
- Ethical concerns compound with scale: larger web-scraped corpora contain more toxic language and private information in absolute quantity, even if relative frequency is unchanged
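On the overlap point above, a minimal sketch of what such an audit might look like, using set-based n-gram matching. The n-gram length and helper names are illustrative assumptions, not the paper's procedure.

```python
# Minimal sketch of a train-test overlap audit via n-gram matching.
# Hypothetical helpers and parameters; not the paper's exact procedure.

N = 13  # n-gram length (an illustrative choice, not from the paper)

def ngrams(tokens, n=N):
    """All contiguous n-grams of a token sequence, as a set."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contaminated_fraction(test_docs, train_ngram_set):
    """Fraction of test docs sharing at least one n-gram with training data."""
    hits = sum(1 for doc in test_docs if ngrams(doc) & train_ngram_set)
    return hits / max(len(test_docs), 1)

# At trillion-token scale the train-side set is far too large to hold
# in memory; a sharded index or Bloom filter stands in for the plain
# set here. This in-memory version only shows the idea.
```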
Broader Applicability
The N ∝ D equal-scaling result is expected to generalize to other modalities with a similar tradeoff between model capacity and data volume. The three estimation approaches are straightforward to reproduce in new settings with different model architectures or data domains.
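As a sketch of what such a reproduction might involve, the snippet below fits the power-law exponent from IsoFLOP-style minima (in the spirit of the paper's Approach 2) by linear regression in log-log space. The data points are placeholders, not values from the paper.

```python
import numpy as np

# Given the loss-minimizing model size at several fixed compute budgets
# (e.g. read off IsoFLOP profiles), fit N_opt = k * C^a by linear
# regression in log-log space.
budgets = np.array([1e18, 1e19, 1e20, 1e21])   # FLOPs (hypothetical)
n_opts  = np.array([4e8, 1.3e9, 4e9, 1.3e10])  # params (hypothetical)

a, log_k = np.polyfit(np.log(budgets), np.log(n_opts), 1)
print(f"fitted exponent a ≈ {a:.2f}")  # ≈ 0.5 would match the paper's result
```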