Chapter 6 of Training Compute-Optimal Large Language Models
Appendices A–J
A — Training Dataset (MassiveText Composition)
| Subset | Disk size | Sampling proportion | Epochs in 1.4T tokens |
|---|---|---|---|
| MassiveWeb | 1.9 TB | 45% | 1.24 |
| Books | 2.1 TB | 30% | 0.75 |
| C4 | 0.75 TB | 10% | 0.77 |
| News | 2.7 TB | 10% | 0.21 |
| GitHub | 3.1 TB | 4% | 0.13 |
| Wikipedia | 0.001 TB | 1% | 3.40 |
Wikipedia and MassiveWeb are the only subsets seen for more than one epoch. The sampling distribution differs slightly from Gopher's to accommodate roughly 4× more training tokens.
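The epoch column is just the sampling proportion times the 1.4T-token budget divided by the tokens available in each subset. A minimal sketch of that arithmetic, with per-subset token counts back-derived from the table above rather than taken from the paper:

```python
# Epochs over each subset implied by its sampling proportion and a fixed
# training-token budget. Token counts per subset are back-derived from the
# table (proportion * 1.4T / epochs); they are illustrative, not reported
# MassiveText statistics.
TOTAL_TOKENS = 1.4e12

# subset -> (sampling proportion, approximate tokens available)
subsets = {
    "MassiveWeb": (0.45, 5.1e11),
    "Books":      (0.30, 5.6e11),
    "C4":         (0.10, 1.8e11),
    "News":       (0.10, 6.7e11),
    "GitHub":     (0.04, 4.3e11),
    "Wikipedia":  (0.01, 4.1e9),
}

for name, (proportion, available_tokens) in subsets.items():
    tokens_drawn = proportion * TOTAL_TOKENS      # tokens sampled from this subset
    epochs = tokens_drawn / available_tokens      # passes over the subset
    print(f"{name:12s} {epochs:.2f} epochs")
```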
B — Optimal Cosine Cycle Length
The cosine learning rate schedule must be calibrated to the actual number of training steps. Setting the cycle length to 1.25× or more of the true step count causes measurable performance degradation; at 1.5× or longer the degradation is pronounced.
This is the key methodological difference from Kaplan et al., who used a fixed ~130B-token cosine schedule for every model. For runs shorter than 130B tokens the learning rate never decays fully, so the final loss looks worse than it would under a correctly calibrated schedule. This systematically undervalues training smaller models on more data, leading to the incorrect conclusion that model size matters more than training tokens.
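A minimal sketch of the calibration issue (illustrative constants, not the paper's hyperparameters): when the cosine cycle is longer than the actual run, training ends with the learning rate still far above its floor.

```python
import math

def cosine_lr(step, cycle_steps, max_lr, min_lr_ratio=0.1):
    """Cosine decay from max_lr down to min_lr_ratio * max_lr over cycle_steps
    (warmup omitted for brevity)."""
    progress = min(step / cycle_steps, 1.0)
    min_lr = min_lr_ratio * max_lr
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

train_steps = 10_000
# Cycle matched to the run: the learning rate reaches its floor by the end.
print(cosine_lr(train_steps, cycle_steps=train_steps, max_lr=1e-4))              # ~1.0e-5
# Cycle 1.5x too long: training stops with the learning rate still ~3x higher.
print(cosine_lr(train_steps, cycle_steps=int(1.5 * train_steps), max_lr=1e-4))   # ~3.3e-5
```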
C — Scaling Results Consistency Across Datasets
IsoFLOP analysis (Approach 2) replicated on C4 and on GitHub code. Both yield the same conclusion: N and D should be scaled in roughly equal proportion with compute. The result is not a MassiveText artifact. Caveat: all runs are in the single-epoch regime.
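A minimal sketch of the isoFLOP-profile fit used in Approach 2, on synthetic numbers rather than the paper's runs: at one fixed FLOP budget, fit a parabola to final loss against log model size and read off the loss-minimising size.

```python
import numpy as np

def isoflop_minimum(param_counts, losses):
    """Fit loss ~ a*(log N)^2 + b*(log N) + c for runs at a single FLOP budget
    and return the model size at the parabola's vertex."""
    log_n = np.log(param_counts)
    a, b, c = np.polyfit(log_n, losses, deg=2)
    return float(np.exp(-b / (2.0 * a)))

# Synthetic isoFLOP profile: several model sizes, each trained until the same
# FLOP budget is exhausted (so smaller models see more tokens).
sizes  = np.array([0.4e9, 0.7e9, 1.0e9, 1.6e9, 2.8e9])
losses = np.array([2.41, 2.35, 2.33, 2.34, 2.39])
print(f"loss-minimising size: {isoflop_minimum(sizes, losses) / 1e9:.2f}B parameters")
```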
D — Details on Scaling Analyses
Full training configurations for all 400+ runs. A head-to-head comparison at 10^21 FLOPs confirms that the smaller, Chinchilla-predicted model outperforms the larger, Kaplan-predicted model at the same compute budget.
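The arithmetic behind the head-to-head comparison is just FLOPs ≈ 6ND: at a fixed budget, a smaller model can be trained on proportionally more tokens. A minimal sketch with illustrative model sizes (not the exact figures from the appendix):

```python
def tokens_for_budget(compute_flops, n_params):
    """Training tokens implied by FLOPs ~ 6 * N * D at a fixed compute budget."""
    return compute_flops / (6 * n_params)

C = 1e21  # FLOP budget of the head-to-head comparison

# Illustrative sizes: a larger "Kaplan-style" model vs. a smaller
# "Chinchilla-style" model trained on correspondingly more tokens.
for label, n_params in [("larger model ", 4.5e9), ("smaller model", 2.8e9)]:
    d = tokens_for_budget(C, n_params)
    print(f"{label}: N = {n_params / 1e9:.1f}B, D = {d / 1e9:.0f}B tokens, D/N = {d / n_params:.0f}")
```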
E — Curvature of the FLOP–Loss Frontier
Slight concavity is observed in N_opt(C) at high compute budgets (log scale). This is why Approach 3 (the parametric fit) gives the most conservative N_opt: its Huber-loss fit assigns relatively lower weight to low-compute points, and the curvature pulls N_opt down further. The concavity implies that the true optimal model size at very large compute may be even smaller than any of the three approaches predict.
F — FLOPs Computation
FLOPs ≈ 6ND for a transformer language model:
- Factor of 2: each multiply-accumulate counts as two FLOPs (one multiply, one add)
- Factor of 3: one forward pass plus a backward pass costing roughly twice the forward pass (gradients are taken with respect to both activations and weights)
Attention FLOPs are subdominant relative to linear layer FLOPs for the model sizes studied (N ≤ 16B parameters at sequence lengths ≤ 2048 tokens) and are excluded from the approximation.
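A minimal sketch of the approximation and of why the attention term can be dropped; the attention estimate below is a rough quadratic-in-sequence-length term of my own, not the paper's exact per-operation accounting, and the model settings are roughly Chinchilla-scale for illustration.

```python
def training_flops(n_params, n_tokens, n_layers, d_model, seq_len=2048):
    """FLOPs ~ 6*N*D: 2 FLOPs per multiply-accumulate, times (1 forward pass
    + a backward pass costing roughly 2x the forward pass) = 6. Returns the
    dense-layer estimate and a rough attention-score estimate separately."""
    dense = 6 * n_params * n_tokens
    # QK^T scores and the attention-weighted value sum together cost roughly
    # 2 * seq_len * d_model MACs per token per layer in the forward pass;
    # multiply by 2 (FLOPs per MAC) and 3 (forward + backward).
    attention = 6 * n_layers * 2 * seq_len * d_model * n_tokens
    return dense, attention

dense, attn = training_flops(70e9, 1.4e12, n_layers=80, d_model=8192)
print(f"6ND:              {dense:.2e} FLOPs")
print(f"attention scores: {attn:.2e} FLOPs  (~{attn / dense:.0%} of 6ND)")
```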
G — Differences Between Chinchilla and Gopher
Three changes relative to Gopher: (1) AdamW instead of Adam, (2) a slightly modified SentencePiece tokenizer, (3) adjusted MassiveText sampling proportions. The appendix isolates each change's contribution; AdamW provides the largest performance improvement, particularly on downstream tasks after fine-tuning.
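A minimal sketch of the optimizer difference (bias correction and other details omitted; this illustrates the decoupling, not DeepMind's training code): Adam folds L2 weight decay into the gradient, where it gets rescaled by the adaptive statistics, while AdamW applies the decay directly to the weights.

```python
import numpy as np

def adam_l2_step(w, grad, m, v, lr, wd, b1=0.9, b2=0.999, eps=1e-8):
    """Adam with L2 regularisation: decay enters through the gradient and is
    rescaled by the second-moment estimate (bias correction omitted)."""
    g = grad + wd * w
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    return w - lr * m / (np.sqrt(v) + eps), m, v

def adamw_step(w, grad, m, v, lr, wd, b1=0.9, b2=0.999, eps=1e-8):
    """AdamW: decoupled weight decay, applied directly to the weights and left
    untouched by the adaptive rescaling (bias correction omitted)."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    return w - lr * (m / (np.sqrt(v) + eps) + wd * w), m, v
```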
H — Full Results Tables
Complete per-task results for Winogender (gender-bias breakdown by pronoun type and gotcha vs. not-gotcha examples), MMLU (57 individual tasks), and BIG-bench (62 individual tasks).
I — Model Card
Chinchilla model card per Mitchell et al. (2019): intended use (research, factual grounding, analysis assistance), out-of-scope uses (real-world decision-making without human oversight), training data, evaluation methodology, ethical considerations, known limitations.
J — List of Trained Models
Complete table of all 400+ models used in the scaling analysis: parameter count, training tokens, FLOPs, final training loss. Full empirical basis for all three scaling approaches.