Chapter 6 of Training Compute-Optimal Large Language Models

Appendices A–J

A — Training Dataset (MassiveText Composition)

Subset      | Disk size | Sampling proportion | Epochs in 1.4T tokens
MassiveWeb  | 1.9 TB    | 45%                 | 1.24
Books       | 2.1 TB    | 30%                 | 0.75
C4          | 0.75 TB   | 10%                 | 0.77
News        | 2.7 TB    | 10%                 | 0.21
GitHub      | 3.1 TB    | 4%                  | 0.13
Wikipedia   | 0.001 TB  | 1%                  | 3.40

Wikipedia and MassiveWeb are the only subsets seen for more than one epoch (3.40 and 1.24 epochs, respectively). The sampling distribution differs slightly from the one used for Gopher, to accommodate training on 4× more tokens.
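The table's columns are linked by simple arithmetic: tokens drawn from a subset = sampling proportion × 1.4T, and epochs = tokens drawn / subset size in tokens. A minimal sketch that backs the implied subset sizes out of the table (the disk sizes are bytes, not tokens, so they are not used here):

```python
# Sketch: tokens drawn per MassiveText subset in a 1.4T-token run, and the
# subset sizes (in tokens) implied by the reported epoch counts.
TOTAL_TOKENS = 1.4e12

subsets = {  # name: (sampling proportion, epochs in 1.4T tokens)
    "MassiveWeb": (0.45, 1.24),
    "Books":      (0.30, 0.75),
    "C4":         (0.10, 0.77),
    "News":       (0.10, 0.21),
    "GitHub":     (0.04, 0.13),
    "Wikipedia":  (0.01, 3.40),
}

for name, (prop, epochs) in subsets.items():
    drawn = prop * TOTAL_TOKENS   # tokens sampled from this subset
    size = drawn / epochs         # implied subset size in tokens
    print(f"{name:10s} drawn={drawn / 1e9:7.1f}B  implied size={size / 1e9:8.1f}B")
```

For example, MassiveWeb contributes 0.45 × 1.4T = 630B tokens, which at 1.24 epochs implies a subset of roughly 508B tokens.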

B — Optimal Cosine Cycle Length

The cosine learning rate schedule must be calibrated to the actual number of training steps. Setting the cycle length to 1.25× the target step count or more causes measurable performance degradation; at 1.5× or longer, the damage is clear.

This is the key methodological difference from Kaplan et al.: Kaplan et al. used a fixed 130B-token cosine schedule for all models. For runs shorter than 130B tokens, the learning rate never drops appropriately — making the loss appear worse than a correctly-calibrated run. This systematically undervalues training on more data at smaller model size, leading to the incorrect conclusion that model size matters more than training tokens.
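The mismatch is easy to see numerically. A minimal sketch of a cosine decay schedule (no warmup; the learning rates and step counts are illustrative, not the paper's values): when the cycle length matches the run, the learning rate fully decays to its minimum by the final step, but an overlong cycle leaves it well above the minimum at the end of training.

```python
import math

def cosine_lr(step, max_lr, min_lr, cycle_steps):
    """Cosine decay from max_lr to min_lr over cycle_steps (no warmup)."""
    t = min(step / cycle_steps, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * t))

train_steps = 10_000

# Calibrated: cycle length equals the run length, so the LR fully decays.
lr_matched = cosine_lr(train_steps, 1e-4, 1e-5, cycle_steps=train_steps)

# Overestimated by 1.5x (the Kaplan-style mismatch for short runs): at the
# end of training the LR is still several times the intended minimum.
lr_overlong = cosine_lr(train_steps, 1e-4, 1e-5, cycle_steps=15_000)
```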

C — Scaling Results Consistency Across Datasets

IsoFLOP analysis (Approach 2) replicated on C4 and GitHub code. Both yield the same equal N–D scaling conclusion. The result is not a MassiveText artifact. Caveat: all runs are in the single-epoch regime.
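The IsoFLOP procedure itself is simple: fix a compute budget C, sweep model size N with tokens D = C / (6N), and take the N with lowest loss. A sketch under stated assumptions — the loss model uses the parametric form from the paper's Approach 3, L(N, D) = E + A/N^alpha + B/D^beta, but with illustrative constants rather than the fitted values:

```python
# Illustrative constants for L(N, D) = E + A/N^ALPHA + B/D^BETA;
# not the paper's fitted coefficients.
E, A, B, ALPHA, BETA = 1.7, 400.0, 400.0, 0.34, 0.28

def loss(n_params, n_tokens):
    return E + A / n_params**ALPHA + B / n_tokens**BETA

def isoflop_optimum(budget_flops, grid_points=200):
    """Return (N, D) minimizing loss along the FLOPs = 6*N*D constraint."""
    best = None
    for i in range(grid_points):
        # geometric sweep of N from 1e7 to 1e12 parameters
        n = 10 ** (7 + 5 * i / (grid_points - 1))
        d = budget_flops / (6 * n)  # tokens affordable at this model size
        cur = loss(n, d)
        if best is None or cur < best[0]:
            best = (cur, n, d)
    _, n, d = best
    return n, d

n_opt, d_opt = isoflop_optimum(1e21)
```

Repeating this for several budgets C and fitting N_opt(C) and D_opt(C) as power laws is the essence of Approach 2.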

D — Details on Scaling Analyses

Full training configurations for all 400+ runs. Head-to-head comparison at 10^21 FLOPs confirms: the Chinchilla-predicted model size outperforms the Kaplan-predicted model size at the same compute budget.

E — Curvature of the FLOP–Loss Frontier

Slight concavity observed in N_opt(C) at high compute budgets (log scale). This is why Approach 3 (parametric fit) gives the most conservative N_opt — its Huber loss fitting assigns relatively lower weight to low-compute points, and the curvature pulls N_opt down further. Concavity implies the true optimal model at very large compute may be even smaller than any of the three approaches predict.
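For reference, the Huber loss mentioned above is quadratic for small residuals and linear for large ones, so outlying points contribute less to the fit than they would under squared error. A minimal sketch; the threshold δ here is illustrative, not necessarily the value used in the paper:

```python
def huber(residual, delta=1e-3):
    """Huber loss: quadratic inside |r| <= delta, linear outside,
    so large residuals are down-weighted relative to squared error."""
    r = abs(residual)
    if r <= delta:
        return 0.5 * r * r
    return delta * (r - 0.5 * delta)
```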

F — FLOPs Computation

FLOPs ≈ 6ND for a transformer language model with N parameters trained on D tokens:

  • Factor of 2: multiply-accumulate operation counting
  • Factor of 3: one forward pass plus a backward pass costing roughly twice the forward pass (gradients with respect to both activations and weights)

Attention FLOPs are subdominant relative to linear layer FLOPs for the model sizes studied (N ≤ 16B parameters at sequence lengths ≤ 2048 tokens) and are excluded from the approximation.
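The approximation above can be sketched in a few lines; the 70B / 1.4T figures below are the Chinchilla training configuration, plugged in as a worked example:

```python
def approx_training_flops(n_params, n_tokens):
    """C ≈ 6 * N * D: 2 FLOPs per multiply-accumulate, times
    (1 forward + 2 backward) passes over the weights, per token."""
    return 6 * n_params * n_tokens

# Chinchilla-scale example: 70B parameters trained on 1.4T tokens.
flops = approx_training_flops(70e9, 1.4e12)  # ≈ 5.9e23 FLOPs
```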

G — Differences Between Chinchilla and Gopher

Three changes: (1) AdamW vs. Adam optimizer, (2) modified SentencePiece tokenizer, (3) adjusted MassiveText sampling proportions. Appendix G isolates each change’s contribution. AdamW provides the largest performance improvement, particularly on downstream tasks after fine-tuning.

H — Full Results Tables

Complete per-task results for Winogender (gender bias breakdown by pronoun type and gotcha/not-gotcha), MMLU (57 individual tasks), BIG-bench (62 individual tasks).

I — Model Card

Chinchilla model card per Mitchell et al. (2019): intended use (research, factual grounding, analysis assistance), out-of-scope uses (real-world decision-making without human oversight), training data, evaluation methodology, ethical considerations, known limitations.

J — List of Trained Models

Complete table of all 400+ models used in the scaling analysis: parameter count, training tokens, FLOPs, final training loss. Full empirical basis for all three scaling approaches.