Chapter 3 of Training Compute-Optimal Large Language Models
Estimating the Optimal Parameter/Token Allocation
Three independent approaches converge on the same answer: model size N and training tokens D should scale in roughly equal proportion with compute.
Fig. 2 — Training curve envelope (Approach 1): for each FLOP budget, the optimal model size (center) and training tokens (right) are extracted from the envelope of training loss curves across model sizes 70M–10B.
Approach 1 — Training Curve Envelope
For each model size (70M–10B parameters), train on 4 different token counts, each with a cosine learning-rate schedule whose length matches that token count. For each FLOP budget, identify the (N, D) pair that achieves the lowest loss. Fit power laws to the envelope of these optimal points.
Result: N_opt ∝ C^0.50, D_opt ∝ C^0.50
The key: every model uses a cosine schedule calibrated to its own training duration. Kaplan et al. (2020) instead relied on intermediate losses from runs whose schedules extended well past the point being measured, which overestimates mid-training loss and skews the fitted exponents.
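A minimal sketch of the envelope fit, with made-up (C, N) points standing in for the real envelope data; only the log-log fitting step mirrors the procedure described above:

```python
import numpy as np

# Hypothetical envelope points: for each compute budget C (FLOPs), the (N, D)
# pair that achieved the lowest training loss across all runs at that budget.
C = np.array([1e18, 1e19, 1e20, 1e21])           # compute budgets
N_opt = np.array([2.0e8, 6.3e8, 2.0e9, 6.3e9])   # best model size at each budget
D_opt = C / (6 * N_opt)                          # tokens implied by C ≈ 6·N·D

# A power law N_opt ∝ C^a is a straight line in log-log space.
a, _ = np.polyfit(np.log(C), np.log(N_opt), 1)
b, _ = np.polyfit(np.log(C), np.log(D_opt), 1)
print(f"N_opt ∝ C^{a:.2f}, D_opt ∝ C^{b:.2f}")   # ≈ C^0.50 for both on this toy data
```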
Approach 2 — IsoFLOP Profiles
Fix 9 FLOP budgets (6×10^18 to 3×10^21 FLOPs). At each budget, vary the model size and set the training tokens so total FLOPs stay constant. Plotting final loss against parameter count yields a clear valley; fitting a parabola to each valley (loss vs. log parameter count) locates the optimal N.
Fig. 3 — IsoFLOP curves: each curve shows final loss vs. parameter count at a fixed compute budget. The valley gives N_opt for that budget.
Result: N_opt ∝ C^0.49, D_opt ∝ C^0.51
This approach directly answers: “for this FLOP budget, what is the optimal model size?” without interpolation.
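A sketch of the valley fit for a single isoFLOP profile, using invented (N, loss) points; the parabola-in-log-N fit and its vertex are the essential steps:

```python
import numpy as np

# Hypothetical isoFLOP slice: final loss vs. model size at one fixed FLOP budget.
N = np.array([4e8, 7e8, 1.2e9, 2e9, 3.5e9, 6e9])       # model sizes (parameters)
loss = np.array([2.42, 2.35, 2.32, 2.33, 2.38, 2.46])  # final training loss

# Fit a parabola in log10(N); its vertex is the estimated optimum for this budget.
logN = np.log10(N)
c2, c1, _ = np.polyfit(logN, loss, 2)    # loss ≈ c2·x² + c1·x + c0
logN_opt = -c1 / (2 * c2)                # vertex of the parabola
print(f"N_opt ≈ {10**logN_opt:.2g} parameters")
```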
Approach 3 — Parametric Loss Function
Model all final losses as:
L̂(N, D) = E + A/N^α + B/D^β
where:
- E = irreducible loss (entropy of natural language)
- A/N^α = capacity-limited excess loss (model too small)
- B/D^β = data-limited excess loss (insufficient training tokens)
Fit (A, B, E, α, β) by minimizing a Huber loss (δ = 10^-3, robust to outliers) on the log-loss residuals with L-BFGS over all 400+ runs. A grid of initializations avoids local minima.
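A simplified sketch of that fit, assuming the Huber penalty is applied to residuals in log-loss space; the parameterization, initialization grid, and starting values here are illustrative rather than the paper's exact setup:

```python
import numpy as np
from scipy.optimize import minimize

def huber(r, delta=1e-3):
    """Huber penalty: quadratic for |r| <= delta, linear beyond (robust to outliers)."""
    quad = np.minimum(np.abs(r), delta)
    lin = np.abs(r) - quad
    return 0.5 * quad**2 + delta * lin

def objective(theta, N, D, L):
    # theta = (log A, log B, log E, alpha, beta); fitting the logs keeps A, B, E positive.
    logA, logB, logE, alpha, beta = theta
    # log of the predicted loss E + A/N^alpha + B/D^beta, via a stable logsumexp.
    pred_log = np.logaddexp.reduce(
        np.stack([logA - alpha * np.log(N),
                  logB - beta * np.log(D),
                  np.full(np.shape(N), logE)]),
        axis=0,
    )
    return huber(pred_log - np.log(L)).sum()

def fit_scaling_law(N, D, L):
    """Fit (A, B, E, alpha, beta) over all runs with L-BFGS and a grid of inits."""
    best = None
    for a0 in (0.2, 0.4, 0.6):        # coarse grid over the exponents
        for b0 in (0.2, 0.4, 0.6):    # to avoid poor local minima
            res = minimize(objective, x0=np.array([5.0, 5.0, 0.5, a0, b0]),
                           args=(N, D, L), method="L-BFGS-B")
            if best is None or res.fun < best.fun:
                best = res
    logA, logB, logE, alpha, beta = best.x
    return np.exp(logA), np.exp(logB), np.exp(logE), alpha, beta
```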
Efficient frontier (minimize L̂ subject to FLOPs ≈ 6ND):
Result: N_opt ∝ C^0.46, D_opt ∝ C^0.54
Approach 3 gives the most conservative (smallest) N_opt due to observed curvature in the frontier at high compute (see Appendix E).
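Sketch of the closed-form frontier: substituting D = C/(6N) into L̂ and setting the derivative with respect to N to zero gives power laws whose exponents depend only on α and β.

$$
\hat{L}(N) = E + \frac{A}{N^{\alpha}} + B\left(\frac{6N}{C}\right)^{\beta},
\qquad
\frac{\partial \hat{L}}{\partial N} = 0
\;\Rightarrow\;
N_{\mathrm{opt}}(C) = G\left(\tfrac{C}{6}\right)^{a},
\quad
D_{\mathrm{opt}}(C) = G^{-1}\left(\tfrac{C}{6}\right)^{b},
$$
$$
a = \frac{\beta}{\alpha+\beta},
\qquad
b = \frac{\alpha}{\alpha+\beta},
\qquad
G = \left(\frac{\alpha A}{\beta B}\right)^{\frac{1}{\alpha+\beta}}.
$$

With the paper's fitted exponents (α ≈ 0.34, β ≈ 0.28), this gives a ≈ 0.45 and b ≈ 0.55, consistent with the 0.46/0.54 quoted above up to rounding and the curvature correction.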
Fig. 4 — Parametric fit and efficient frontier: contour plot of L̂(N, D) (left) and isoFLOP slices (right). The efficient frontier (blue) is the Pareto-optimal boundary.
Summary — All Three Approaches vs. Kaplan et al.
| Approach | a (N_opt ∝ C^a) | b (D_opt ∝ C^b) |
|---|---|---|
| 1. Training curve envelope | 0.50 | 0.50 |
| 2. IsoFLOP profiles | 0.49 | 0.51 |
| 3. Parametric fit | 0.46 | 0.54 |
| Kaplan et al. (2020) | 0.73 | 0.27 |
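To make the difference concrete: a 10× increase in compute multiplies the optimal model size by 10^0.50 ≈ 3.2× under these fits but 10^0.73 ≈ 5.4× under Kaplan et al., while the optimal token count grows by ≈3.2× versus only 10^0.27 ≈ 1.9×.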
Practical Look-Up Table (Approach 1)
| Model size | FLOPs required | Training tokens required |
|---|---|---|
| 1B | 1.21×10^20 | 20.2B |
| 10B | 1.23×10^22 | 205B |
| 67B | 5.76×10^23 (Gopher budget) | 1.5T |
| 175B | 3.85×10^24 | 3.7T |
| 280B | 9.90×10^24 | 5.9T |
| 520B | 3.43×10^25 | 11.0T |
A 175B-parameter model should be trained on roughly 3.7 trillion tokens to be compute-optimal, about 12× the ~300B tokens GPT-3 was actually trained on. A 1T-parameter model is not compute-optimal until the budget exceeds ~10^26 FLOPs, more than 170× the Gopher budget.
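The table implies a roughly constant ratio of ~20 training tokens per parameter, which together with FLOPs ≈ 6ND gives a back-of-the-envelope allocator. The helper below (compute_optimal, tokens_per_param) is an illustration derived from the table, not a function from the paper:

```python
import math

def compute_optimal(C, tokens_per_param=20.0):
    """Rule-of-thumb allocation implied by the look-up table above.

    Assumes C ≈ 6·N·D and the roughly constant D/N ≈ 20 tokens per parameter
    that the table exhibits; the paper's fits, not this helper, are authoritative.
    """
    N = math.sqrt(C / (6 * tokens_per_param))   # parameters
    D = tokens_per_param * N                    # training tokens
    return N, D

# Example: the Gopher budget of 5.76e23 FLOPs.
N, D = compute_optimal(5.76e23)
print(f"N ≈ {N/1e9:.0f}B params, D ≈ {D/1e12:.1f}T tokens")  # ≈ 69B params, ≈ 1.4T tokens
```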