Chapter 13 of DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Abstract

This chapter evaluates the efficacy of distilling reasoning capabilities from the DeepSeek-R1 model into smaller student architectures compared to training from scratch via Reinforcement Learning (RL). It establishes that high-quality distillation enables models with as few as 1.5 billion parameters to surpass larger closed-source baselines on mathematical and coding benchmarks. Furthermore, the chapter details the critical dependency of RL effectiveness on base model capacity, presenting empirical evidence that smaller models fail to leverage long chains of thought without substantial pre-training investment, while distillation remains a computationally economical alternative for democratizing AI access.

Key Concepts

  • Model Distillation: The process of transferring reasoning trajectories from DeepSeek-R1 to smaller student models using approximately 800,000 generated samples. This technique relies on Supervised Fine-Tuning (SFT) without an RL stage, demonstrating that high-quality teacher outputs consistently outperform human-generated data baselines, corroborating prior findings by Busbridge et al., 2025.
  • Reinforcement Learning from Scratch: A training methodology where models are optimized using policy gradients on math, code, and STEM data without distillation. This approach, exemplified by Qwen2.5-32B-Zero, requires over 10K update steps and significant computational resources, often yielding lower performance than distilled counterparts for smaller architectures.
  • Base Model Capacity Dependency: The experimental finding that the efficacy of RL training is strictly dependent on the underlying model size. Configurations such as 7B dense or 16B Mixture-of-Experts (MoE) models failed to improve on AIME benchmarks due to repetition and an inability to leverage long chains of thought, whereas 32B, 230B, and 671B MoE models showed substantial gains.
  • Verifier Mechanisms: The reliance on rule-based reward models (RMs) or LLM-based evaluations to assess correctness against ground truth. This mechanism is critical for mitigating reward hacking; it is most effective for questions with concise, verifiable answers and generalizes less well to open-ended generation tasks where correctness is subjective.
  • Process Reward Models (PRM): A strategy attempting to guide models via fine-grained step-by-step rewards, which was found to have three main limitations: difficulty in explicitly defining fine-grained reasoning steps, challenges in automatically verifying intermediate steps, and susceptibility to reward hacking, since retraining the reward model requires additional resources and complicates the training pipeline.
  • Monte Carlo Tree Search (MCTS): An algorithm explored to enhance test-time compute by guiding the model to explore the solution space via systematic search. While effective for games such as chess, it faced scaling issues in token generation due to the exponentially larger search space and the difficulty of training a fine-grained value model to drive iterative improvement.
  • Chain-of-Thought (CoT): A reasoning framework prompting models to generate intermediate reasoning steps before final answers. This chapter notes that without an RL stage, complex inference patterns required for long-chain reasoning remain largely unexplored in standard SFT pipelines.
  • Inference-time Compute Scaling: The strategy of trading additional inference compute for better performance, for example by generating diverse reasoning chains and selecting the best answer using rerankers or process-based reward models (a minimal self-consistency sketch follows this list).
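
The following is a minimal, self-contained sketch of the self-consistency idea behind inference-time compute scaling: sample several reasoning chains and keep the majority-vote answer. The generate_chain function is a hypothetical stand-in for a model's sampling API, not part of the chapter's actual pipeline.

    # Minimal sketch of inference-time compute scaling via self-consistency.
    # `generate_chain` is a hypothetical placeholder: in practice it would call
    # a model with sampling enabled and extract the final answer from the chain.
    import random
    from collections import Counter

    def generate_chain(prompt: str) -> str:
        """Placeholder sampler returning one candidate final answer."""
        return random.choice(["42", "42", "41"])  # dummy answer distribution

    def self_consistency(prompt: str, n_samples: int = 64) -> str:
        """Sample n diverse chains and return the majority-vote answer."""
        answers = [generate_chain(prompt) for _ in range(n_samples)]
        return Counter(answers).most_common(1)[0][0]

    print(self_consistency("What is 6 * 7?"))

A reranker or process-based reward model would replace the simple majority vote when final answers are not directly comparable strings.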

Key Equations and Algorithms

  • Distillation Training Algorithm: This procedure involves generating 800,000 samples with DeepSeek-R1 and applying only SFT to student models. Because it omits the RL stage entirely, its computational cost is substantially lower than RL training, and performance scales steadily with student model size (a minimal SFT sketch follows this list).
  • RL Training Pipeline: A multi-stage pipeline described in Appendix B.4.1 where base models like Qwen2.5-32B-Base undergo large-scale RL training for over 10,000 steps. This process utilizes policy gradients to explore optimal reasoning trajectories not captured by human-annotated traces (a bare-bones policy-gradient sketch follows this list).
  • MCTS Evaluation Procedure: An algorithm where prompts are used to search for answers under the guidance of a pre-trained value model. The procedure breaks answers into smaller parts to allow systematic exploration, though it risks getting stuck in local optima because of the maximum extension limit imposed on each node (see the MCTS sketch after this list).
  • Statistical Significance Testing: A method used to validate model performance differences, indicated in Table 15. This testing confirms that the performance improvements observed for distilled models over baselines such as GPT-4o-0513 are not due to random variance.
  • Reward Verification Logic: The logic for mitigating reward hacking uses two approaches: rule-based RMs and LLM-based assessment. Rule-based verification is robust for tasks with well-defined answers, whereas open-ended tasks rely on LLM-based assessment and the SFT stage to prevent the suboptimal behavior caused by exclusive reliance on RL (the policy-gradient sketch after this list pairs such a rule-based check with a simple RL update).
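
As a concrete illustration of the distillation training algorithm above, here is a minimal PyTorch SFT loop assuming the teacher traces have already been tokenized. ToyCausalLM and the random integer "teacher samples" are deliberately tiny stand-ins for a real student model and dataset, not the chapter's actual setup.

    # Minimal sketch of the distillation recipe: supervised fine-tuning (SFT)
    # on teacher-generated reasoning traces, with no RL stage.
    import torch
    import torch.nn as nn

    VOCAB, DIM = 100, 32

    class ToyCausalLM(nn.Module):
        """A deliberately tiny next-token predictor standing in for a student LLM."""
        def __init__(self):
            super().__init__()
            self.embed = nn.Embedding(VOCAB, DIM)
            self.head = nn.Linear(DIM, VOCAB)

        def forward(self, tokens):                  # tokens: (batch, seq)
            return self.head(self.embed(tokens))   # logits: (batch, seq, vocab)

    # Pretend these are tokenized teacher traces (prompt + reasoning + answer).
    teacher_samples = torch.randint(0, VOCAB, (8, 16))

    model = ToyCausalLM()
    optim = torch.optim.AdamW(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    for step in range(100):
        inputs, targets = teacher_samples[:, :-1], teacher_samples[:, 1:]
        logits = model(inputs)
        # Standard SFT objective: next-token cross-entropy on the teacher outputs.
        loss = loss_fn(logits.reshape(-1, VOCAB), targets.reshape(-1))
        optim.zero_grad()
        loss.backward()
        optim.step()

The key point is that the objective is plain next-token cross-entropy on teacher outputs; no reward signal or RL machinery is involved.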
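
The next sketch pairs a rule-based reward check, as described under Reward Verification Logic, with a bare-bones REINFORCE-style policy-gradient update. It illustrates the mechanism only; the toy candidate completions and the single-tensor "policy" are assumptions, and the chapter's actual large-scale RL pipeline is not reproduced here.

    # Rule-based verification plus a minimal policy-gradient (REINFORCE) update.
    import re
    import torch

    def rule_based_reward(completion: str, ground_truth: str) -> float:
        """Reward 1.0 iff the last number in the completion matches the reference."""
        numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
        return 1.0 if numbers and numbers[-1] == ground_truth else 0.0

    # Toy "policy": a categorical distribution over four candidate completions.
    candidates = ["The answer is 10", "The answer is 12", "It is 12", "Maybe 7"]
    logits = torch.zeros(len(candidates), requires_grad=True)
    optim = torch.optim.SGD([logits], lr=0.5)

    for step in range(200):
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        reward = rule_based_reward(candidates[int(action)], ground_truth="12")
        # REINFORCE: raise the log-probability of completions that earn reward.
        loss = -dist.log_prob(action) * reward
        optim.zero_grad()
        loss.backward()
        optim.step()

    print(torch.softmax(logits, dim=-1))  # mass concentrates on the "12" answers

Because the verifier only rewards correct final answers, probability mass shifts toward completions ending in "12", mirroring the exploration-plus-verification loop that the full pipeline runs at scale.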
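
Finally, a compact, generic MCTS sketch over partial answers, guided by a value function and capped by a per-node extension limit as described in the MCTS Evaluation Procedure above. The 0/1 "answer segments" and the toy value model are illustrative assumptions and do not reflect the actual experiments.

    # Generic MCTS over partial answers, guided by a value model and limited
    # by a per-node extension cap.
    import math
    import random

    SEGMENTS = ["0", "1"]     # toy answer segments appended at each step
    MAX_DEPTH = 6             # a complete answer has at most 6 segments
    MAX_CHILDREN = 2          # extension limit imposed on every node

    def value_model(partial: str) -> float:
        """Toy value model: prefers answers containing many '1' segments."""
        return partial.count("1") / MAX_DEPTH

    class Node:
        def __init__(self, state, parent=None):
            self.state, self.parent = state, parent
            self.children, self.visits, self.total = [], 0, 0.0

        def ucb(self, c=1.4):
            # Upper confidence bound balancing exploitation and exploration.
            return self.total / self.visits + c * math.sqrt(
                math.log(self.parent.visits) / self.visits)

    def mcts(root, iterations=200):
        for _ in range(iterations):
            node = root
            # Selection: descend while the current node is fully extended.
            while node.children and len(node.children) >= MAX_CHILDREN:
                node = max(node.children, key=Node.ucb)
            # Expansion: add one unexplored continuation if the answer is incomplete.
            if len(node.state) < MAX_DEPTH:
                used = {child.state[-1] for child in node.children}
                segment = random.choice([s for s in SEGMENTS if s not in used])
                node.children.append(Node(node.state + segment, parent=node))
                node = node.children[-1]
            # Evaluation: score the partial answer with the value model.
            reward = value_model(node.state)
            # Backpropagation: propagate the score up to the root.
            while node is not None:
                node.visits += 1
                node.total += reward
                node = node.parent
        # Commit to the most-visited first segment.
        return max(root.children, key=lambda n: n.visits).state

    print(mcts(Node("")))

As the chapter notes, this style of search works when the branching factor is small, as in board games, but scales poorly when every generated token is a potential branch.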

Key Claims and Findings

  • DeepSeek-R1-Distill-Qwen-1.5B surpasses non-reasoning baselines on mathematical benchmarks despite having only 1.5 billion parameters, outperforming GPT-4o-0513 (9.3%) and Claude-3.5-Sonnet-1022 (16.0%) on AIME 2024.
  • DeepSeek-R1-Distill-Qwen-32B achieves a score of 72.6 on AIME 2024, statistically outperforming the Qwen2.5-32B-Zero model trained via large-scale RL, which scored 60.0 on the same benchmark.
  • Smaller models relying solely on large-scale RL require enormous computational power and may fail to achieve the performance levels attained through distillation, as evidenced by Qwen2.5-32B-Zero underperforming the distilled 32B variant across all benchmarks.
  • Base checkpoint selection is critical; models smaller than 32B dense or equivalent MoE sizes consistently failed to yield meaningful AIME improvements, exhibiting repetition and failure to leverage long chains of thought.
  • Process Reward Models (PRM) introduce significant computational overhead without proportional gains, as they complicate the training pipeline and require additional training resources to retrain the reward model whenever reward hacking occurs.
  • Iterative pipelines combining SFT and RL are indispensable; exclusive reliance on RL leads to reward hacking in ill-posed tasks, while depending solely on SFT prevents optimization of reasoning capabilities through exploration.

Terminology

  • DeepSeek-R1: The teacher model used to generate the 800,000 high-quality reasoning samples for the distillation dataset construction detailed in Appendix B.3.3.
  • SFT (Supervised Fine-Tuning): The training stage applied to distilled models that utilizes teacher outputs without an accompanying RL stage to demonstrate distillation efficacy.
  • RL (Reinforcement Learning): The training stage involving policy gradient updates to explore reasoning trajectories, applied here to Qwen2.5-32B-Base for over 10K steps.
  • Pass@1: The primary metric for evaluating reasoning performance, representing the average probability that a single sampled response is correct, estimated by averaging correctness over multiple sampled responses.
  • Cons@64: A metric representing the majority-vote (consensus) accuracy across 64 sampled responses, used to evaluate performance stability in Table 15 (both metrics are illustrated in the sketch after this list).
  • Base Checkpoint: The pre-trained model architecture (e.g., 32B Dense, 230B MoE) used as the foundation for further RL or SFT training.
  • Reward Hacking: Suboptimal behavior emerging when models exploit flaws in the reward signal, mitigated in this work through robust verifiers and iterative SFT pipelines.
  • Student Model: The smaller architecture (e.g., Llama-8B, Qwen-1.5B) being trained via distillation from the DeepSeek-R1 teacher outputs.
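
To make the two evaluation metrics concrete, here is a small sketch of how they are commonly computed from a batch of sampled responses; the exact-string-match answer comparison is a simplifying assumption.

    # Simplified pass@1 and cons@k computation over sampled final answers.
    from collections import Counter

    def pass_at_1(sampled_answers: list[str], reference: str) -> float:
        """Average correctness of individually sampled responses."""
        return sum(a == reference for a in sampled_answers) / len(sampled_answers)

    def cons_at_k(sampled_answers: list[str], reference: str) -> float:
        """Majority-vote (consensus) accuracy over the k sampled responses."""
        majority = Counter(sampled_answers).most_common(1)[0][0]
        return float(majority == reference)

    samples = ["12"] * 40 + ["7"] * 24      # 64 hypothetical samples
    print(pass_at_1(samples, "12"))         # 0.625
    print(cons_at_k(samples, "12"))         # 1.0

Pass@1 averages single-sample correctness, while cons@64 scores only the majority-vote answer, so the two can diverge on the same set of generations.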