Chapter 2 of DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Abstract

This chapter details the evolution in training methodology from DeepSeek-R1-Zero to the final DeepSeek-R1 model, focusing on the transition from purely autonomous self-evolution to a structured multi-stage reinforcement learning (RL) pipeline. It establishes the DeepSeek-R1-Zero baseline, characterized by intrinsic increases in thinking time and emergent “aha moments” during training, before addressing readability and language-consistency issues through a cold-start policy and subsequent RL phases. The central technical contribution is a hybrid reward system that integrates rule-based incentives for reasoning with model-based rewards for general helpfulness and safety, alongside a language consistency mechanism that mitigates multilingual mixing. Empirical results demonstrate that this pipeline aligns reasoning capabilities (e.g., AIME 2024, Codeforces) with general instruction-following benchmarks (e.g., AlpacaEval 2.0), showing that RL can autonomously unlock sophisticated problem-solving strategies when properly incentivized.

Key Concepts

  • DeepSeek-R1-Zero Self-Evolution: This phase relies solely on RL to enhance reasoning capabilities without explicit supervision. The model demonstrates a steady increase in thinking time and token generation, driven by intrinsic adaptation rather than external modifications. This autonomous refinement leads to advanced strategies like reflective reasoning, marked by a distinct “aha moment” where the frequency of the word “wait” spikes during reflection.
  • Multi-Stage Training Pipeline: The DeepSeek-R1 architecture utilizes a sequential process starting from a DeepSeek-V3 Base model. It proceeds through cold-start supervised fine-tuning (SFT), an initial RL stage, a second SFT incorporating rejection sampling, and a final RL stage. Checkpoints named Dev1, Dev2, and Dev3 represent intermediate states within this pipeline, each optimizing for different balance points between reasoning and general language capabilities.
  • Hybrid Reward System: The training process employs a combination of rule-based rewards for math and coding tasks and model-based rewards for general queries. The general reward signal is assigned dynamically according to whether the query belongs to the safety or the helpfulness dataset. This dual approach ensures rigorous performance on verifiable reasoning tasks while maintaining alignment with human preferences for non-reasoning content (see the reward-routing sketch after this list).
  • Language Consistency Reward: An auxiliary reward term introduced during RL training to address the language mixing observed with DeepSeek-V3-Base. This reward is calculated as the proportion of target-language words within the Chain of Thought (CoT). While this alignment slightly degrades raw performance on some reasoning benchmarks, it significantly improves readability and matches human preferences for monolingual responses (see the language-consistency sketch after this list).
  • Model-Based Preference Alignment: For general data not covered by rule-based systems, preference pairs are curated to train the reward models. The helpful reward model uses a pairwise loss over DeepSeek-V3 judgments, while the safety reward model uses point-wise annotations. This allows the policy to optimize for nuanced human preferences regarding utility, relevance, and harmlessness without explicit rule definitions (see the pairwise-loss sketch after this list).
  • Staged Hyperparameter Tuning: The first RL stage utilizes a higher sampling temperature of 1.0 to explore diverse generations, whereas the second stage reduces temperature to 0.7 to ensure coherence. The KL coefficient is maintained at 0.001 to prevent excessive divergence from the reference model, and the GRPO clip ratio is set to 10 to balance gradient stability with effective policy updates.
  • Rejection Sampling and SFT Integration: Between RL stages, the model undergoes SFT using both reasoning and non-reasoning datasets. This step enables the model to retain advanced writing capabilities while excelling at reasoning tasks. Rejection sampling is applied to ensure high-quality data is used for fine-tuning, further aligning the model with desired output formats and conversational styles.
  • Rollout and Batch Configuration: Training steps involve sampling 16 outputs per question with a maximum length of 32,768 tokens to capture extensive reasoning traces. Each training step processes 32 unique questions, resulting in a total batch size of 512 per step. To accelerate computation, rollouts generate 8,192 outputs split into 16 minibatches, trained for a single inner epoch per step.
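
The reward routing described in the Hybrid Reward System item can be made concrete with a minimal sketch. All function names and dataset tags below are illustrative assumptions; the chapter specifies only the routing logic (rule-based rewards for verifiable reasoning data, model-based safety or helpfulness rewards for general queries), not an implementation.

    # Minimal sketch of hybrid reward routing (Python). All names are
    # hypothetical; the placeholder scorers stand in for the learned
    # reward models described in the chapter.

    def rule_based_reward(response: str, reference: str) -> float:
        # Deterministic check for verifiable tasks (math, code, logic).
        return 1.0 if response.strip() == reference.strip() else 0.0

    def helpful_reward(query: str, response: str) -> float:
        return 0.5  # placeholder for a reward model trained on preference pairs

    def safety_reward(query: str, response: str) -> float:
        return 0.5  # placeholder for a reward model trained on point-wise labels

    def compute_reward(query: str, response: str, source: str,
                       reference: str = "") -> float:
        """Route a sample to a rule-based or model-based reward by data source."""
        if source in {"math", "code", "logic"}:
            return rule_based_reward(response, reference)
        if source == "safety":
            return safety_reward(query, response)
        return helpful_reward(query, response)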
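
The language consistency reward is described only as the proportion of target-language words in the CoT. A minimal sketch of that proportion follows; the word-level language test is a crude assumption (ASCII letters for English, CJK characters for Chinese), since the chapter does not specify how target-language words are detected.

    import re

    def language_consistency_reward(cot: str, target_lang: str = "en") -> float:
        """Fraction of CoT words that appear to be in the target language."""
        words = cot.split()
        if not words:
            return 0.0
        if target_lang == "en":
            pattern = re.compile(r"^[A-Za-z]+[\.,;:!?]*$")  # ASCII-word heuristic
        else:
            pattern = re.compile(r"[\u4e00-\u9fff]")        # any CJK character
        matches = sum(1 for w in words if pattern.search(w))
        return matches / len(words)

    # During RL this term is added to the task reward, e.g.:
    # total_reward = accuracy_reward + language_consistency_reward(cot)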
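
For the helpful reward model, the chapter mentions a pairwise loss over DeepSeek-V3 preference judgments. A standard Bradley-Terry-style pairwise loss is sketched below as an assumption; the chapter does not spell out the exact loss function.

    import torch
    import torch.nn.functional as F

    def pairwise_preference_loss(chosen_scores: torch.Tensor,
                                 rejected_scores: torch.Tensor) -> torch.Tensor:
        """Bradley-Terry-style loss: push the scalar score of the chosen
        response above that of the rejected one. Shapes: (batch,) each."""
        return -F.logsigmoid(chosen_scores - rejected_scores).mean()

    # Usage sketch: the scores would come from a reward-model head over
    # (query, response) pairs; the tensors here are stand-in values.
    chosen = torch.tensor([1.2, 0.3, 0.8])
    rejected = torch.tensor([0.4, 0.5, -0.1])
    loss = pairwise_preference_loss(chosen, rejected)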

Key Equations and Algorithms

  • Reward Signal Formulation: For general queries, the reward corresponds to the specific reward (safety or helpfulness) defined within the dataset associated with the given query. This formulation ensures that the model receives appropriate guidance based on the nature of the user input.
  • KL Divergence Constraint: The training optimizes against a reference model using a KL coefficient set to 0.001. This parameter penalizes deviation of the current policy from the reference distribution, ensuring stability during the reinforcement learning process.
  • Clipping Mechanism: The GRPO objective clips the importance-sampling ratio using a clip ratio (denoted ε), set to a value of 10. This parameter bounds the size of policy updates during optimization, preventing the instability that arises from excessively large updates (see the clipped-objective sketch after this list).
  • Sampling Temperature Adjustment: Rollout generation is governed by a sampling temperature. This is set to 1.0 during the first RL stage to encourage exploration and later reduced to 0.7 in the second stage to prioritize generation coherence.
  • Language Consistency Metric: The reward component for language stability is computed as the proportion of target language words within the generated Chain of Thought. This metric is directly added to the final reward to enforce language purity during reasoning.
  • Training Batch Assembly: The batch size is 32 × 16 = 512, representing 32 unique questions with 16 sampled outputs each. This configuration ensures sufficient statistical variance for policy-gradient estimation during each training step.
  • Rollout Limit: The maximum sequence length for training rollouts is capped at 32,768 tokens. This constraint dictates the maximum computational context available for generating long Chain of Thought traces during the reinforcement learning phase.
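
The hyperparameters listed above can be tied together in a sketch of a GRPO-style clipped surrogate with a KL penalty. Only the numerical values (KL coefficient 0.001, clip ratio 10, temperatures 1.0 and 0.7, 32 questions × 16 outputs per step, 32,768-token rollouts) come from this chapter; the objective itself follows the standard GRPO formulation from the literature, and exactly how the reported clip ratio of 10 enters the clipping is an assumption here.

    import torch

    # Hyperparameter values reported in this chapter.
    KL_COEF = 0.001                       # KL penalty coefficient
    CLIP_RATIO = 10.0                     # clip ratio; much looser than typical PPO values
    TEMP_STAGE1, TEMP_STAGE2 = 1.0, 0.7   # sampling temperatures per RL stage
    QUESTIONS_PER_STEP, SAMPLES_PER_QUESTION = 32, 16
    BATCH_SIZE = QUESTIONS_PER_STEP * SAMPLES_PER_QUESTION  # 512
    MAX_ROLLOUT_TOKENS = 32_768

    def group_advantages(rewards: torch.Tensor) -> torch.Tensor:
        """Group-relative advantage: normalize rewards across the 16 outputs
        sampled for the same question (standard GRPO practice)."""
        return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

    def grpo_style_loss(logp_new: torch.Tensor,
                        logp_old: torch.Tensor,
                        logp_ref: torch.Tensor,
                        advantages: torch.Tensor) -> torch.Tensor:
        """Per-token clipped surrogate with a KL penalty (minimal sketch)."""
        ratio = torch.exp(logp_new - logp_old)
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1.0 - CLIP_RATIO, 1.0 + CLIP_RATIO) * advantages
        surrogate = torch.minimum(unclipped, clipped)
        # k3 estimator of KL(pi_new || pi_ref), commonly used with GRPO.
        kl = torch.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1.0
        return -(surrogate - KL_COEF * kl).mean()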

Key Claims and Findings

  • DeepSeek-R1-Zero exhibits a distinct “aha moment” during training, characterized by a sudden increase in the use of the word “wait” during reflections, signifying a fundamental shift in reasoning patterns without external intervention.
  • The initial DeepSeek-R1-Zero model struggles with poor readability and language mixing due to the multilingual nature of the DeepSeek-V3-Base training data, necessitating the structured DeepSeek-R1 pipeline.
  • DeepSeek-R1 Dev1 shows substantial improvement in instruction-following on IF-Eval and ArenaHard benchmarks compared to Zero, though it experiences a partial degradation in reasoning performance on AIME due to limited cold-start data.
  • DeepSeek-R1 Dev2 achieves significant performance enhancements on reasoning-intensive benchmarks, including Codeforces (2,029 rating) and AIME 2024 (79.8 Pass@1), while maintaining general task capabilities.
  • DeepSeek-R1 Dev3 integrates large-scale non-reasoning corpora to enhance proficiency in general language generation, resulting in notable improvements on AlpacaEval 2.0 without sacrificing reasoning capabilities.
  • The final DeepSeek-R1 model demonstrates that reasoning-oriented RL considerably enhances reasoning capabilities while exerting limited influence on user preference-oriented benchmarks, provided the training pipeline includes general instruction data.

Terminology

  • Cold Start Data: A collection of thousands of long, conversational, human-aligned chain-of-thought examples used in the initial SFT stage to establish a readable thinking format before RL training begins.
  • Aha Moment: A specific phase in DeepSeek-R1-Zero training characterized by a sudden spike in the frequency of the term “wait,” marking a transition to more sophisticated reflective reasoning strategies.
  • Language Consistency Reward: An auxiliary reward signal calculated based on the proportion of target language words in the CoT, designed to mitigate language mixing during multilingual generation.
  • Rule-Based Reward: A deterministic reward signal used primarily for reasoning data (math, coding, logic) where the correctness of the solution can be explicitly verified.
  • Model-Based Reward: A learned reward signal derived from preference pairs, used for general queries to capture subjective notions of helpfulness and harmlessness that rules cannot define.
  • KL Coefficient: A hyperparameter, set to 0.001, used to constrain the divergence between the current policy and the reference model during RL optimization.
  • Clip Ratio (ε): A hyperparameter, set to 10 for the GRPO algorithm, that bounds the importance-sampling ratio and hence the size of each policy update, maintaining training stability.
  • Rollout: The process of generating outputs (in this case, 16 per question) from the policy model to evaluate against reward signals during the training step.