Abstract
This paper introduces DeepSeek-R1 and its precursor DeepSeek-R1-Zero, demonstrating that advanced reasoning capabilities in large language models can be incentivized through large-scale reinforcement learning without relying on human-annotated chain-of-thought demonstrations. By forgoing initial supervised fine-tuning and training base models directly with Group Relative Policy Optimization (GRPO) guided by deterministic rule-based rewards, the authors show that models naturally evolve sophisticated reasoning behaviors such as self-reflection, backtracking, and dynamic test-time compute allocation. The work details a multi-stage training pipeline that corrects readability and safety deficiencies, validates the approach through distillation into lightweight open-weight models, and provides a comprehensive safety benchmark, establishing a scalable, verifier-driven paradigm for developing next-generation reasoning systems.
Key Concepts
- Outcome-based reinforcement learning for reasoning emergence
- Group Relative Policy Optimization (GRPO) and value-free advantage estimation
- Emergent long chain-of-thought (CoT) and self-correction mechanisms
- Dynamic test-time compute scaling based on input difficulty
- Cold-start initialization and rejection sampling for stability
- Knowledge distillation of reasoning trajectories into smaller models
- Multi-stage alignment pipeline (RL → SFT → Preference RL)
- External risk-control systems for jailbreak mitigation
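The "value-free advantage estimation" concept above can be sketched in a few lines. This is a minimal illustration of group-relative normalization, not the authors' implementation; the function name and the example rewards are hypothetical.

```python
# Minimal sketch of GRPO-style value-free advantage estimation:
# each sampled output's reward is normalized against its own group,
# so no separate value network is needed.
from statistics import mean, pstdev

def group_advantages(rewards):
    """advantage_i = (r_i - mean(group)) / std(group)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # Identical rewards carry no learning signal for this group.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]

# Example: two correct and two incorrect samples in a group of four.
print(group_advantages([1.0, 0.0, 0.0, 1.0]))  # [1.0, -1.0, -1.0, 1.0]
```

Note that correct samples receive positive advantage and incorrect ones negative, purely from within-group statistics.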
Key Equations and Algorithms
- GRPO Objective Function: Maximizes the clipped importance-sampled policy advantage across all outputs in a sampled group while penalizing distributional divergence from a reference model via a KL-divergence penalty term.
- Group-Based Advantage Estimation: Calculates per-sample advantage by normalizing each reward against the mean and standard deviation of its sampled group, approximating policy gradients without requiring a separate value network.
- Rule-Based Reward Formulation: Assigns a composite score by summing binary accuracy rewards (verifying output against ground truth or compiler rules) and format consistency rewards for structured tasks.
- Pairwise Ranking Reward Model: Trains a helper model using a cross-entropy loss over preference pairs to predict scalar helpfulness scores while explicitly controlling for positional and response length biases.
- Multi-Stage Post-Training Pipeline: Sequences cold-start SFT, language-consistency-constrained reasoning RL, rejection-sampled SFT, and a secondary preference-aligned RL stage to bootstrap reasoning emergence while correcting readability, safety, and instruction-following misalignments.
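The first two items above can be made concrete. The following is a sketch of the standard GRPO formulation (G is the group size, ε the clipping range, β the KL-penalty weight, π_ref the reference model); it reconstructs the general form, not necessarily the paper's exact notation:

```latex
\mathcal{J}_{\mathrm{GRPO}}(\theta) =
\mathbb{E}_{q,\;\{o_i\}_{i=1}^{G}\sim\pi_{\theta_{\mathrm{old}}}}
\Bigg[ \frac{1}{G}\sum_{i=1}^{G}
\min\!\Big( \tfrac{\pi_\theta(o_i\mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i\mid q)}\, A_i,\;
\mathrm{clip}\!\Big(\tfrac{\pi_\theta(o_i\mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i\mid q)},\,
1-\varepsilon,\,1+\varepsilon\Big) A_i \Big)
\Bigg]
- \beta\, \mathbb{D}_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big),
\qquad
A_i = \frac{r_i - \mathrm{mean}(\{r_1,\dots,r_G\})}{\mathrm{std}(\{r_1,\dots,r_G\})}
```

The group-relative advantage A_i replaces the learned value baseline that PPO would otherwise require.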
Key Claims and Findings
- Bypassing initial supervised fine-tuning and applying pure RL to a capable base model effectively stimulates the emergence of sophisticated, high-fidelity reasoning strategies unconstrained by human-annotation bias.
- GRPO matches or exceeds PPO performance on reasoning benchmarks while offering superior training stability and significantly lower memory overhead by eliminating the value model requirement.
- Trained models exhibit dynamic test-time compute scaling, automatically extending their reasoning token budget proportionally to problem complexity while remaining concise for trivial tasks.
- Distilling R1’s trajectories into open-weight models down to 1.5B parameters yields substantial reasoning gains, with the distilled models routinely surpassing non-reasoning proprietary baselines such as GPT-4o on math and coding tasks.
- The base R1 model maintains a moderate intrinsic safety profile, but integrating an external risk-control system elevates its resistance to jailbreak attacks and cross-lingual misalignment to state-of-the-art levels.
Terminology
- DeepSeek-R1-Zero: The initial RL-only variant trained exclusively from a base checkpoint using rule-based rewards, without cold-start data or preliminary supervised fine-tuning.
- Language Consistency (LC) Reward: A supplementary RL training signal calculated as the proportion of target-language tokens in the chain-of-thought to penalize code-switching and enforce monolingual coherence.
- Test-Time Compute Scaling: The model’s learned capability to dynamically allocate generation length and token counts at inference time based on the perceived difficulty of the input prompt.
- Reward Hacking: The phenomenon where a policy model discovers loopholes or exploits biases in a reward function to maximize objective scores without developing genuine capability or aligned behavior.
- Pass@k Evaluation: A sampling-based metric where the model generates k independent reasoning samples per query, and success is recorded if at least one of the k outputs matches the ground truth.
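The Pass@k definition above reduces to a short computation. This is a minimal sketch of the empirical (at-least-one-correct) form over a batch of queries; the function name and the example correctness flags are hypothetical, and unbiased estimators used in practice differ slightly.

```python
# Empirical pass@k: a query counts as solved if any of its k sampled
# outputs is marked correct; the metric is the solved fraction.
def pass_at_k(per_query_results):
    """per_query_results: list of length-k lists of booleans,
    one inner list per query, True meaning the sample was correct."""
    solved = sum(1 for samples in per_query_results if any(samples))
    return solved / len(per_query_results)

# Three queries, k = 2 samples each: queries 1 and 3 are solved.
print(pass_at_k([[False, True], [False, False], [True, True]]))
```

For small sample budgets, published evaluations often use an unbiased combinatorial estimator instead of this direct form, but the quantity being estimated is the same.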