Abstract

This paper introduces DeepSeek-R1 and its precursor DeepSeek-R1-Zero, demonstrating that advanced reasoning capabilities in large language models can be incentivized through large-scale reinforcement learning without relying on human-annotated chain-of-thought demonstrations. By withholding initial supervised fine-tuning from the base model and training it with Group Relative Policy Optimization (GRPO) guided by deterministic rule-based rewards, the authors show that models naturally develop sophisticated reasoning behaviors such as self-reflection, backtracking, and dynamic test-time compute allocation. The work details a multi-stage training pipeline that corrects readability and safety deficiencies, validates the approach through distillation into lightweight open-weight models, and provides a comprehensive safety benchmark, establishing a scalable, verifier-driven paradigm for developing next-generation reasoning systems.

Key Concepts

  • Outcome-based reinforcement learning for reasoning emergence
  • Group Relative Policy Optimization (GRPO) and value-free advantage estimation
  • Emergent long chain-of-thought (CoT) and self-correction mechanisms
  • Dynamic test-time compute scaling based on input difficulty
  • Cold-start initialization and rejection sampling for stability
  • Knowledge distillation of reasoning trajectories into smaller models
  • Multi-stage alignment pipeline (RL → SFT → Preference RL)
  • External risk-control systems for jailbreak mitigation

Key Equations and Algorithms

  • GRPO Objective Function: Maximizes the clipped importance-sampled policy advantage across all outputs in a sampled group while penalizing distributional divergence from a reference model via a KL-divergence penalty term weighted by a coefficient β.
  • Group-Based Advantage Estimation: Calculates per-sample advantage using intra-group reward normalization (subtracting the group's mean reward and dividing by its standard deviation) to approximate policy gradients without requiring a separate value network.
  • Rule-Based Reward Formulation: Assigns a composite score by summing binary accuracy rewards (verifying output against ground truth or compiler rules) and format consistency rewards for structured tasks.
  • Pairwise Ranking Reward Model: Trains a helper model using a cross-entropy loss over preference pairs to predict scalar helpfulness scores while explicitly controlling for positional and response length biases.
  • Multi-Stage Post-Training Pipeline: Sequences cold-start RL, language consistency-constrained RL, rejection-sampled SFT, and secondary preference-aligned RL to bootstrap reasoning emergence while correcting readability, safety, and instruction-following misalignments.
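The GRPO objective and its group-normalized advantage described above can be written as follows (a reconstruction consistent with the paper's formulation; ρ_i here denotes the per-output importance ratio):

```latex
\mathcal{J}_{\mathrm{GRPO}}(\theta)=
\mathbb{E}_{q\sim P(Q),\;\{o_i\}_{i=1}^{G}\sim\pi_{\theta_{\mathrm{old}}}(\cdot\mid q)}
\left[\frac{1}{G}\sum_{i=1}^{G}
\min\!\big(\rho_i A_i,\ \operatorname{clip}(\rho_i,\,1-\varepsilon,\,1+\varepsilon)\,A_i\big)
-\beta\,\mathbb{D}_{\mathrm{KL}}\!\left(\pi_\theta\,\middle\|\,\pi_{\mathrm{ref}}\right)\right],

\quad\text{where}\quad
\rho_i=\frac{\pi_\theta(o_i\mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i\mid q)},
\qquad
A_i=\frac{r_i-\operatorname{mean}(\{r_1,\dots,r_G\})}{\operatorname{std}(\{r_1,\dots,r_G\})}.
```

Because the advantage A_i is computed purely from the rewards of the G sampled outputs for the same query, no learned value function is needed.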

Key Claims and Findings

  • Bypassing initial supervised fine-tuning and applying pure RL to a capable base model effectively stimulates the emergence of unconstrained, high-fidelity reasoning strategies without human-bias constraints.
  • GRPO matches or exceeds PPO performance on reasoning benchmarks while offering superior training stability and significantly lower memory overhead by eliminating the value model requirement.
  • Trained models exhibit dynamic test-time compute scaling, automatically extending their reasoning token budget proportionally to problem complexity while remaining concise for trivial tasks.
  • Distilling R1’s trajectories into open-weight models down to 1.5B parameters yields massive reasoning gains, routinely surpassing non-reasoning proprietary baselines like GPT-4o on math and coding tasks.
  • The base R1 model maintains a moderate intrinsic safety profile, but integrating an external risk-control system elevates its resistance to jailbreak attacks and cross-lingual misalignment to state-of-the-art levels.
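The value-free advantage estimation claimed above can be sketched in a few lines. This is an illustrative toy, not the authors' implementation: `rule_based_reward` stands in for the paper's actual verifier (ground-truth matching, compiler checks), and the group statistics mirror the intra-group normalization GRPO uses in place of a value network.

```python
import statistics

def rule_based_reward(output: str, ground_truth: str) -> float:
    """Binary accuracy reward: 1.0 if the final answer matches the
    reference, else 0.0 (a stand-in for the paper's verifier rules)."""
    return 1.0 if output.strip() == ground_truth.strip() else 0.0

def group_advantages(rewards: list[float]) -> list[float]:
    """Value-free advantage estimation: normalize each sampled output's
    reward by the mean and standard deviation of its group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:  # all outputs scored equally; no learning signal
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]

# Four sampled outputs for one query, two of them correct:
rewards = [rule_based_reward(o, "42") for o in ["42", "41", "7", "42"]]
print(group_advantages(rewards))  # correct outputs get positive advantage
```

Outputs that beat the group average receive positive advantage and are reinforced; the rest are pushed down, which is why a group that is uniformly right or uniformly wrong contributes no gradient.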

Terminology

  • DeepSeek-R1-Zero: The initial RL-only variant trained exclusively from a base checkpoint using rule-based rewards, without cold-start data or preliminary supervised fine-tuning.
  • Language Consistency (LC) Reward: A supplementary RL training signal calculated as the proportion of target-language tokens in the chain-of-thought to penalize code-switching and enforce monolingual coherence.
  • Test-Time Compute Scaling: The model’s learned capability to dynamically allocate generation length and token counts at inference time based on the perceived difficulty of the input prompt.
  • Reward Hacking: The phenomenon where a policy model discovers loopholes or exploits biases in a reward function to maximize objective scores without developing genuine capability or aligned behavior.
  • Pass@k Evaluation: A sampling-based metric where the model generates k independent reasoning samples per query, and success is recorded if at least one of the k outputs matches the ground truth.
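The pass@k metric defined above is typically computed with the standard unbiased estimator from the code-generation evaluation literature rather than by literally drawing k samples; a minimal sketch, assuming n samples per problem of which c are correct:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: the probability that at least one of
    k samples drawn (without replacement) from n generated samples is
    correct, given that c of the n are correct. Illustrative sketch of
    the standard estimator, not necessarily the paper's exact code."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill all k draws
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=2, c=1, k=1))  # one of two samples correct -> 0.5
```

Computing 1 − P(all k draws are incorrect) this way avoids the high variance of repeatedly subsampling k outputs.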

Connections to Existing Wiki Pages