Chapter 1 of DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Abstract

This chapter introduces the DeepSeek-R1 series of large language models (LLMs), specifically focusing on the mechanism of incentivizing reasoning capabilities through pure reinforcement learning (RL) without supervised fine-tuning (SFT). The central technical contribution is the development of DeepSeek-R1-Zero, which employs Group Relative Policy Optimization (GRPO) to train on reasoning tasks using only final answer correctness as a reward signal. This approach facilitates the emergent development of advanced reasoning patterns, such as self-reflection and verification, allowing the model to surpass the average performance of human competitors on verifiable benchmarks such as AIME 2024 without relying on human-labeled reasoning trajectories. The chapter further details the transition to DeepSeek-R1, which aligns these reasoning capabilities with human preferences through a multi-stage learning framework, establishing a scalable pathway for complex cognitive tasks in artificial intelligence.

Key Concepts

  • DeepSeek-R1-Zero Architecture: This refers to the baseline model trained exclusively via reinforcement learning on the DeepSeek-V3-Base foundation, bypassing any prior supervised fine-tuning phase. The motivation for this design is to prevent human-defined reasoning patterns from capping the model’s exploration of superior, non-human-like reasoning pathways. By removing SFT constraints, the model is free to evolve its own strategy for problem-solving, which leads to the emergence of self-reflection and dynamic strategy adaptation during the training process.
  • Group Relative Policy Optimization (GRPO): GRPO is the reinforcement learning algorithm adopted to train both DeepSeek-R1-Zero and DeepSeek-R1, chosen to simplify the training process and reduce resource consumption compared to Proximal Policy Optimization (PPO). For each input question $q$, the algorithm samples a group of outputs $\{o_1, o_2, \ldots, o_G\}$ from the old policy $\pi_{\theta_{\mathrm{old}}}$ and optimizes the policy model $\pi_\theta$ by maximizing an objective that is regularized towards a reference policy $\pi_{\mathrm{ref}}$. This method allows for efficient scaling on large batches of reasoning problems while maintaining stability through group-relative advantage estimation in place of a learned critic.
  • Rule-Based Reward System: The reward signal for DeepSeek-R1-Zero is derived entirely from rule-based mechanisms rather than neural reward models, relying on the correctness of final predictions against ground-truth answers. This system consists of Accuracy rewards, which verify the final answer via deterministic checks or compiler-run test cases, and Format rewards, which enforce the encapsulation of reasoning within designated tags such as <think> and </think>. This design choice avoids the susceptibility of neural reward models to reward hacking during large-scale reinforcement learning and reduces computational complexity; a minimal sketch of the reward and group-advantage computation appears after this list.
  • Emergent Reasoning Capabilities: Through the unrestricted RL training process, the model naturally develops sophisticated reasoning behaviors without explicit instruction, including the tendency to generate longer responses that incorporate verification and reflection. The model learns to flag moments of reevaluation, often using an anthropomorphic tone, such as “Wait, wait,” to identify and correct errors in its chain of thought. These behaviors emerge spontaneously as the policy maximizes the reward signal, indicating that the model internalizes complex cognitive strategies through the reinforcement signal.
  • Training Dynamics and Scaling: The training process for DeepSeek-R1-Zero exhibits a significant performance jump around step 8.2k, where both reasoning accuracy and average response length increase substantially, before continuing to a total of 10,400 steps. Each training step processes 32 unique questions for an overall batch size of 512 samples, and a KL coefficient of 0.001 balances policy updates against the reference model. This configuration allows the model to explore long-context reasoning up to 65,536 tokens while maintaining computational efficiency.
  • DeepSeek-R1 Multi-Stage Framework: While DeepSeek-R1-Zero excels at reasoning, it suffers from poor readability and language mixing, necessitating the development of DeepSeek-R1 through a multi-stage pipeline involving rejection sampling, reinforcement learning, and supervised fine-tuning. This pipeline allows the model to inherit the robust reasoning capabilities of its predecessor while aligning its behavior with human preferences through the inclusion of additional non-reasoning data. The result is a model that maintains high performance on STEM tasks while offering improved general utility in writing and open-domain question answering.
  • Model Distillation Strategy: To enable broader access and lower energy costs, the reasoning capabilities of the large-scale DeepSeek-R1 series are distilled into several smaller publicly available models. These distilled models inherit the advanced reasoning traits, such as long chain-of-thought generation, enabling them to surpass the performance of their original instruction-tuned counterparts on reasoning benchmarks. This strategy facilitates community research into the mechanisms underlying long chain-of-thought (CoT) reasoning models and promotes the development of more powerful reasoning systems at a reduced computational footprint.
  • Human-Annotation Dependence Limitation: Traditional reasoning augmentation relies heavily on human-annotated demonstrations via Chain-of-Thought (CoT) prompting or supervised fine-tuning, which introduces cognitive biases and limits scalability. By relying on human-provided exemplars, prior methods inherently cap the model’s performance based on the quality of human thought processes, preventing the exploration of novel or superior reasoning pathways. The DeepSeek-R1 approach specifically addresses this by removing the dependency on human-labeled reasoning trajectories, allowing the model to self-evolve beyond the constraints of human demonstration.
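
The rule-based reward and the group-relative advantage described in the GRPO and reward bullets above can be sketched directly. The following Python sketch is illustrative rather than the chapter's implementation: the <think>/<answer> tag pattern, the exact-match answer check, and all helper names are assumptions, and a real verifier would use math-aware equivalence checks or compiler-run test cases.

```python
import re
from statistics import mean, stdev

# Expected output structure: reasoning inside <think> tags, final answer
# inside <answer> tags (tag names as described in the chapter).
THINK_ANSWER_PATTERN = re.compile(
    r"^<think>.*?</think>\s*<answer>.*?</answer>\s*$", re.DOTALL
)


def format_reward(response: str) -> float:
    """Format reward: 1.0 if the response follows the <think>/<answer> template."""
    return 1.0 if THINK_ANSWER_PATTERN.match(response.strip()) else 0.0


def accuracy_reward(response: str, ground_truth: str) -> float:
    """Accuracy reward: 1.0 if the extracted final answer matches the ground truth.

    Exact string matching stands in for the real deterministic verifier here.
    """
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0


def total_reward(response: str, ground_truth: str) -> float:
    """Equal-weight combination of accuracy and format rewards."""
    return accuracy_reward(response, ground_truth) + format_reward(response)


def group_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: each reward normalized by the group mean and std.

    This is the critic-free advantage estimate that GRPO uses in place of a
    learned value function.
    """
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    if sigma == 0.0:
        # All outputs scored identically, so no output is relatively better.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


# Example: four sampled responses to one question (ground truth "12").
responses = [
    "<think>3 * 4 = 12</think><answer>12</answer>",  # correct and well formatted
    "<think>3 * 4 = 13</think><answer>13</answer>",  # well formatted but wrong
    "12",                                             # no tags: both checks fail here
    "<think>unsure</think><answer>7</answer>",        # well formatted but wrong
]
rewards = [total_reward(r, "12") for r in responses]
print(rewards)                       # [2.0, 1.0, 0.0, 1.0]
print(group_advantages(rewards))
```

One consequence of this normalization is that when every sampled answer in a group receives the same reward, the group-relative advantage is zero, so such questions contribute no gradient signal.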

Key Equations and Algorithms

  • GRPO Policy Objective: Although the full objective function is visually omitted, the optimization maximizes $\mathcal{J}_{\mathrm{GRPO}}(\theta)$ over the policy $\pi_\theta$ using the reference policy $\pi_{\mathrm{ref}}$, hyper-parameters $\varepsilon$ and $\beta$, and the advantage $A_i$ (a reconstruction appears after this list). The objective relies on these components to update the model parameters based on the relative performance of sampled outputs within a group, ensuring the policy does not drift too far from $\pi_{\mathrm{ref}}$ while maximizing reward.
  • Group Advantage Calculation: The advantage $A_i$ is computed using the group of rewards $\{r_1, r_2, \ldots, r_G\}$ corresponding to the outputs within each group for a given question $q$. This calculation normalizes the rewards within the group to determine the relative benefit of a specific output compared to the group as a whole, driving the policy update towards higher-reward trajectories without needing an external critic model.
  • Reward Function Composition: The total reward for a response is the combination of the Accuracy reward and the Format reward with equal weights, i.e. $r = r_{\text{accuracy}} + r_{\text{format}}$. The accuracy term evaluates the correctness of the final answer or code against test cases, while the format term assigns positive feedback only if the reasoning process is enclosed within the designated <think> and </think> tags. This sum guides the model to simultaneously solve the problem and maintain a structured, interpretable output format.
  • Sampling and Rollout Protocol: The algorithm samples 16 outputs per question, with a maximum generation length of 32,768 tokens before step 8.2k and 65,536 tokens afterwards. During each rollout, 8,192 outputs are generated and randomly split into 16 mini-batches for training, and the reference model is replaced by the latest policy model every 400 steps to ensure the training signal remains relevant. This protocol keeps the per-step training batch at 512 samples (32 questions × 16 outputs) across the 10,400 total training steps.
  • DeepSeek-R1-Zero Template: The training template enforces a fixed structure in which the response must first present the reasoning process inside <think> ... </think> tags and then give the final answer inside <answer> ... </answer> tags. This template is applied uniformly across mathematical, coding, and logical reasoning domains to ensure that the model explicitly delineates its thought process, facilitating the Format reward calculation and subsequent analysis of reasoning patterns.
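
Since the objective itself is visually omitted above, the following LaTeX reconstruction is assembled from the components named in these bullets (old policy, reference policy, hyper-parameters $\varepsilon$ and $\beta$, and the group-relative advantage $A_i$) and from the standard GRPO formulation; it should be read as a consistent sketch rather than a verbatim copy of the chapter's equation.

$$
\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{\,q \sim P(Q),\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(O \mid q)}\!\left[ \frac{1}{G}\sum_{i=1}^{G}\left( \min\!\left( \frac{\pi_{\theta}(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)}\,A_i,\ \operatorname{clip}\!\left( \frac{\pi_{\theta}(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)},\, 1-\varepsilon,\, 1+\varepsilon \right) A_i \right) - \beta\, \mathbb{D}_{\mathrm{KL}}\!\left( \pi_{\theta}\,\|\,\pi_{\mathrm{ref}} \right) \right) \right]
$$

with the group-relative advantage

$$
A_i = \frac{r_i - \operatorname{mean}\!\left(\{r_1, r_2, \ldots, r_G\}\right)}{\operatorname{std}\!\left(\{r_1, r_2, \ldots, r_G\}\right)}.
$$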

Key Claims and Findings

  • DeepSeek-R1-Zero's pass@1 accuracy on the AIME 2024 benchmark climbs from an initial 15.6% without RL training to 77.9% after 10,400 training steps.
  • When leveraging self-consistency decoding with 16 samples (Cons@16), the model's performance on AIME 2024 increases further to 86.7%, significantly surpassing the average performance of human competitors.
  • The performance of DeepSeek-R1-Zero exhibits a significant jump around step 8.2k, with both the reasoning accuracy and the average response length per sample increasing simultaneously.
  • The model naturally develops diverse reasoning behaviors such as self-reflection and verification without explicit supervision, evidenced by the generation of anthropomorphic corrections within the chain of thought.
  • DeepSeek-R1-Zero demonstrates remarkable performance in coding competitions and graduate-level biology, physics, and chemistry problems, verifying the generalizability of the RL-induced reasoning capabilities across STEM fields.
  • Neural reward models were deliberately excluded from the training pipeline to prevent reward hacking and reduce computational complexity, relying instead on precise rule-based verification of final answers.

Terminology

  • DeepSeek-R1-Zero: The name of the base model trained exclusively via reinforcement learning without a prior supervised fine-tuning phase, capable of emergent reasoning behaviors.
  • DeepSeek-R1: The final model iteration that integrates the reasoning capabilities of DeepSeek-R1-Zero with human preference alignment through a multi-stage pipeline of rejection sampling, RL, and supervised fine-tuning.
  • GRPO (Group Relative Policy Optimization): An RL algorithm that optimizes the policy by sampling a group of outputs from the old policy for each question, estimating advantages from group-relative reward statistics instead of a learned critic, and regularizing updates towards a reference policy, reducing resource consumption compared to PPO.
  • Chain-of-Thought (CoT) Prompting: A technique involving few-shot examples or minimalistic prompts like “Let’s think step by step” to enable models to produce intermediate reasoning steps, used here as a baseline comparison.
  • Self-Consistency Decoding: A decoding strategy (referenced as Cons@16) in which multiple reasoning paths are sampled and the final answer is determined by majority vote over the sampled answers, used to boost answer accuracy; a short sketch follows this list.
  • AIME 2024: The American Invitational Mathematics Examination 2024, used as the primary mathematical benchmark to evaluate the reasoning capabilities of the models during training.
  • Pass@1: A metric representing the probability that a single generated sample solves a problem correctly, used to track the reasoning accuracy of DeepSeek-R1-Zero throughout training.
  • Format Rewards: A component of the reward system that incentivizes the model to wrap its reasoning process within specific tags, ensuring interpretability and standardizing the output structure.
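
To make the Pass@1 and Cons@16 terminology above concrete, here is a short illustrative Python sketch; the function names and the exact-match comparison are assumptions for illustration, not the benchmark's actual evaluation code.

```python
from collections import Counter


def pass_at_1(sampled_answers: list[str], ground_truth: str) -> float:
    """Pass@1 estimated as the fraction of independently sampled answers that are correct."""
    return sum(ans == ground_truth for ans in sampled_answers) / len(sampled_answers)


def cons_at_k(sampled_answers: list[str], ground_truth: str) -> bool:
    """Self-consistency (Cons@k): take the majority vote over the k sampled answers."""
    majority_answer, _count = Counter(sampled_answers).most_common(1)[0]
    return majority_answer == ground_truth


# 16 final answers extracted from 16 sampled reasoning paths for one problem.
answers = ["204"] * 10 + ["96"] * 4 + ["210", "12"]
print(pass_at_1(answers, "204"))   # 10/16 = 0.625
print(cons_at_k(answers, "204"))   # True: "204" is the most frequent answer
```

The example shows why Cons@16 can exceed pass@1: even when fewer than all samples are correct, the correct answer can still dominate the vote.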