Chapter 4 of DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Abstract

This chapter details the experimental validation of the DeepSeek-R1 framework, contrasting conventional post-training paradigms with the proposed reinforcement learning approach. It demonstrates that standard Supervised Fine-Tuning (SFT) can impede the development of effective reasoning strategies due to static target bias, motivating the DeepSeek-R1-Zero model which relies on direct exploration. The technical core establishes Group Relative Policy Optimization (GRPO) as the primary RL algorithm, comparing its resource efficiency and performance against Proximal Policy Optimization (PPO) on the MATH task while outlining the decoupled RL infrastructure required for large-scale training.

Key Concepts

  • Conventional Post-Training Paradigm: A two-stage framework involving Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL), widely adopted to align LLMs with human expectations. SFT utilizes curated input-output pairs to minimize cross-entropy loss (a minimal loss sketch follows this list), ensuring precise task alignment and computational efficiency, though its static nature may limit generalization to novel contexts or evolving preferences.
  • SFT Limitations in Reasoning: The text argues that SFT may hinder the model’s exploration of reasoning strategies because human-provided responses often omit critical reflection and verification steps. This dependency on non-optimal human priors restricts the model from discovering more robust reasoning trajectories independently of explicit supervision.
  • DeepSeek-R1-Zero: A model variant designed to bypass the SFT stage, enabling direct exploration of reasoning patterns by the model itself without human priors. The reasoning trajectories discovered through this self-exploration are subsequently distilled to train other models, promoting the acquisition of generalizable reasoning capabilities.
  • Group Relative Policy Optimization (GRPO): An RL algorithm adopted to train DeepSeek-R1-Zero and DeepSeek-R1, designed to simplify training and reduce resource consumption compared to PPO. For each query $q$, GRPO samples a group of outputs $\{o_1, o_2, \ldots, o_G\}$ from the old policy $\pi_{\theta_{\text{old}}}$ and optimizes the policy model $\pi_\theta$ by maximizing an objective function that relies on group scores for advantage estimation (see the advantage sketch after this list).
  • Value Model Omission: A critical architectural deviation in GRPO which foregoes the value model typically required in PPO. By estimating advantages directly from group scores rather than a learned value function, GRPO significantly reduces memory requirements and computational overhead, avoiding the difficulty of predicting cumulative rewards for partial long-chain-of-thought responses.
  • Periodic Reference Policy Update: A stabilization technique employed in GRPO where the reference policy is periodically updated to the latest policy during training. This balances the scope of exploration allowed by the training policy with the stability of the optimization process, preventing significant divergence that could occur over thousands of training steps.
  • Decoupled RL Infrastructure: An extensible framework partitioned into Rollout, Inference, Rule-based Reward, and Training modules. This design facilitates the integration of diverse models and allows for independent optimization of each phase, such as offloading model instances from VRAM to memory between stages to save resources.
  • Multi-Token Prediction (MTP): A decoding acceleration technique leveraged within the Rollout Module for self-speculative decoding. This component significantly accelerates the decoding speed, minimizing the completion time for the longest samples, which is critical for efficiency in large-scale reinforcement learning pipelines.
  • Asynchronous Reward Scheduling: A latency-hiding strategy used in the Rule-based Reward Module to overlap execution with the Rollout and Inference phases. Although code execution and format checking are time-consuming, this approach ensures they do not block the GPU-bound inference and training processes, improving overall system throughput (a concurrency sketch follows this list).
  • Data Packing Strategy: An optimization in the Training Module to minimize sequence padding waste. Data is sorted by length and distributed across processes using a Best-Fit strategy to pack sequences into fixed-length chunks, ensuring the number of chunks is equal across all processes to balance the workload within data parallel groups (see the packing sketch after this list).
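
The SFT loss in the first item above can be made concrete with a short sketch. This is a toy illustration of minimizing cross-entropy against a curated target, not the paper's training code; `logits` and `target_ids` are random stand-ins for model outputs and a gold response.

```python
import torch
import torch.nn.functional as F

# Toy SFT objective: token-level cross-entropy between next-token logits
# and a curated target response. Values are illustrative placeholders.
vocab_size, seq_len = 32, 5
logits = torch.randn(seq_len, vocab_size)              # model outputs per position
target_ids = torch.randint(0, vocab_size, (seq_len,))  # curated "gold" tokens

# In practice, prompt tokens are masked so the loss covers only the
# response; SFT pushes the policy toward these static targets.
sft_loss = F.cross_entropy(logits, target_ids)
print(sft_loss.item())
```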
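
The group-relative advantage estimation described in the GRPO and Value Model Omission items can be sketched in a few lines, assuming the standard normalization of each reward against the group mean and standard deviation; the reward values here are illustrative.

```python
import torch

# GRPO advantage sketch: rewards for G sampled outputs of the same query
# are normalized by group statistics, so no learned value model is needed.
group_rewards = torch.tensor([1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0])  # r_1..r_G

# A_i = (r_i - mean(r)) / std(r): outputs scoring above the group average
# receive positive advantages, the rest negative.
advantages = (group_rewards - group_rewards.mean()) / (group_rewards.std() + 1e-8)
print(advantages)
```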
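
The overlap behind Asynchronous Reward Scheduling can be sketched with a thread pool: slow, CPU-bound checks are submitted as futures so the GPU-bound rollout loop never waits on them. `run_code_and_format_checks` and `sample_next_batch` are hypothetical stand-ins, not the actual module interfaces.

```python
from concurrent.futures import ThreadPoolExecutor

def run_code_and_format_checks(completion: str) -> float:
    # Placeholder for time-consuming rule-based reward computation.
    return float("```" in completion)

def sample_next_batch(step: int) -> list[str]:
    # Placeholder for GPU-bound rollout of the next batch.
    return [f"response {step}-{i} ```code```" for i in range(4)]

with ThreadPoolExecutor(max_workers=8) as pool:
    pending = None
    for step in range(3):
        batch = sample_next_batch(step)              # GPU-bound work
        if pending is not None:
            rewards = [f.result() for f in pending]  # previous batch's rewards
            print(f"step {step - 1} rewards: {rewards}")
        # Schedule reward checks so they overlap with the next rollout step.
        pending = [pool.submit(run_code_and_format_checks, c) for c in batch]
    print(f"step 2 rewards: {[f.result() for f in pending]}")
```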
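
The Best-Fit packing from the Data Packing Strategy item, sketched for a single process with illustrative sequence lengths: each sequence goes into the open chunk whose remaining capacity fits it most tightly, and a new chunk opens only when none fits.

```python
CHUNK_LEN = 16
lengths = sorted([9, 7, 5, 12, 3, 4, 8, 6], reverse=True)

chunks: list[list[int]] = []   # packed sequence lengths per chunk
free: list[int] = []           # remaining capacity per chunk

for n in lengths:
    # Best fit: the open chunk with the smallest remaining space >= n.
    candidates = [i for i, f in enumerate(free) if f >= n]
    if candidates:
        i = min(candidates, key=lambda i: free[i])
        chunks[i].append(n)
        free[i] -= n
    else:
        chunks.append([n])
        free.append(CHUNK_LEN - n)

print(chunks)  # -> [[12, 4], [9, 7], [8, 6], [5, 3]]
```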

Key Equations and Algorithms

  • GRPO Advantage Estimation: The advantage $A_i$ is computed from the rewards $\{r_1, r_2, \ldots, r_G\}$ of the outputs within each group by normalizing each reward against the group mean and standard deviation. This relative advantage allows the policy to be optimized without a separate critic, using the spread within the sampled group to normalize the signal.
  • GRPO Objective Function: The policy model $\pi_\theta$ is optimized by maximizing an objective involving $\pi_\theta$, $\pi_{\theta_{\text{old}}}$, and $\pi_{\text{ref}}$, where $\pi_{\text{ref}}$ serves as the reference policy and $\beta$ controls the KL penalty. The objective incorporates an unbiased estimator of the KL divergence added directly to the loss (see the reconstructed equations after this list).
  • PPO GAE Calculation: In PPO, advantages are typically computed by applying Generalized Advantage Estimation (GAE) based on rewards and a learned value model (a minimal sketch follows this list). This requires predicting the expected cumulative reward from the current position onward, which is challenging for long generation tasks.
  • KL Penalty Formulation: In GRPO, an unbiased estimator of the KL divergence is added directly to the loss, whereas PPO adds a per-token KL penalty as a dense reward. GRPO's formulation avoids the implicit penalty on response length that can arise when PPO accumulates per-token KL penalties over long generations.
  • RL Pipeline Execution: The algorithm follows a decoupled flow: Rollout (sampling) → Inference (reward models) → Rule-based Reward (code/format) → Training (loss/updates). The procedure includes automatically offloading model instances from VRAM between modules to manage memory constraints effectively.
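
For reference, the objective, advantage, and KL estimator described in the first two items, reconstructed in standard notation from the published GRPO formulation ($\epsilon$ is the clipping range, $\beta$ the KL coefficient):

```latex
\mathcal{J}_{\mathrm{GRPO}}(\theta) =
  \mathbb{E}_{q \sim P(Q),\, \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(O \mid q)}
  \frac{1}{G} \sum_{i=1}^{G} \Bigg(
    \min\!\Bigg( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)} A_i,\;
    \mathrm{clip}\!\Bigg( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)},\,
    1-\epsilon,\, 1+\epsilon \Bigg) A_i \Bigg)
    - \beta\, \mathbb{D}_{\mathrm{KL}}\big( \pi_\theta \,\|\, \pi_{\mathrm{ref}} \big) \Bigg)

A_i = \frac{r_i - \mathrm{mean}(\{r_1, \ldots, r_G\})}{\mathrm{std}(\{r_1, \ldots, r_G\})}

\mathbb{D}_{\mathrm{KL}}\big( \pi_\theta \,\|\, \pi_{\mathrm{ref}} \big) =
  \frac{\pi_{\mathrm{ref}}(o_i \mid q)}{\pi_\theta(o_i \mid q)}
  - \log \frac{\pi_{\mathrm{ref}}(o_i \mid q)}{\pi_\theta(o_i \mid q)} - 1
```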
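
A minimal GAE sketch for the PPO item above, assuming $\gamma = 1.0$ and illustrative reward/value numbers; in real PPO the values come from a learned value model, which GRPO omits.

```python
import torch

def gae(rewards: torch.Tensor, values: torch.Tensor,
        gamma: float = 1.0, lam: float = 0.95) -> torch.Tensor:
    # delta_t = r_t + gamma * V(s_{t+1}) - V(s_t);
    # A_t = sum_k (gamma * lam)^k * delta_{t+k}
    T = rewards.shape[0]
    next_values = torch.cat([values[1:], torch.zeros(1)])
    deltas = rewards + gamma * next_values - values
    advantages = torch.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages

rewards = torch.tensor([0.0, 0.0, 0.0, 1.0])  # sparse outcome reward at the end
values = torch.tensor([0.2, 0.3, 0.5, 0.8])   # value-model predictions per step
print(gae(rewards, values))  # with lam=1.0 this reduces to returns minus V
```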

Key Claims and Findings

  • SFT hinders models from exploring effective reasoning strategies because human responses often omit critical components like explicit reflection.
  • GRPO outperforms PPO on the MATH task when PPO uses default parameters (e.g., the GAE coefficient $\lambda$ at its common default of 0.95), though carefully tuning $\lambda$ to 1.0 can make PPO performance comparable.
  • Eliminating the value model in GRPO reduces memory and computational overhead while avoiding the difficulty of predicting final outcomes from intermediate steps in long chains-of-thought.
  • PPO is highly sensitive to the GAE coefficient $\lambda$ and demands additional hyperparameter tuning compared to GRPO, which presents a more practical alternative for large-scale models.
  • The decoupled RL infrastructure supports efficient training by overlapping rule-based reward execution with inference and training phases to hide latency.
  • Periodic updates to the reference policy are necessary to balance exploration and stability, preventing the trained policy from diverging significantly from the reference over thousands of training steps.

Terminology

  • GRPO: Group Relative Policy Optimization; an RL algorithm that estimates advantages from group output scores instead of a value model.
  • PPO: Proximal Policy Optimization; a standard RL algorithm that uses a value model and GAE to compute advantages for policy updates.
  • SFT: Supervised Fine-Tuning; a post-training step refining pre-trained LLMs on curated datasets to align with specific tasks or standards.
  • RLHF: Reinforcement Learning from Human Feedback; an RL approach where the reward function encodes human preferences to optimize model outputs.
  • Value Model: A neural network component used in PPO to predict expected cumulative rewards, which increases memory overhead and is difficult to train for long reasoning tasks.
  • Advantage ($A_i$): A metric computed in GRPO using group rewards to determine the relative benefit of a specific output compared to the group average.
  • MTP: Multi-Token Prediction; a technique used within the Rollout Module to accelerate decoding via self-speculative prediction.
  • DualPipe: An algorithm integrated into the Training Module to achieve efficient pipeline parallelism, utilized previously in DeepSeek-V3 training.
  • Reference Policy: A baseline policy used to constrain the divergence of the training policy, updated periodically to maintain training stability.
  • vLLM: An inference engine used in the Rollout Module to dispatch prompts across workers and manage memory for efficient sampling.