Chapter 6 of DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Abstract
This chapter delineates the data engineering and evaluation methodologies underpinning the supervised fine-tuning (SFT) phase of the DeepSeek-R1 model. It details the construction of an 800,000-sample dataset spanning reasoning and non-reasoning domains, including adversarial input generation for stress-testing time-complexity constraints. It establishes the protocol for using DeepSeek-V3 as a generative reward model to evaluate reasoning trajectories via rejection sampling. Technically, it specifies the mathematical principles behind optimization problems used in the training data, such as Lagrange multipliers for constrained minimization, and defines the structural constraints on the model's thinking process to ensure clarity and alignment with user context.
Key Concepts
- Adversarial Input Generation: This technique involves constructing large, diverse inputs designed to exceed execution time limits in incorrect code implementations. The method employs Python functions to generate strings whose lengths fall within a specified range (e.g., 90,000 to 100,000 characters). Variants include repeated characters, alternating two-character patterns (e.g., "abab…"), and sequential alphabetical characters, all intended to stress-test algorithmic efficiency (a code sketch follows this list).
- Generative Reward Model: DeepSeek-V3 is utilized to assess the quality of reasoning answers by comparing model outputs against reference solutions. The prompt structure requires the model to classify answers into "correct" or "incorrect" categories based on reasoning alignment and conclusion accuracy. This replaces rule-based rewards in specific stages, allowing for semantic evaluation of complex reasoning traces that deterministic checkers cannot verify.
- Rejection Sampling for Reasoning Data: The reasoning dataset is curated by sampling multiple responses from a first-stage RL checkpoint and retaining only those evaluated as correct by the generative reward model. This filtering ensures that the roughly 600,000 reasoning samples used for SFT training exhibit high-quality, error-free chain-of-thought trajectories. Mixed-language outputs and chaotic formatting are excluded during this selection phase (see the filtering sketch after this list).
- Thinking Process Design Principles: Specific guidelines govern the model's internal reasoning output to enhance interpretability and utility. The principles mandate concise, digestible paragraphs with a natural, conversational tone. Crucially, the thinking process must begin by analyzing the complete user context, including unstated needs and situational constraints, to facilitate accurate interaction.
- Non-Reasoning Data Composition: This category encompasses tasks such as creative writing, factual QA, self-cognition, translation, and software engineering (program repair, web development). Approximately 200,000 samples are collected, reusing parts of the DeepSeek-V3 SFT dataset. Unlike reasoning tasks, simpler queries do not require a chain-of-thought preamble, whereas complex engineering tasks may invoke generative thinking steps.
- Chain-of-Thought (CoT) Prompting: Used for both evaluation and generation, CoT prompts guide the model to articulate its steps explicitly before providing a final answer. Simple math problems are presented with structured responses including "Step 1", "Step 2", and a boxed final answer. The prompt ensures the model identifies user intent and derives the solution logically before concluding.
- Multi-turn Interaction Limitations: The dataset statistics indicate that the majority of interactions are single-turn, limiting the model's conversational capabilities in extended dialogues. The training data prioritizes single-round problem solving, leaving multi-turn dialogue expansion as a future work item. This constraint affects the model's ability to maintain context over long conversational histories.
- Verifiable Ground Truth: The mathematical and logical datasets consist of questions that are verifiable through deterministic rules or specific reference answers. This ensures that the reward signals and SFT targets are objective. Domains like Math and Code provide strict metrics for correctness, contrasting with open-ended creative tasks in the General domain.
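
The adversarial input generation described above (and specified algorithmically under Key Equations and Algorithms below) can be summarized in a few lines of Python. The sketch below is a minimal illustration rather than the paper's actual generator; the function names are invented, and the 90,000–100,000 length range and the three string variants follow the descriptions in this document.

```python
import random
import string


def gen_repeated(n: int) -> str:
    """Variant 1: a single character repeated n times."""
    return random.choice(string.ascii_lowercase) * n


def gen_alternating(n: int) -> str:
    """Variant 2: two distinct characters alternating (e.g. 'ababab...')."""
    c1, c2 = random.sample(string.ascii_lowercase, 2)
    return ((c1 + c2) * (n // 2 + 1))[:n]


def gen_sequential(n: int) -> str:
    """Variant 3: the alphabet repeated in order, truncated to length n."""
    return (string.ascii_lowercase * (n // 26 + 1))[:n]


def adversarial_input(min_len: int = 90_000, max_len: int = 100_000) -> str:
    """Draw a length in [min_len, max_len] and build one of the variants.

    Inputs of this size are meant to push quadratic-or-worse solutions past
    the execution time limit while efficient implementations still pass.
    """
    n = random.randint(min_len, max_len)
    variant = random.choice([gen_repeated, gen_alternating, gen_sequential])
    return variant(n)
```

A test harness would feed such strings to candidate implementations and flag those that exceed the time limit, which is exactly the failure mode these inputs are designed to expose.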
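Rejection sampling with a generative reward model, as outlined above, is essentially a sample-judge-filter loop. The sketch below is a simplified illustration under assumptions: `sample_responses` and `judge_with_v3` are hypothetical stand-ins for the first-stage RL checkpoint sampler and the DeepSeek-V3 judging call (assumed to return the JSON verdict described under Key Equations and Algorithms), and the cleanliness heuristics for mixed languages, code blocks, and long paragraphs are illustrative rather than the paper's exact rules.

```python
import json
import re
from typing import Callable

CODE_FENCE = chr(96) * 3  # the markdown code-fence marker


def looks_clean(response: str) -> bool:
    """Illustrative readability filters (not the paper's exact rules):
    reject mixed CJK/Latin text, code blocks inside the trace, and
    overly long paragraphs."""
    if re.search(r"[\u4e00-\u9fff]", response) and re.search(r"[A-Za-z]", response):
        return False  # mixed-language output
    if CODE_FENCE in response:
        return False  # code block inside the chain of thought
    if any(len(p) > 2000 for p in response.split("\n\n")):
        return False  # chaotic / overly long paragraph
    return True


def curate(prompts, sample_responses: Callable, judge_with_v3: Callable,
           n_samples: int = 4) -> list:
    """Keep only trajectories the generative reward model labels "correct"."""
    kept = []
    for prompt in prompts:
        for response in sample_responses(prompt, n=n_samples):
            if not looks_clean(response):
                continue
            # judge_with_v3 is assumed to return the JSON verdict
            # {"analysis": ..., "correctness": "correct" | "incorrect"}.
            verdict = json.loads(judge_with_v3(prompt, response))
            if verdict.get("correctness") == "correct":
                kept.append({"prompt": prompt, "response": response})
    return kept
```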
Key Equations and Algorithms
- Adversarial String Generation Algorithm: This procedure generates large input strings to test time complexity. It randomly selects a target length $n$ within the configured range (e.g., 90,000 to 100,000) and constructs a sequence of that length. For alternating characters, it samples two distinct characters $c_1$ and $c_2$ and constructs the sequence $c_1 c_2 c_1 c_2 \cdots$, ending in $c_2$ if $n$ is even or in $c_1$ if $n$ is odd. The final output includes the generated string together with an accompanying numeric parameter (a Python sketch of this generator appears at the end of the Key Concepts section above).
- Arithmetic Series Summation: In the reasoning trajectory examples, the sum of the odd integers is derived using standard arithmetic series properties. The sum is expressed as $1 + 3 + 5 + \cdots + (2n-1)$. The derivation shows this equals $\frac{n\,(1 + (2n-1))}{2}$, which simplifies to $n^2$. This identity is fundamental to solving the minimization problem involving vertical distances in the coordinate grid analogy.
- Lagrangian Minimization Function: To minimize $\sum_{k=1}^{n} \sqrt{(2k-1)^2 + a_k^2}$ subject to $\sum_{k=1}^{n} a_k = 17$, the Lagrangian function is formulated as $\mathcal{L} = \sum_{k=1}^{n} \sqrt{(2k-1)^2 + a_k^2} - \lambda \left( \sum_{k=1}^{n} a_k - 17 \right)$. Differentiating with respect to $a_k$ yields the condition $\frac{a_k}{\sqrt{(2k-1)^2 + a_k^2}} = \lambda$, establishing the relationship between the Lagrange multiplier and the variable structure.
- Optimal Variable Derivation: Solving the derivative condition leads to the expression $a_k = c\,(2k-1)$, where $c = \frac{\lambda}{\sqrt{1-\lambda^2}}$. This derivation relies on squaring the derivative equation to isolate $a_k^2$ and recognizing that $a_k$ is proportional to $(2k-1)$. The constant $c$ is determined by the constraint that the sum of all $a_k$ must equal 17, giving $c\,n^2 = 17$.
- Final Objective Function: Substituting the optimal $a_k$ back into the original expression yields $\sum_{k=1}^{n} (2k-1)\sqrt{1+c^2} = n^2 \sqrt{1+c^2} = \sqrt{n^4 + 289}$. This is derived by factoring out $\sqrt{1+c^2}$ and multiplying by the sum of vertical components $\sum_{k=1}^{n} (2k-1) = n^2$. The resulting expression is used to check for integer solutions, where $n^4 + 289$ must be a perfect square (a numeric check follows this list).
- Evaluation Prompt JSON Structure: The generative reward model outputs a structured JSON object defined by two keys: "analysis" and "correctness". The "analysis" key contains the detailed reasoning of the evaluation, while the "correctness" key is a string restricted to the set {"correct", "incorrect"}. This format ensures programmatic parsing of the quality assessment for the reward pipeline (a validation sketch follows this list).
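
The integer-solution check at the end of the derivation reduces to asking when $n^4 + 289$ is a perfect square (assuming the reconstructed objective $\sqrt{n^4 + 289}$ above). A few lines of Python verify this numerically:

```python
from math import isqrt

# Scan small n for which n**4 + 289 is a perfect square, i.e. the minimal
# objective sqrt(n**4 + 289) is an integer (under the reconstruction above).
for n in range(1, 100):
    value = n ** 4 + 289
    root = isqrt(value)
    if root * root == value:
        print(f"n = {n}: minimum value = {root}")
```

Under that reconstruction, the scan reports only $n = 12$, since $12^4 + 289 = 21025 = 145^2$.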
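Because the judge's verdict feeds an automated reward pipeline, its output is worth validating before use. The sketch below is a minimal, assumed validation step: `parse_verdict` and the example payload are invented for illustration, but the two keys and the allowed "correctness" values follow the JSON structure described above.

```python
import json

ALLOWED_CORRECTNESS = {"correct", "incorrect"}


def parse_verdict(raw: str) -> dict:
    """Parse the judge's JSON output and enforce the two-key schema."""
    verdict = json.loads(raw)
    if set(verdict) != {"analysis", "correctness"}:
        raise ValueError(f"unexpected keys: {sorted(verdict)}")
    if verdict["correctness"] not in ALLOWED_CORRECTNESS:
        raise ValueError(f"correctness must be one of {ALLOWED_CORRECTNESS}")
    return verdict


# Invented example payload, for illustration only.
example = json.dumps({
    "analysis": "The answer matches the reference solution and the "
                "reasoning steps are consistent with the conclusion.",
    "correctness": "correct",
})
print(parse_verdict(example)["correctness"])  # -> correct
```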
Key Claims and Findings
- The supervised fine-tuning dataset comprises approximately 800,000 samples, with a specific split of roughly 600,000 reasoning-related samples and 200,000 non-reasoning samples.
- Reasoning trajectories are filtered to remove mixed languages, long paragraphs, and code blocks to ensure consistency and readability in the training data.
- Domain distribution shows Math dominates with 395,285 samples, followed by Code with 211,129 samples, while STEM and Logic contribute smaller volumes of 10,124 and 10,395 samples respectively.
- The thinking process design explicitly requires the model to identify unstated user needs beneath the surface of the initial request to improve response accuracy.
- Human annotators verify the accuracy of the artificial reasoning traces, confirming that these traces enhance the model’s precision in interpreting user queries and format constraints.
- The majority of training data consists of single-turn interactions, which currently limits the development of multi-turn conversational capabilities in the resulting model.
- Non-reasoning tasks often adopt the DeepSeek-V3 pipeline, but simpler queries like greetings do not require a chain-of-thought reasoning step in the output.
Terminology
- SFT Data: Refers to the Supervised Fine-Tuning dataset used to train the model. In this context, it includes 800,745 entries categorized by domain (Math, Code, STEM, Logic, General), characterized by specific average token counts per sample.
- Generative Reward Model: A proxy model, specifically DeepSeek-V3, employed to judge the quality of answers by comparing them against reference solutions. It outputs a classification of "correct" or "incorrect" based on semantic alignment rather than exact string matching.
- Thinking Process: The internal reasoning trace generated by the model before providing a final response. It is constrained by principles of conciseness, conversational tone, and contextual understanding to align with user needs.
- Adversarial Inputs: Large data constructs (e.g., strings of alternating characters) designed specifically to cause time-limit exceeded errors in inefficient code implementations. These inputs test the time complexity of candidate solutions.
- Rejection Sampling: The methodology used to curate reasoning data by generating multiple responses and discarding those identified as incorrect by the generative reward model. This ensures only high-quality trajectories enter the training set.
- Chain-of-Thought (CoT): A prompting technique that forces the model to output intermediate reasoning steps (e.g., "Step 1", "Step 2") alongside the final solution. In evaluation prompts, the CoT is used to generate an "analysis" of correctness in JSON format.