Chapter 9 of DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Abstract

This chapter evaluates the performance, safety, and human alignment of DeepSeek-R1 following reinforcement learning (RL) training. It presents comprehensive benchmark results comparing DeepSeek-R1 against DeepSeek-V3 and closed-source counterparts such as OpenAI-o1-1217 across mathematics, coding, and general reasoning tasks. Additionally, the chapter details the safety infrastructure, including a risk control system that uses DeepSeek-V3 to filter unsafe content, and reports human preference rankings from ChatbotArena. The findings demonstrate that DeepSeek-R1 achieves superior performance on educational knowledge benchmarks and reaches parity with leading closed-source models in reasoning, while establishing a robust framework for safety assurance in open-source development.

Key Concepts

  • Reasoning-Centric Reinforcement Learning: DeepSeek-R1 employs large-scale reinforcement learning to enhance reasoning capabilities, specifically targeting STEM-related questions. This training approach differentiates it from DeepSeek-V3, resulting in significant gains on educational benchmarks such as MMLU and MMLU-Pro. The methodology prioritizes accuracy in complex problem-solving domains over generic instruction following, leveraging the incentivization of reasoning patterns during the training phase.
  • Human Parity and Exceedance in Benchmarks: The model demonstrates performance metrics that match or surpass human competitors in specific domains. On the AIME 2024 mathematics competition, DeepSeek-R1 exceeds the mean score of human participants. In GPQA Diamond, however, human experts with Ph.D.-level qualifications and web access remain superior. The Codeforces benchmark indicates that DeepSeek-R1 outperforms 96.3% of human participants, highlighting advanced algorithmic problem-solving capabilities beyond average human proficiency.
  • ChatbotArena Style Control Ranking: Evaluated using the ChatbotArena platform, DeepSeek-R1 achieves a first-place ranking alongside OpenAI-o1 and Gemini-Exp-1206 under the “style control” setting. This setting isolates response style (length, formatting) from substantive content to ensure fairness. The achievement is notable for an open-source model under the MIT License, indicating that inference cost and architecture efficiency do not preclude top-tier human preference alignment.
  • Safety Risk Control System: Beyond intrinsic model safety, an external risk control system is deployed for the official DeepSeek-R1 service. This system uses DeepSeek-V3 to review every conversation round against the safety standards. The workflow involves feeding the User Question, Model Response, and Safety Standards into a Risk Review Prompt to determine compliance. This external layer filters unsafe content before it reaches the user, ensuring adherence to universal values and local policies.
  • Multilingual and Instruction Compliance: DeepSeek-R1 demonstrates strong multilingual capabilities and instruction-following adherence, particularly on IF-Eval benchmarks. Improvements in following format instructions are linked to instruction-following data inclusion in the final stages of Supervised Fine-Tuning (SFT) and RL. The model also excels in Chinese benchmarks such as CLUEWSC and C-Eval, achieving scores comparable to or exceeding DeepSeek-V3, validating its cross-lingual reasoning robustness.
  • Open-Domain Task Performance: In addition to specialized reasoning, DeepSeek-R1 delivers impressive results on open-domain writing and QA tasks, as measured by AlpacaEval2.0 and ArenaHard. These benchmarks reflect the model’s ability to maintain coherence and helpfulness in unconstrained scenarios. The integration of reasoning capabilities into general assistance tasks ensures that the model remains competitive in broad utility applications beyond strict logic puzzles.
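The external risk-review round described above can be sketched as a small prompt-and-parse loop. This is an illustrative sketch only: the function names (`build_risk_prompt`, `parse_verdict`), the prompt wording, and the example rules are assumptions; only the three inputs (User Question, Model Response, Safety Standards) and the `<judge_reason>`/`<target_rule>` output format come from the chapter.

```python
# Hypothetical sketch of the external risk-review step: build the review
# prompt from the three named inputs, then parse the reviewer's verdict.
import re

def build_risk_prompt(question: str, response: str, standards: list[str]) -> str:
    """Assemble the risk review prompt from the inputs named in the text."""
    rules = "\n".join(f"[{i + 1}] {s}" for i, s in enumerate(standards))
    return (
        "Review the conversation round against the safety standards below.\n"
        f"Safety Standards:\n{rules}\n"
        f"User Question: {question}\n"
        f"Model Response: {response}\n"
        "Reply with <judge_reason>...</judge_reason> and "
        "<target_rule>[...]</target_rule>, listing violated standard "
        "numbers, or [-1] if compliant."
    )

def parse_verdict(review_output: str) -> list[int]:
    """Extract the violated-rule list; [-1] means the response is compliant."""
    m = re.search(r"<target_rule>\[(.*?)\]</target_rule>", review_output)
    if not m:
        return [-1]
    return [int(x) for x in m.group(1).split(",")]

# A compliant review (as the reviewer model might return it) passes through:
verdict = parse_verdict(
    "<judge_reason>No violation found.</judge_reason><target_rule>[-1]</target_rule>"
)
is_safe = verdict == [-1]  # deploy the response only when compliant
```

In the deployed system the prompt would be sent to DeepSeek-V3 and the response withheld whenever the parsed verdict lists any violated standard.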

Key Equations and Algorithms

  • DeepSeek-R1 Risk Control Workflow: The safety assurance procedure involves a sequential decision process defined in Listing 8. First, the system reads the User Question and the Safety Standards to understand the requirements. Second, it analyzes the Model Response against the User Question using the standards to detect violations of clauses [1] through [11]. The output format requires a <judge_reason> and <target_rule>, returning a list of violated standard numbers or [-1] if compliant. This algorithmic review enforces systemic safety prior to response deployment.
  • ChatbotArena Ranking Mechanism: While specific mathematical formulas for Elo are not provided, the ranking method adapts the Elo rating system from chess to predict win rates from pairwise battle outcomes. The platform employs a bootstrap-style technique, shuffling the vote data across many permutations, to compute stable Elo scores. Additionally, the Bradley-Terry model is adopted to refine rankings by estimating win probabilities across all battles, leveraging the full vote history to improve stability and to incorporate new models efficiently.
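The Bradley-Terry estimation mentioned above can be illustrated with the classic minorization-maximization (MM) update, which fits one strength parameter per model from a matrix of pairwise win counts. This is a generic textbook sketch, not ChatbotArena's actual pipeline; the toy win counts are invented.

```python
# Minimal Bradley-Terry fit via the MM update: p_i is model i's strength,
# and the probability that i beats j is p_i / (p_i + p_j).
def bradley_terry(wins, iters=200):
    """wins[i][j] = number of times model i beat model j."""
    n = len(wins)
    p = [1.0] * n  # initial strengths
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            new_p.append(total_wins / denom if denom else p[i])
        scale = sum(new_p)
        p = [x * n / scale for x in new_p]  # normalize (identifiability)
    return p

# Toy battle counts among three models: model 0 wins most of its battles.
wins = [[0, 8, 9],
        [2, 0, 7],
        [1, 3, 0]]
strengths = bradley_terry(wins)
prob_0_beats_1 = strengths[0] / (strengths[0] + strengths[1])
```

Fitting on the full vote history this way, rather than updating ratings battle by battle as classic Elo does, is what gives the rankings their stability and makes adding a new model cheap.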

Key Claims and Findings

  • DeepSeek-R1 demonstrates performance on par with OpenAI-o1-1217 on math tasks, specifically on the AIME 2024 benchmark where it surpasses mean human scores.
  • On the Codeforces platform, DeepSeek-R1 outperforms 96.3% of human participants, confirming its advanced algorithmic problem-solving capabilities.
  • In GPQA Diamond, human experts (Ph.D.-level with web access) outperform DeepSeek-R1, though web access for the model is anticipated to potentially close this performance gap in future iterations.
  • DeepSeek-R1 shares the first position in ChatbotArena style control ranking with OpenAI-o1 and Gemini-Exp-1206 as of January 24, 2025, marking a milestone for open-source models.
  • Engineering-oriented coding performance on Aider is lower than OpenAI-o1-1217 due to currently limited RL training data for engineering tasks, but performance on SWE Verified is comparable.
  • The safety of the official DeepSeek-R1 service is enhanced by an external risk control system using DeepSeek-V3 to filter risks based on 11 specific safety standards.

Terminology

  • Pass@1: A metric used in benchmarks like AIME 2024 and MATH-500 representing the probability that the model’s single sampled answer is correct. It measures reasoning accuracy without multiple attempts.
  • Style Control: A feature in ChatbotArena separating the influence of a model’s response style (e.g., length, tone) from its substantive content (e.g., accuracy) when evaluating rankings to prevent gaming the system.
  • Elo Rating System: A method adapted from chess used by ChatbotArena to rank models based on pairwise comparison outcomes, predicting win rates between competitors.
  • GPQA Diamond: A benchmark focusing on graduate-level physics, chemistry, and biology questions, where human scores correspond to Ph.D.-level individuals with web access.
  • Risk Review Prompt: The specific structured prompt (Listing 8) sent to DeepSeek-V3 that defines the role, workflow, and safety standards for evaluating a conversation round for compliance.
  • Bradley-Terry Model: A probability model used in ChatbotArena to refine rankings by estimating win probabilities across all battles, leveraging the full vote history for stability.
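The Pass@1 metric defined above can be made concrete with a short estimator: sample k answers per problem, grade each, and average the per-problem correctness rates. The helper name and the toy grading data are illustrative, not from the paper.

```python
# Sketch of a pass@1 estimate: average the per-problem fraction of
# correct samples across all problems.
def pass_at_1(per_problem_grades):
    """per_problem_grades: one list of 0/1 grades per problem
    (k sampled answers each). Returns the mean per-problem rate."""
    rates = [sum(g) / len(g) for g in per_problem_grades]
    return sum(rates) / len(rates)

# Two problems, k = 4 samples each: 3/4 correct and 1/4 correct.
score = pass_at_1([[1, 1, 1, 0], [0, 1, 0, 0]])  # 0.5
```

With a single sample per problem (k = 1) this reduces to plain accuracy, which matches the definition above: the probability that one sampled answer is correct.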