Chapter 11 of DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Abstract

This chapter establishes the comprehensive safety evaluation framework for DeepSeek-R1, analyzing its robustness against harmful content, multilingual disparities, and adversarial jailbreaking attacks. The central technical contribution is the quantification of safety metrics, specifically the “Unsafe” and “Rejection” rates, under varying risk control configurations to demonstrate the efficacy of Reinforcement Learning with Human Feedback (RLHF) and safety alignment strategies. The evaluation confirms that while DeepSeek-R1 achieves state-of-the-art multilingual safety scores with risk control, it relies heavily on system-level filtering to mitigate jailbreak vulnerabilities. These findings are critical for understanding the safety-performance trade-offs in reasoning-oriented large language models.

Key Concepts

  • Safety Evaluation Metrics: The chapter defines two primary quantitative indicators for model safety: the Unsafe metric, representing the proportion of unsafe responses among all answers, and the Reject metric, representing the proportion of rejection responses. Lower values are preferred for both, though a preference is expressed for informative safe responses over rejections to ensure useful risk warnings. These metrics are calculated across fine-grained categories including Discrimination, Illegal, Harmful, and Ethical scenarios.
  • Risk Control System: A system-level intervention (introduced in Section D.3.1) applied post-generation to filter outputs. The evaluation compares model performance with and without this system, revealing that base models exhibit significantly higher unsafe rates (e.g., DeepSeek-R1 without control exceeds 20% unsafe rate) while controlled versions achieve lower unsafe rates at the cost of higher rejection rates (e.g., ~25% rejection).
  • Multilingual Safety Performance: An assessment extending beyond bilingual (Chinese/English) testing to 50 diverse languages. The evaluation utilizes a safety scoring system where safe responses receive 5 points, rejections 4 points, and unsafe responses 0 points. This concept highlights the necessity of verifying safety alignment across low-resource and high-resource languages to prevent language-specific vulnerabilities.
  • Jailbreaking Robustness: The resistance of the model to adversarial prompts designed to circumvent safety alignment. The chapter constructs a test suite of 2,232 jailbreaking instructions concatenated with unsafe questions. This concept measures the “GAP” between the model’s performance on original unsafe questions versus those modified with jailbreak elements.
  • LLM-as-a-Judge Evaluation: A methodology employed to determine safety labels for question-answer pairs without human intervention. This approach ensures consistency with human assessments (consistency rate > 95%) and allows for scalable evaluation across thousands of queries in multiple languages.
  • Reasoning vs. Safety Trade-offs: The observation that reasoning models (DeepSeek-R1, o1) rely more heavily on risk control systems for security checks compared to non-reasoning models. This results in considerably higher rejection rates specifically during attack scenarios (e.g., 87.3% rejection for R1 under jailbreak), suggesting a design preference for safety over helpfulness in adversarial contexts.

Key Equations and Algorithms

  • Safety Score Calculation: To evaluate multilingual performance, a weighted scoring system is defined as , where denotes the count of responses in each category. This formula prioritizes informative safe content while acknowledging rejections as safer than harmful outputs, normalized as a percentage of the total possible safety score.
  • Unsafe Rate Definition: The probability of generating harmful content is expressed as , where is the total number of queries processed. This metric serves as the primary indicator of model safety, with lower values indicating better safety performance across the tested scenarios.
  • Rejection Rate Definition: The tendency of the system to decline a query is formalized as . In the context of this chapter, a lower rejection rate is generally desirable to maintain user experience, provided it does not compromise the Unsafe Rate.
  • Jailbreak Performance Gap: The degradation in safety under adversarial conditions is quantified by the equation . This value indicates the robustness of the model, where a positive GAP signifies an increased vulnerability when exposed to jailbreaking instructions compared to standard unsafe queries.
  • Multilingual Dataset Construction Algorithm: The procedure involves translating the original bilingual safety testset into 50 languages using a combined LLM translation and human-assisted calibration approach. For high-frequency languages, the dataset is fully translated, while low-frequency languages undergo sampling translation, resulting in a comprehensive test set of 9,330 questions.

Key Claims and Findings

  • DeepSeek-R1 Multilingual Safety Parity: With the risk control system in place, DeepSeek-R1 achieves a safety score of 85.9% across 50 languages, which approaches the best-performing Claude-3.7-Sonnet score of 88.3%, demonstrating state-of-the-art system-level multilingual safety capabilities.
  • Impact of Risk Control on Base Models: Without the risk control system, DeepSeek-R1 exhibits an unsafe rate beyond 20%, classifying it as a relatively unsafe model, whereas the same architecture with risk control reduces the overall unsafe rate to approximately 8.5% in Table 10, highlighting the critical dependency on external safety filters.
  • Reasoning Model Rejection Behavior: Reasoning models such as DeepSeek-R1 and o1 (2024-12-17) demonstrate a higher reliance on risk control for security, resulting in rejection rates of 87.3% and 79.8% respectively during jailbreak attacks, significantly higher than non-reasoning model counterparts.
  • Jailbreak Vulnerability Across Architectures: All tested models, including frontier closed-source models like GPT-4o and Claude-3.7-Sonnet, exhibit significantly increased unsafe response rates when facing jailbreak attacks, with Claude-3.7-Sonnet showing a 33.8% decrease in the proportion of safe responses under attack conditions.
  • Open-Source Model Security Challenges: Open-source models like DeepSeek and Qwen face more severe jailbreak security challenges than closed-source models when deployed locally without the vendor’s risk control system, necessitating the adoption of comparable external risk control measures for service providers.
  • Category-Specific Safety Performances: DeepSeek-R1 performs exceptionally well in handling queries related to Illegal and Moral/Ethical Issues, but shows average performance in Discrimination and Harmful Behavior scenarios, identifying these categories as priority areas for future safety feature development.
  • High-Risk Language Identification: Among the 50 languages evaluated, DeepSeek-R1 (without risk control) and Claude-3.7-Sonnet possess zero high-risk languages (score < 60), distinguishing them from DeepSeek-V3 and GPT-4o which exhibit one and two high-risk languages respectively in similar configurations.

Terminology

  • Unsafe: A classification metric representing the proportion of model responses that contain harmful, illegal, or ethically problematic content; in the context of this chapter, lower values indicate superior safety performance.
  • Reject (Rejection Rate): The proportion of queries for which the model declines to provide an answer, typically employed as a safety mechanism to avoid generating harmful content; lower values are generally preferred to maintain informativeness.
  • Risk Control System: A system-level filter or post-processing layer (introduced in D.3.1) applied to the model’s output to detect and block unsafe content before it reaches the user, distinct from the model’s internal alignment.
  • Jailbreaking: A technique employed by malicious users to circumvent a model’s safety alignment protocols and elicit harmful responses, often involving the insertion of specific adversarial instructions or prompt engineering templates.
  • LLM-as-a-Judge: An evaluation methodology where a large language model is tasked with assigning safety labels (safe, unsafe, or rejected) to question-answer pairs, validated against human assessments to ensure high consistency (>95%).
  • High-Risk Languages: A categorization within the multilingual evaluation defined as languages where the model’s safety score falls below 60 points, indicating a significant vulnerability to generating unsafe content in that linguistic context.
  • Fine-Grained Safety Scenarios: Specific subcategories of safety evaluation including Discrimination, Illegal, Harmful, and Ethical issues, used to provide nuanced analysis beyond a binary safe/unsafe classification.
  • Safety Score Proportion: A normalized metric expressed as a percentage of the total possible safety score, calculated using the 5/4/0 point system for safe/reject/unsafe responses respectively across different language groups.