Chapter 10 of DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Abstract

This chapter details the comprehensive safety evaluation framework and risk control mechanisms deployed for DeepSeek-R1, emphasizing the integration of keyword-based filtering and model-based risk review systems. It presents empirical results across six standard safety benchmarks and a newly constructed in-house taxonomy comprising 28 subcategories across four major safety domains. The central argument establishes that while DeepSeek-R1 achieves safety performance comparable to frontier models on general benchmarks, specific vulnerabilities regarding intellectual property rights remain a critical area for mitigation. This analysis is vital for the book’s progression as it contextualizes the model’s reasoning capabilities within the necessary constraints of ethical alignment and service safety.

Key Concepts

Keyword-Based Risk Flagging This initial filtering layer automatically matches incoming queries against a predefined keyword list containing terms common in ethical and safety scenarios to ensure comprehensive coverage of potential safety issues. Conversations matching these keywords are immediately flagged as potentially unsafe dialogues, initiating a secondary review process to balance efficiency with effectiveness. The system relies on this deterministic approach to catch high-risk content before it reaches the more computationally expensive model-based review stage, serving as the first line of defense against harmful inputs.

Model-Based Risk Review Flagged dialogues are concatenated with a preset risk review prompt and sent to the DeepSeek-V3 model to determine if the dialogue should be retracted based on specific risk review results. The risk review prompt is meticulously designed to effectively cover various safety scenarios and maintain good scalability across different types of potentially unsafe content. This mechanism allows the system to leverage the reasoning capabilities of a larger model to make nuanced safety decisions that simple keyword matching cannot achieve, significantly improving service safety.

Standard Safety Benchmarks The evaluation utilizes six publicly available datasets including Simple Safety Tests (SST), Bias Benchmark for QA (BBQ), Anthropic Red Team (ART), XSTest, Do-Not-Answer (DNA), and HarmBench. These datasets cover diverse aspects such as illegal items, physical harm, scams, discrimination, hate speech, and excessive safety constraints. By using a broad scope of security-related topics, the authors ensure a comprehensive evaluation that prevents the model from excelling in one area while failing in others.

In-House Safety Taxonomy The authors constructed an internal safety evaluation dataset to monitor the overall safety level of the model using unified taxonomic standards covering various safety and ethical scenarios. This taxonomy categorizes potential content safety challenges into four major categories and 28 subcategories, including discrimination, illegal behavior, harmful behavior, and moral issues. The dataset possesses good extensibility, allowing subsequent evaluations on multilingual and jailbreak attacks to be based on extensions of this foundational structure.

LLM-as-a-Judge Evaluation The evaluation methodology employs an advanced GPT version, specifically GPT4o (2024-11-20), to determine safety labels for the constructed test sets. This automated approach allows for the systematic assessment of 1,120 test questions across different safety dimensions without human intervention. The process categorizes each QA pair into Unsafe, Safe, or Rejection classes, providing a scalable method for quantitative safety assessment.

Safety Labeling Protocol Responses are classified into three distinct classes: Unsafe if the model fails to meet ethical standards, Safe if it acknowledges risks and offers cautions, and Rejection if the response is an irrelevant refusal. This protocol ensures that relevant refusals or helpful safety warnings are distinguished from harmful completions. By uniformly categorizing refusals triggered by the risk control system as safe responses, the evaluation reflects the effective protection offered by the deployment architecture.

Key Equations and Algorithms

Algorithm 1: Keyword Risk Flagging The procedure involves matching the input query against a predefined keyword list containing commonly used terms in ethical and safety scenarios. If a match is found, the system flags the conversation as potentially unsafe; if no match is found, the query proceeds to the model-based review or standard processing. This algorithm operates with time complexity relative to the query length assuming a hash-based lookup for the keyword list.

Algorithm 2: Risk Review Determination Potentially unsafe dialogues are concatenated with a preset risk review prompt and sent to the DeepSeek-V3 model. The model outputs a classification determining whether the dialogue should be retracted. This step adds computational overhead proportional to the model size but ensures high-fidelity safety decisions for flagged content.

Algorithm 3: In-House Taxonomy Construction For each of the 28 subcategories in the safety taxonomy, 20 Chinese test questions are manually created to cover important concepts and risk points. These questions are translated into English versions to assess safety performance in both languages. The final dataset aggregates test questions.

Algorithm 4: Safety Benchmark Scoring For Do-Not-Answer and HarmBench, results are reproduced based on the official evaluation methodology, while other results are obtained from the HELM platform. During reproduction, if a model fails to refuse a request (e.g., generating lyrics for ‘What a Wonderful World’), it is classified as unsafe. Refusals triggered by the internal risk control system are uniformly categorized as safe responses in the final score calculation.

Algorithm 5: Evaluation Label Assignment Each QA pair is assessed by the LLM-as-a-Judge to assign one of three labels: Unsafe (clear negative sample), Safe (acknowledges risks), or Rejection (irrelevant refusal). This algorithmic assignment ensures consistency in scoring across the 1,120 in-house questions. The label determines the contribution of the specific query to the overall safety score percentage.

Key Claims and Findings

The DeepSeek-R1 model achieves comparable safety performance with other frontier models across different standard benchmarks when excluding the intellectual property specific failures on HarmBench. Table 9 indicates that DeepSeek-R1 achieves a safety score of 97.5% on SST, 96.2% on BBQ, and 96.3% on ART, demonstrating robust performance in general safety categories. However, a significant performance gap exists on the HarmBench benchmark where R1 scores 89.3% compared to 95.3% for GPT-4o, primarily due to failures in refusing queries related to generating copyrighted lyrics.

The addition of a risk control system significantly improves the overall safety of services, particularly against dangerous tactics such as jailbreak attacks. Without this system, the risk control system’s influence on the safety score is notable, as indicated by the parentheses in Table 9 where pure model scores (e.g., 67.0 for HarmBench) drop significantly compared to the deployed system scores. The in-house safety taxonomy successfully categorizes potential content safety challenges into 4 major categories and 28 subcategories, enabling quantitative safety assessments for different safety scenarios.

The risk control system recommends implementation by developers deploying DeepSeek-R1 for services to mitigate ethical and safety concerns associated with the model. Developers can achieve more flexible security protection by customizing safety standards within the risk review pipelines. The study observes that using relatively smaller models like LLaMA-2-13B for HarmBench evaluation led to unreliable outcomes, necessitating the use of more advanced models like GPT4o for scoring reliability.

Consistent with the findings on discrimination and bias, violence and extremism, privacy violations, and illegal behavior, R1 shows strong safety measures across these benchmarks. The safety taxonomy addresses discrimination based on personal physical attributes such as age, gender, sexual orientation, appearance, and health status. It also encompasses social attribute discrimination including nationality, ethnicity, religion, economic status, educational background, and family background.

Terminology

SST (Simple Safety Tests): A benchmark covering evaluations in five categories: Illegal Items, Physical Harm, Scams & Fraud, Child Abuse, and Suicide, Self-Harm & Eating Disorders.

BBQ (Bias Benchmark for QA): A benchmark evaluating performance in conversations involving discriminatory biases across dimensions such as age, disability, gender, and race.

ART (Anthropic Red Team): A benchmark consisting of data collected during Red Team attacks covering discrimination, hate speech, violence, nonviolent unethical behavior, and harassment.

XSTest: A benchmark examining security vulnerabilities and the risk of excessive safety constraints to ensure the model does not refuse legitimate questions.

DNA (Do-Not-Answer): A benchmark designed around dangerous instructions covering twelve categories of harm and 61 specific risk types such as racial discrimination and medical advice.

HarmBench: A benchmark structured around standard model safety, copyright-related safety, context-aware safety, and multimodal safety capabilities with automated red-teaming.

LLM-as-a-Judge: An evaluation methodology utilizing an advanced language model to determine safety labels and assess model performance automatically.

Risk Review Prompt: A preset prompt designed to cover various safety scenarios and maintain good scalability when concatenated with potentially unsafe dialogues for model analysis.

Jailbreak Attacks: Dangerous tactics or queries designed to bypass safety constraints, against which the risk control system is specifically tested and shown to improve.

Safety Taxonomy: A hierarchical classification system organizing safety challenges into four major categories and 28 subcategories for systematic evaluation.

Rejection: A classification in the safety labeling protocol indicating a response that is an irrelevant refusal, distinct from providing harmful content.

HELM: An independent third-party evaluation platform (crfm.stanford.edu/helm) from which results for several safety benchmarks were obtained for comparison.