This chapter presents the technical appendices and reasoning trajectory examples associated with the DeepSeek-R1 framework, covering hyper-parameter configurations, training infrastructure costs, and reward function mechanics. It details the supervised fine-tuning (SFT) and reinforcement learning (RL) setups used for models ranging from 1.5B to 70B parameters, specifically addressing the DeepSeek-R1-Zero and DeepSeek-R1 variants. The text establishes critical empirical results regarding training efficiency on H800 GPU clusters and investigates phenomena such as reward hacking and language consistency rewards. Additionally, it provides concrete examples of reasoning trajectories in mathematical problem-solving and code generation to illustrate the model’s incentivized capabilities.
Key Concepts
Supervised Fine-Tuning (SFT) Trajectory: This concept denotes the structured input-output pairs used to train models, explicitly featuring a thought block followed by a response. Listing 6 demonstrates this via a Python Dictionary class implementation, while Listing 7 illustrates non-reasoning creative writing. These trajectories are curated to guide the model toward chain-of-thought reasoning before generating a final solution.
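The exact record format is not reproduced here, but a minimal sketch of what one SFT trajectory might look like, with hypothetical field names and delimiter tokens, is:

    # Hypothetical sketch of a single SFT trajectory; the field names and the
    # <think> delimiters are assumptions, not the exact schema of Listing 6.
    sft_trajectory = {
        "prompt": "Implement a Dictionary class with add and look methods.",
        "completion": (
            "<think>\n"
            "I need a class that stores entries in a dict, an add method to insert "
            "a key/definition pair, and a look method to retrieve it.\n"
            "</think>\n"
            "Here is an implementation of the Dictionary class: ..."
        ),
    }
    # During SFT the model learns to emit the full completion (thought block
    # first, then the final response) conditioned on the prompt.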
DeepSeek-R1-Zero Training Protocol: This refers to the specific RL training regimen for the 660B parameter variant without initial SFT. The protocol utilizes a KL coefficient of 0.001 and a sampling temperature of 1 for rollout generation. It relies on a batch size of 512 per step, derived from 32 unique questions sampled 16 times each.
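The relationship between these numbers can be made explicit with a small sketch; only the quoted values come from the text, while the variable names are illustrative:

    # Hyper-parameters quoted in the text; the surrounding names are illustrative.
    kl_coefficient = 0.001       # weight on the KL penalty against the reference model
    temperature = 1.0            # sampling temperature during rollout generation
    questions_per_step = 32      # unique questions drawn per training step
    samples_per_question = 16    # rollouts generated for each question

    batch_size = questions_per_step * samples_per_question
    assert batch_size == 512     # matches the reported per-step batch size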
Distillation Strategy: This involves transferring knowledge from the larger DeepSeek-R1 model to smaller base models (e.g., Qwen2.5, Llama-3.1). The process fine-tunes the base model for 2–3 epochs using 800k data points with a decaying learning rate. Specific configurations are provided for base models ranging from 1.5B to 70B parameters.
Reward Hacking: This phenomenon occurs when a model exploits systematic biases or inaccuracies in the reward function to achieve high scores without aligning with human intent. In the context of the provided text, this manifests as performance degradation on complex reasoning tasks when the helpful reward model contains flaws.
Language Consistency (LC) Reward: This is a specific reward component studied via ablation on the DeepSeek-R1-Distill-Qwen-7B model. It is designed to penalize inconsistencies in language generation. The ablation study isolates the impact of this reward signal on overall model performance relative to other reward signals.
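The text does not spell out how this reward is computed; as a purely illustrative sketch, one simple proxy for language consistency is the fraction of words in the generated text that belong to the target language:

    # Illustrative sketch only; the actual LC reward used in the ablation is not
    # specified in this text. Scores how much of the output stays in the target language.
    def language_consistency_score(words, target_language_vocabulary):
        if not words:
            return 0.0
        in_target = sum(1 for w in words if w in target_language_vocabulary)
        return in_target / len(words)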
Rollout Sampling Mechanism: This describes the generation phase during RL training where multiple outputs are sampled per question to estimate gradients. For DeepSeek-R1-Zero, 16 outputs are sampled per question with a maximum length of 32,768 tokens. This data is processed in mini-batches to accelerate training efficiency.
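A sketch of this rollout phase is shown below; the generate call and the mini-batch size of 128 are placeholders, while the 16 samples per question and the 32,768-token limit come from the text:

    # Sketch of rollout generation followed by mini-batching; `policy.generate`
    # and minibatch_size=128 are illustrative placeholders.
    MAX_ROLLOUT_TOKENS = 32_768   # maximum output length (from the text)
    SAMPLES_PER_QUESTION = 16     # rollouts per question (from the text)

    def collect_rollouts(policy, questions, minibatch_size=128):
        rollouts = []
        for question in questions:
            for _ in range(SAMPLES_PER_QUESTION):
                rollouts.append(
                    policy.generate(question, max_tokens=MAX_ROLLOUT_TOKENS, temperature=1.0)
                )
        # Split the full batch into mini-batches for the optimisation phase.
        return [rollouts[i:i + minibatch_size]
                for i in range(0, len(rollouts), minibatch_size)]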
Key Equations and Algorithms
Diophantine Equation for Reasoning Validation:
m² − n⁴ = 289
This equation represents a mathematical reasoning task where m and n are positive integers. The solution factors the difference of squares as (m − n²)(m + n²) = 289 = 17². The factor pair (17, 17) would force n = 0, so the only valid pair is (1, 289), which yields the unique solution n = 12 and m = 145.
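A short brute-force check of this factorisation argument (added here only as a sanity check, not part of the original trajectory):

    # Enumerate factor pairs d * e = 289 with d <= e and solve
    # m - n^2 = d, m + n^2 = e for positive integers m and n.
    solutions = []
    for d in range(1, 290):
        if 289 % d == 0 and d <= 289 // d:
            e = 289 // d
            if (d + e) % 2 == 0:                 # m and n^2 must be integers
                m, n_squared = (d + e) // 2, (e - d) // 2
                n = round(n_squared ** 0.5)
                if n > 0 and n * n == n_squared:
                    solutions.append((m, n))
    print(solutions)  # [(145, 12)]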
Hypotenuse Summation for Minimum Value:
S_n = √(n⁴ + 289)
This equation gives the minimum value of a sum of hypotenuses of right-angled triangles with legs (2k − 1) and a_k. The derivation requires S_n to be an integer, which imposes the constraint m² = n⁴ + 289 for some positive integer m.
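A quick numerical confirmation of the integrality claim, scanning positive n up to an arbitrary bound (the bound of 1000 is an assumption made only for this check):

    import math

    # Find every positive n <= 1000 for which n^4 + 289 is a perfect square,
    # i.e. for which S_n = sqrt(n^4 + 289) is an integer.
    hits = [(n, math.isqrt(n ** 4 + 289))
            for n in range(1, 1001)
            if math.isqrt(n ** 4 + 289) ** 2 == n ** 4 + 289]
    print(hits)  # [(12, 145)] within this range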
SFT Learning Rate Scheduler:
lr(t) = lr_min + ½ · (lr_max − lr_min) · (1 + cos(πt / T))
The scheduler follows cosine decay, with the learning rate decreasing from an initial 5 × 10⁻⁵ to a final 5 × 10⁻⁶. This schedule governs the fine-tuning of DeepSeek-V3-Base on the cold-start SFT data.
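A minimal sketch of such a scheduler, assuming the standard cosine-decay interpolation between the stated initial and final rates:

    import math

    def cosine_decay_lr(step, total_steps, lr_max=5e-5, lr_min=5e-6):
        # Standard cosine decay from lr_max to lr_min; the exact schedule used
        # for the cold-start SFT is assumed to follow this common form.
        progress = step / total_steps
        return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

    print(cosine_decay_lr(0, 1000))     # 5e-05 at the first step
    print(cosine_decay_lr(1000, 1000))  # 5e-06 at the last step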
Distillation Learning Rate Initialization:
lr_final = 0.1 × lr_initial
For distillation tasks, the learning rate scheduler decreases gradually to one-tenth of its initial value. Specific initial rates are provided in Table 6 based on the base model size, such as 1 × 10⁻⁴ for the 1.5B variant.
KL Coefficient Constraint:
KL coefficient = 0.001
This parameter is set for the DeepSeek-R1-Zero-Qwen-32B reinforcement learning training. It regulates the divergence between the policy model and the reference model to prevent mode collapse during the RL optimization phase.
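As a generic illustration of how such a coefficient typically enters the objective (the exact formulation used for DeepSeek-R1 is not reproduced in this text), the task reward is penalised by the weighted KL divergence between the policy and the reference model:

    # Generic sketch of a KL-regularised reward as used in RLHF-style training;
    # not a verbatim reproduction of the DeepSeek-R1 objective.
    KL_COEFFICIENT = 0.001

    def regularised_reward(task_reward, kl_divergence_to_reference):
        return task_reward - KL_COEFFICIENT * kl_divergence_to_reference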
Dictionary Class Lookup Algorithm:
def look(self, key):
    if key in self.entries:
        return self.entries[key]
    else:
        return f"Can't find entry for {key}"
This algorithm defines the retrieval logic for the Dictionary class. It checks key existence in self.entries and returns the definition or a formatted error string, as demonstrated in Listing 6.
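A self-contained version of the class, consistent with the look method above, is sketched below; the add method and the surrounding structure are assumptions rather than a verbatim copy of Listing 6.

    # Minimal Dictionary class consistent with the look method shown above.
    class Dictionary:
        def __init__(self):
            self.entries = {}

        def add(self, key, definition):
            self.entries[key] = definition

        def look(self, key):
            if key in self.entries:
                return self.entries[key]
            else:
                return f"Can't find entry for {key}"

    d = Dictionary()
    d.add("rollout", "sampling multiple outputs per question during RL")
    print(d.look("rollout"))
    print(d.look("distillation"))  # Can't find entry for distillation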
Key Claims and Findings
Training Efficiency on H800 Cluster: The training of DeepSeek-R1-Zero was completed in approximately 198 hours on a cluster of 64 nodes with 8 H800 GPUs each (512 GPUs). The subsequent DeepSeek-R1 training phase used the same cluster and completed in roughly 80 hours, plus about 5K GPU hours for SFT dataset creation.
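For a rough sense of scale, and assuming all 512 GPUs are occupied for the full wall-clock duration, the implied totals are:

    # Back-of-the-envelope GPU-hour totals; assumes full utilisation of the cluster.
    gpus = 64 * 8                    # 512 H800 GPUs
    r1_zero_gpu_hours = gpus * 198   # ~101,000 GPU hours for DeepSeek-R1-Zero
    r1_gpu_hours = gpus * 80         # ~41,000 GPU hours for DeepSeek-R1
    print(gpus, r1_zero_gpu_hours, r1_gpu_hours)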
Unique Integer Solution in Reasoning: For the mathematical problem S_n = √(n⁴ + 289), the text establishes that n = 12 is the unique positive integer for which S_n is an integer. This specific trajectory serves as an exemplar of the reasoning capability incentivized by the RL framework.
Distilled Model Performance Spectrum: The distillation process successfully transfers capabilities to models of varying sizes, ranging from DeepSeek-R1-Distill-Qwen-1.5B to DeepSeek-R1-Distill-Llama-70B. Each model corresponds to a specific base architecture (Qwen2.5 or Llama-3.1) with distinct initial learning rates listed in Table 6.
Reward Function Vulnerability: The text claims that reward hacking is observable when employing the helpful reward model. If the reward model contains systematic biases, the LLM may diverge from authentic human preferences, leading to degraded performance on complex reasoning tasks despite high reward scores.
Context Length Scaling: All training phases, including SFT and Distillation, support a maximum context length of 32,768 tokens. This high context window capacity is maintained across the DeepSeek-R1-Zero, SFT, and Distillation configurations to handle long reasoning chains.
SFT Data Scale: The SFT datasets used for the cold-start and second-stage fine-tuning were curated from the data sources described in Section B.3. The total computational cost for creating these SFT datasets is quantified as 5K GPU hours.
Terminology
DeepSeek-R1-Zero: A specific variant of the DeepSeek-R1 model trained via reinforcement learning without an initial supervised fine-tuning stage. It uses a batch size of 512 per step, a sampling temperature of 1 for rollout generation, and a KL coefficient of 0.001 in the RL objective.
SFT (Supervised Fine-Tuning): A training phase where the model is fine-tuned on curated datasets, such as the cold-start or second-stage data. This process employs a cosine decay learning rate scheduler and a maximum context length of 32,768 tokens.
RL (Reinforcement Learning): The core methodology for incentivizing reasoning capability in this work. It involves generating multiple outputs per question (rollouts) and optimizing against a reward signal, often involving KL penalties to maintain distribution stability.
Rollout: The process of generating multiple candidate outputs for a single input question during RL training. For DeepSeek-R1-Zero, 16 outputs are sampled per question and subsequently split into mini-batches for efficiency.
KL Coefficient: A hyper-parameter set to 0.001 in the RL configuration. It constrains the policy update to prevent the model from deviating too far from the reference model distribution, often critical for stabilizing training in RLHF.
Reward Hacking: A specific failure mode where the model exploits flaws in the reward function. It results in high reward scores that do not correspond to actual human preference or correct reasoning, particularly observed in the helpful reward model context.
LC Reward (Language Consistency Reward): A reward signal investigated via ablation studies on the DeepSeek-R1-Distill-Qwen-7B model. It is designed to ensure the language consistency of the generated text and its alignment with the query intent.
Distillation: The procedure of fine-tuning smaller base models (e.g., 1.5B to 70B) using data generated by larger teacher models. This involves specific learning rate schedules that decay to one-tenth of the initial value over 2–3 epochs.
H800 GPU: The hardware accelerator used for all large-scale training experiments mentioned in the text. The cluster comprises 64 nodes with 8 GPUs each (512 GPUs in total) and was used for both the R1-Zero and R1 training phases.
Reference Model: The baseline model against which the policy model is compared during RL training, typically via the KL divergence penalty. The reference model is replaced every 400 steps with the latest policy model to accelerate training.
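A sketch of this refresh rule, with the training-loop scaffolding assumed rather than taken from the text:

    # Illustrative fragment: every 400 optimisation steps the reference model is
    # overwritten with the current policy weights. `policy` and `reference` are
    # assumed to be torch.nn.Module-style models exposing state_dict()/load_state_dict().
    REFERENCE_REFRESH_INTERVAL = 400

    def maybe_refresh_reference(step, policy, reference):
        if step > 0 and step % REFERENCE_REFRESH_INTERVAL == 0:
            reference.load_state_dict(policy.state_dict())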