QUICK REVIEW

[Paper Review] Understanding R1-Zero-Like Training: A Critical Perspective

Zichen Liu, Changyu Chen | arXiv.org | Mar 26, 2025

Surgical Simulation and Training | Cited by 4

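One-Sentence Summary

This paper critically analyzes the roles of base models and RL in R1-Zero-like training, exposes an optimization bias in GRPO, proposes Dr. GRPO, and presents a minimalist recipe that reaches state-of-the-art AIME 2024 performance with a 7B model.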

ABSTRACT

DeepSeek-R1-Zero has shown that reinforcement learning (RL) at scale can directly enhance the reasoning capabilities of LLMs without supervised fine-tuning. In this work, we critically examine R1-Zero-like training by analyzing its two core components: base models and RL. We investigate a wide range of base models, including DeepSeek-V3-Base, to understand how pretraining characteristics influence RL performance. Our analysis reveals that DeepSeek-V3-Base already exhibits an "Aha moment", while Qwen2.5 base models demonstrate strong reasoning capabilities even without prompt templates, suggesting potential pretraining biases. Additionally, we identify an optimization bias in Group Relative Policy Optimization (GRPO), which artificially increases response length (especially for incorrect outputs) during training. To address this, we introduce Dr. GRPO, an unbiased optimization method that improves token efficiency while maintaining reasoning performance. Leveraging these insights, we present a minimalist R1-Zero recipe that achieves 43.3% accuracy on AIME 2024 with a 7B base model, establishing a new state of the art. Our code is available at https://github.com/sail-sg/understand-r1-zero.

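Research Motivation and Objectives

Assess how the pretraining characteristics of base models affect RL performance in R1-Zero-like training.
Identify the optimization biases in GRPO that affect response length and question-difficulty weighting.
Propose an unbiased optimization method (Dr. GRPO) that improves token efficiency without sacrificing reasoning ability.
Explore how templates, question-set coverage, and RL dynamics interact.
Demonstrate a minimalist RL recipe that achieves strong results on math benchmarks.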

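Proposed Method

Analyze base models across families, including Qwen2.5, Llama-3.1, and DeepSeek variants, on 500 math questions to assess answering ability, exploration, and self-reflection.
Analyze how GRPO's optimization biases lead to longer outputs and skewed question-difficulty weighting.
Propose Dr. GRPO by removing the length and standard-deviation normalization terms, which recovers an unbiased PPO objective.
Run empirical RL experiments with Dr. GRPO, using the Oat framework, on math datasets and standard math benchmarks.
Study how using or omitting a template, and how question-set coverage, affect RL dynamics.
Run experiments showing that domain-specific pretraining raises the RL ceiling on math tasks.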
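The normalization change behind Dr. GRPO can be illustrated with a minimal Python sketch. It assumes binary rewards for a group of responses sampled for a single question. The function names, the `eps` constant, and the use of the population standard deviation are assumptions of this sketch, not details taken from the paper's code.

```python
from statistics import mean, pstdev


def grpo_style_advantages(rewards, eps=1e-6):
    """Mean-center the group rewards, then divide by the group std.

    eps is a hypothetical stabilizer that avoids division by zero
    when every response in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std: an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


def dr_grpo_style_advantages(rewards):
    """Mean-center the group rewards only; no std normalization."""
    mu = mean(rewards)
    return [r - mu for r in rewards]


# Hypothetical groups of 8 sampled responses (1.0 = correct, 0.0 = incorrect).
hard_question = [1.0] + [0.0] * 7        # rarely solved, low reward std
medium_question = [1.0] * 4 + [0.0] * 4  # solved half the time

for name, rewards in [("hard", hard_question), ("medium", medium_question)]:
    print(name, "std-normalized:", grpo_style_advantages(rewards)[0])
    print(name, "mean-centered: ", dr_grpo_style_advantages(rewards)[0])
```

In this sketch, dividing by the group standard deviation scales up the advantages of the rarely solved question relative to the one solved half the time. That is one way a standard-deviation term can weight questions unevenly by difficulty.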

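Experimental Results

Research Questions

RQ1: Do the pretraining characteristics of base models bias RL outcomes in R1-Zero-like training?
RQ2: Does GRPO introduce length and difficulty biases that lengthen outputs (especially incorrect ones) or weight questions unevenly?
RQ3: Can Dr. GRPO provide unbiased, token-efficient RL optimization without sacrificing reasoning performance?
RQ4: How do templates and question-set coverage interact to affect RL dynamics and final performance?
RQ5: Does domain-specific pretraining raise the RL ceiling for math reasoning in R1-Zero-like training?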

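Key Findings

Qwen2.5 base models achieve a high answering rate even without a template, which points to the influence of pretraining on concatenated question-answer text.
All tested base models can already solve math problems before RL, and many already show "Aha moments" before RL.
Dr. GRPO removes the length and standard-deviation normalization biases, improving token efficiency while maintaining reasoning performance.
GRPO's length and difficulty biases can distort optimization, leading to longer outputs and uneven weighting across questions.
A minimalist RL recipe (Dr. GRPO with Qwen2.5-Math-7B and MATH-level prompts) achieves strong results with modest compute (state of the art on AIME 2024).
Domain-specific math pretraining (FineMath/NuminaQA) can raise the RL ceiling for math reasoning.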
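The length bias in the findings can be sketched numerically. Assuming each response's token losses are summed and then divided by a normalizer, the snippet below compares dividing by the response's own length with dividing by a constant shared by all responses. The function names, the advantage value, the lengths, and the choice of constant are hypothetical and only illustrate the effect.

```python
def token_weight_length_normalized(advantage: float, length: int) -> float:
    """Per-token weight when the summed loss is divided by the response length."""
    return advantage / length


def token_weight_constant_normalized(advantage: float, constant: int) -> float:
    """Per-token weight when the summed loss is divided by a shared constant."""
    return advantage / constant


# Two incorrect responses with the same negative advantage (hypothetical values).
advantage = -0.5
short_len, long_len = 100, 1000
budget = 1000  # hypothetical constant normalizer, e.g. a generation budget

short_w = token_weight_length_normalized(advantage, short_len)
long_w = token_weight_length_normalized(advantage, long_len)
short_c = token_weight_constant_normalized(advantage, budget)
long_c = token_weight_constant_normalized(advantage, budget)

print("length-normalized:  ", short_w, long_w)
print("constant-normalized:", short_c, long_c)
```

Under per-response length normalization, each token of the longer incorrect response receives a smaller penalty than each token of the shorter one, which is consistent with the abstract's observation that response length grows, especially for incorrect outputs. With a shared constant, the per-token penalty no longer depends on length.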


This review was generated by AI and reviewed by a human editor.