QUICK REVIEW

[Paper Digest] rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking

Xinyu Guan, Li Lyna Zhang|arXiv (Cornell University)|Jan 8, 2025

Machine Learning and Data Classification | Cited by 7

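One-Sentence Summary

The paper shows that, by combining self-evolved deep thinking with Monte Carlo Tree Search, a code-augmented CoT data synthesis method, and a process preference model trained with pairwise ranking, small language models can match or exceed OpenAI o1-style math reasoning.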

ABSTRACT

We present rStar-Math to demonstrate that small language models (SLMs) can rival or even surpass the math reasoning capability of OpenAI o1, without distillation from superior models. rStar-Math achieves this by exercising "deep thinking" through Monte Carlo Tree Search (MCTS), where a math policy SLM performs test-time search guided by an SLM-based process reward model. rStar-Math introduces three innovations to tackle the challenges in training the two SLMs: (1) a novel code-augmented CoT data synthesis method, which performs extensive MCTS rollouts to generate step-by-step verified reasoning trajectories used to train the policy SLM; (2) a novel process reward model training method that avoids naïve step-level score annotation, yielding a more effective process preference model (PPM); (3) a self-evolution recipe in which the policy SLM and PPM are built from scratch and iteratively evolved to improve reasoning capabilities. Through 4 rounds of self-evolution with millions of synthesized solutions for 747k math problems, rStar-Math boosts SLMs' math reasoning to state-of-the-art levels. On the MATH benchmark, it improves Qwen2.5-Math-7B from 58.8% to 90.0% and Phi3-mini-3.8B from 41.4% to 86.4%, surpassing o1-preview by +4.5% and +0.9%. On the USA Math Olympiad (AIME), rStar-Math solves an average of 53.3% (8/15) of problems, ranking among the top 20% of the brightest high school math students. Code and data will be available at https://github.com/microsoft/rStar.

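Research Motivation and Goals

Show that small language models can match or surpass state-of-the-art results on math reasoning benchmarks without distillation from larger models.
Develop a self-evolution workflow that iteratively improves the policy model and the reward model used for math reasoning.
Propose a code-augmented chain-of-thought (CoT) data synthesis method that generates reliable step-by-step trajectories.
Introduce a process preference model (PPM), trained with pairwise ranking, to provide dense and reliable step-level rewards.
Demonstrate empirical gains across multiple datasets and model sizes, approaching or even surpassing large frontier models on math benchmarks.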

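Proposed Method

Use Monte Carlo Tree Search (MCTS) with a policy SLM and a process reward model (PRM) to perform deep thinking for math reasoning.
Introduce code-augmented CoT generation, in which each step also produces Python code; only generations whose code executes successfully are kept, which filters out invalid intermediate steps.
Annotate step quality with extensive MCTS rollouts that assign a Q-value to each step, refining the step scores with terminal-guided and PRM-augmented annotation strategies.
Train the process preference model (PPM) with a pairwise Bradley-Terry ranking loss over high-Q and low-Q steps, avoiding direct reliance on noisy per-step scores.
Run a four-round self-evolution loop that starts from a seed dataset of 747k math problems and progressively strengthens the policy SLM and the PPM from scratch.
Evaluate 1.5B–7B SLMs on MATH, AIME, AMC, Olympiad Bench, and other benchmarks, comparing against OpenAI o1 and other baselines.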
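
The code-augmented filtering step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `run_step` and `filter_candidates` are hypothetical helpers, each candidate is assumed to be a (reasoning text, Python code) pair, and plain `exec` stands in for whatever sandboxed executor a real pipeline would use.

```python
from typing import Any, Dict, List, Optional, Tuple

Candidate = Tuple[str, str]  # (natural-language step, Python code for that step)


def run_step(code: str, state: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Execute one step's code on a copy of the state.

    Returns the updated namespace, or None if the code raises.
    NOTE: exec() on model-generated code is unsafe outside a sandbox;
    it is used here only to keep the sketch self-contained.
    """
    scratch = dict(state)  # copy, so a failed step leaves the caller's state untouched
    try:
        exec(code, scratch)
    except Exception:
        return None
    return scratch


def filter_candidates(
    candidates: List[Candidate], state: Dict[str, Any]
) -> List[Tuple[str, Dict[str, Any]]]:
    """Keep only the candidate steps whose code executes without error."""
    kept = []
    for text, code in candidates:
        new_state = run_step(code, state)
        if new_state is not None:
            kept.append((text, new_state))
    return kept


# Three hypothetical candidate steps for a quadratic with a=1, b=-3, c=2.
candidates = [
    ("Compute the discriminant", "disc = b**2 - 4*a*c"),
    ("Use a name that was never defined", "root = undefined_name ** 0.5"),
    ("Divide by a - 1", "bad = 1 / (a - 1)"),
]
kept = filter_candidates(candidates, {"a": 1, "b": -3, "c": 2})
for text, new_state in kept:
    print(text, new_state["disc"])  # only the first candidate survives
```

Successful execution only shows that a step is runnable, not that it is correct, which is why the method still relies on rollout-based Q-values and a reward model to rank the surviving steps.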

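Experimental Results

Research Questions

RQ1: Can small LLMs reach frontier-level math reasoning without distillation from larger models?
RQ2: How does self-evolution of the policy and reward models narrow the gap on multi-step math problem solving?
RQ3: Does code-augmented CoT data synthesis improve the quality of reasoning trajectories?
RQ4: Does a process preference model trained with pairwise ranking provide reliable step-level rewards for math reasoning?
RQ5: How does increasing the number of MCTS trajectories affect performance across math benchmarks?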

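Main Findings

rStar-Math lifts 7B-scale models to frontier-level results on challenging math benchmarks; on MATH, for example, its scores approach or exceed those of OpenAI's o1 models.
On MATH, Qwen2.5-Math-7B improves from 58.8% to 90.0% with 64 trajectories, surpassing o1-preview and matching o1-mini.
On AIME 2024, rStar-Math solves 53.3% of problems on average (8 of 15), placing it among the top 20% of the brightest high school math students.
Four rounds of self-evolution, using millions of synthesized solutions, yield progressively stronger policy models and PPMs and extend coverage to 90.25% of the 747k problems.
Code-augmented CoT with Python execution, together with MCTS-based Q-value annotation, reduces intermediate-step errors and improves trajectory quality.
The PPM trained with pairwise ranking provides reliable step-level guidance and outperforms baseline reward-model approaches in ablations.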
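
The pairwise training signal behind the PPM can be illustrated with a small Bradley-Terry sketch. It assumes that each candidate step already has a Q-value from MCTS rollouts and a scalar score from the reward model; the function names and the choice of `k` steps per side are illustrative assumptions, not details taken from the paper's code.

```python
import math
from typing import List, Tuple


def bt_pair_loss(score_preferred: float, score_rejected: float) -> float:
    """Bradley-Terry loss for one pair: -log(sigmoid(preferred - rejected))."""
    margin = score_preferred - score_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow for large |m|.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def build_pairs(q_values: List[float], k: int = 2) -> List[Tuple[int, int]]:
    """Pair the k highest-Q steps (preferred) with the k lowest-Q steps (rejected).

    Returns (preferred_index, rejected_index) tuples; pairs with tied or
    inverted Q-values are skipped. k is an assumed, illustrative parameter.
    """
    order = sorted(range(len(q_values)), key=lambda i: q_values[i], reverse=True)
    top, bottom = order[:k], order[-k:]
    return [(p, n) for p in top for n in bottom if q_values[p] > q_values[n]]


def ppm_loss(scores: List[float], q_values: List[float], k: int = 2) -> float:
    """Mean pairwise loss over the preference pairs derived from Q-values."""
    pairs = build_pairs(q_values, k)
    if not pairs:
        return 0.0
    return sum(bt_pair_loss(scores[p], scores[n]) for p, n in pairs) / len(pairs)


# Hypothetical Q-values for four candidate steps and two sets of model scores.
q_values = [0.9, 0.1, 0.8, -0.5]
agreeing_scores = [2.0, -1.0, 1.5, -2.0]   # ranks steps the same way as the Q-values
reversed_scores = [-2.0, 1.0, -1.5, 2.0]   # ranks steps the opposite way
print(ppm_loss(agreeing_scores, q_values) < ppm_loss(reversed_scores, q_values))
```

Only the ordering of the Q-values enters the loss, never their magnitude, which is how pairwise ranking sidesteps the noise in individual step scores.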

This digest was generated by AI and reviewed by a human editor.