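[Paper Explained] ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search
This article introduces ReST-MCTS*, a self-training framework that automatically labels step-by-step reasoning with an MCTS*-guided process reward model, so that the policy model and the reward model improve each other on LLM reasoning tasks.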
Recent methodologies in LLM self-training mostly rely on LLM generating responses and filtering those with correct output answers as training data. This approach often yields a low-quality fine-tuning training set (e.g., incorrect plans or intermediate reasoning). In this paper, we develop a reinforced self-training approach, called ReST-MCTS*, based on integrating process reward guidance with tree search MCTS* for collecting higher-quality reasoning traces as well as per-step value to train policy and reward models. ReST-MCTS* circumvents the per-step manual annotation typically used to train process rewards by tree-search-based reinforcement learning: Given oracle final correct answers, ReST-MCTS* is able to infer the correct process rewards by estimating the probability this step can help lead to the correct answer. These inferred rewards serve dual purposes: they act as value targets for further refining the process reward model and also facilitate the selection of high-quality traces for policy model self-training. We first show that the tree-search policy in ReST-MCTS* achieves higher accuracy compared with prior LLM reasoning baselines such as Best-of-N and Tree-of-Thought, within the same search budget. We then show that by using traces searched by this tree-search policy as training data, we can continuously enhance the three language models for multiple iterations, and outperform other self-training algorithms such as ReST$^\text{EM}$ and Self-Rewarding LM. We release all code at https://github.com/THUDM/ReST-MCTS.
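Motivation and Goals
- Obtain high-quality step-by-step reasoning rewards automatically, without dense manual annotation.
- Use MCTS*-guided search to generate and evaluate intermediate reasoning traces.
- Let the policy model and the process reward model refine each other through iterative self-training.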
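Proposed Method
- Define a process reward model $V_\theta$ that infers the step-level quality value $v_k$ of a partial solution.
- Develop MCTS*, a tree search that uses $v_k$ as the value target to guide search and partial backtracking.
- Train the policy model $\pi$ to generate reasoning traces shaped by the MCTS*-guided search.
- Iterate MuZero-style mutual self-training, updating $V_\theta$ and $\pi$ with traces that are close to a correct solution.
- Infer per-step rewards without explicit step-level labels by using rollouts in the search tree.
- Compare against Best-of-N and Tree-of-Thought baselines under the same search budget.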
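The reward-inference idea above can be illustrated with a small sketch. It is not the paper's MCTS* value formula: it assumes a simplified estimate in which the value of a partial solution is the fraction of finished leaves below it that match the oracle answer. The `Node` class, the `infer_values` and `correct_traces` helpers, and the toy tree are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Path = Tuple[str, ...]  # sequence of reasoning steps from the root


@dataclass
class Node:
    """One reasoning step in a toy search tree (hypothetical structure)."""
    step: str
    answer: Optional[str] = None          # set only on finished leaves
    children: List["Node"] = field(default_factory=list)


def infer_values(node: Node, gold: str, prefix: Path,
                 values: Dict[Path, float]) -> Tuple[int, int]:
    """Fill `values` with the fraction of leaves below each partial
    solution that reach `gold`; return (correct, total) leaf counts."""
    path = prefix + (node.step,)
    if not node.children:
        # A leaf counts as correct only if its final answer matches.
        correct, total = int(node.answer == gold), 1
    else:
        correct = total = 0
        for child in node.children:
            c, t = infer_values(child, gold, path, values)
            correct += c
            total += t
    values[path] = correct / total
    return correct, total


def correct_traces(node: Node, gold: str, prefix: Path = ()) -> List[Path]:
    """Collect complete traces whose final answer matches `gold`."""
    path = prefix + (node.step,)
    if not node.children:
        return [path] if node.answer == gold else []
    found: List[Path] = []
    for child in node.children:
        found.extend(correct_traces(child, gold, path))
    return found


# Toy tree for "What is 2 * 3?" with oracle answer "6" (illustrative data).
tree = Node("Q: 2 * 3 = ?", children=[
    Node("multiply: x = 2 * 3", children=[
        Node("x = 6", answer="6"),
        Node("x = 5", answer="5"),
    ]),
    Node("add: x = 2 + 3", children=[
        Node("x = 5", answer="5"),
    ]),
])

values: Dict[Path, float] = {}
infer_values(tree, "6", (), values)
policy_data = correct_traces(tree, "6")   # traces for policy fine-tuning
value_data = sorted(values.items())       # (partial solution, target) pairs

for path, v in value_data:
    label = " -> ".join(path[1:]) or "(root)"
    print(f"{v:.2f}  {label}")
```

In this sketch the same search tree yields both kinds of training data: `value_data` plays the role of value targets for the process reward model, and `policy_data` plays the role of filtered traces for the policy model. No step-level human label is used, only the final oracle answer.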

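Experimental Results
Research Questions
- RQ1: Can automatic process reward inference via MCTS* produce high-quality intermediate reasoning traces without manual step-level annotation?
- RQ2: Does MCTS* guided by a process reward model (PRM) improve self-training of the policy and reward models on reasoning benchmarks compared with prior methods such as ReST$^\text{EM}$ and Self-Rewarding LM?
- RQ3: Under a fixed search budget, how does ReST-MCTS* perform on math and science tasks relative to baseline reasoning strategies?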
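Key Findings
- Under the same search budget, ReST-MCTS* achieves higher accuracy than prior reasoning baselines.
- Mutual self-training of the policy and process reward models improves performance over multiple iterations and outperforms ReST$^\text{EM}$ and Self-Rewarding LM.
- The inferred per-step rewards guide tree search effectively and yield higher-quality traces for self-training.
- Compared with Self-Consistency and Best-of-N, ReST-MCTS* with MCTS* achieves improved or competitive results on multiple benchmarks across different backbone models.
- Process reward modeling with $V_\theta$ provides a stronger verification signal than some earlier reward-generation methods such as MATH-SHEPHERD.
- On the SciBench and MATH benchmarks, ReST-MCTS* shows robust improvements across several LLM backbones.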
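For readers unfamiliar with the two sampling baselines named in these findings, the following minimal sketch contrasts how they pick a final answer from the same set of sampled solutions. It is illustrative only and is not the paper's evaluation code: the sample data and `toy_score` (a stand-in for a learned reward model) are assumptions.

```python
from collections import Counter
from typing import Callable, List, Tuple

Sample = Tuple[str, str]  # (reasoning trace, final answer)


def self_consistency(samples: List[Sample]) -> str:
    """Majority vote over final answers; the reasoning is ignored."""
    counts = Counter(answer for _, answer in samples)
    return counts.most_common(1)[0][0]


def best_of_n(samples: List[Sample], score: Callable[[str], float]) -> str:
    """Return the answer of the trace that the scorer rates highest."""
    _, best_answer = max(samples, key=lambda s: score(s[0]))
    return best_answer


# Hypothetical samples for "What is 2 * 3?": two wrong, one right.
samples: List[Sample] = [
    ("2 + 3 = 5", "5"),
    ("2 + 3 = 5, so the answer is 5", "5"),
    ("2 * 3 = 6", "6"),
]


def toy_score(trace: str) -> float:
    """Stand-in for a reward model: prefers traces that multiply."""
    return 1.0 if "2 * 3" in trace else 0.1


print(self_consistency(samples))      # majority answer
print(best_of_n(samples, toy_score))  # highest-scored answer
```

Self-Consistency depends only on answer frequency, while Best-of-N depends on the quality of the scorer. Both choose among complete samples after generation, whereas a tree search such as MCTS* applies value estimates to partial solutions during generation.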

This explainer was generated by AI and reviewed by a human editor.