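[Paper Digest] Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models
This survey reviews reinforcement-learning-based reasoning in large language models. It covers data construction, RL-based training, and test-time scaling for large reasoning models, and discusses OpenAI's o1 alongside open-source efforts.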
Language has long been conceived as an essential tool for human reasoning. The breakthrough of Large Language Models (LLMs) has sparked significant research interest in leveraging these models to tackle complex reasoning tasks. Researchers have moved beyond simple autoregressive token generation by introducing the concept of "thought" -- a sequence of tokens representing intermediate steps in the reasoning process. This paradigm enables LLMs to mimic complex human reasoning processes, such as tree search and reflective thinking. Recently, an emerging trend of learning to reason has applied reinforcement learning (RL) to train LLMs to master reasoning processes. This approach enables the automatic generation of high-quality reasoning trajectories through trial-and-error search algorithms, significantly expanding LLMs' reasoning capacity by providing substantially more training data. Furthermore, recent studies demonstrate that encouraging LLMs to "think" with more tokens during test-time inference can further boost reasoning accuracy. Together, train-time and test-time scaling open a new research frontier -- a path toward Large Reasoning Models. The introduction of OpenAI's o1 series marks a significant milestone in this research direction. In this survey, we present a comprehensive review of recent progress in LLM reasoning. We begin by introducing the foundational background of LLMs and then explore the key technical components driving the development of large reasoning models, with a focus on automated data construction, learning-to-reason techniques, and test-time scaling. We also analyze popular open-source projects for building large reasoning models, and conclude with open challenges and future research directions.
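Research Motivation and Objectives
- Address the need for human-like reasoning in LLMs and for reasoning models that scale.
- Survey data construction methods that use LLM-driven automation to reduce reliance on human annotation.
- Review learning-to-reason techniques, including RL, process reward models (PRMs), and alignment methods.
- Examine test-time scaling and prompting strategies for improving reasoning accuracy and robustness.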
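Proposed Approach
- Discusses automated data construction through LLM-driven search and self-improvement.
- Analyzes reinforcement learning frameworks for LLM reasoning, including RLHF, RLAIF, and Direct Preference Optimization (DPO).
- Explains the role of process reward models (PRMs) in guiding reasoning.
- Explores test-time scaling through deliberate reasoning and PRM-guided search.
- Describes prompting strategies (CoT, tree/graph-of-thoughts, ReAct, decomposition methods) and agentic workflows.
- Reviews open-source projects and OpenAI's o1 series as reference points for large reasoning models.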
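Of the training frameworks listed above, DPO is the easiest to write down compactly, so a small sketch may help make it concrete. The function below computes the standard per-pair DPO loss from sequence log-probabilities. The function name, the argument names, and the default `beta=0.1` are illustrative assumptions and are not taken from the survey; a real implementation would work on batched tensors from a policy model and a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen_logratio - rejected_logratio))).

    Each argument is the summed log-probability of a full response under the
    policy or the frozen reference model. `beta` (assumed default) controls how
    strongly the policy is pushed away from the reference.
    """
    # Log-ratio of policy to reference for the preferred and dispreferred responses.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: the margin is 0 and the loss is log(2).
loss_neutral = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# Policy raises the chosen response and lowers the rejected one: loss drops.
loss_better = dpo_loss(-4.0, -6.0, -5.0, -5.0)
# Policy prefers the rejected response: loss rises.
loss_worse = dpo_loss(-6.0, -4.0, -5.0, -5.0)
```

The loss depends only on how the policy's preference between the two responses differs from the reference model's preference, which is why no separately trained reward model is required.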

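Experimental Results
Research Questions
- RQ1: Which learning signals and data construction methods best extend LLM reasoning capability under train-time reinforcement?
- RQ2: How do test-time strategies and PRMs affect the accuracy and reliability of reasoning?
- RQ3: What can be learned from OpenAI's o1 and open-source efforts to advance large reasoning models?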
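Key Findings
- Reinforcement learning and AI-guided data construction significantly extend LLM reasoning capability beyond what supervised fine-tuning achieves.
- Process reward models provide dense, step-by-step feedback that improves reasoning during training.
- PRM-guided test-time scaling can further improve reasoning accuracy by allowing more deliberate intermediate reasoning.
- Prompting techniques (CoT, tree/graph-of-thoughts, ReAct) and agentic workflows improve problem-solving ability and reasoning coverage.
- OpenAI's o1 and several open-source projects demonstrate practical progress toward scalable large reasoning models.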
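One simple form of the PRM-guided test-time scaling mentioned in the findings is best-of-N selection: sample several candidate reasoning chains, score every step with a PRM, and keep the chain with the best aggregate score. The sketch below illustrates that idea only. `toy_prm` is a hypothetical stand-in for a trained process reward model, and the `min` and `prod` aggregation rules are common choices assumed here rather than prescribed by the survey.

```python
from typing import Callable, List, Sequence

# A step scorer takes the steps so far and the next step, and returns a score in [0, 1].
StepScorer = Callable[[List[str], str], float]


def aggregate_step_scores(step_scores: Sequence[float], how: str = "min") -> float:
    """Collapse per-step scores into one trajectory score (assumed rules: min or product)."""
    if not step_scores:
        raise ValueError("a trajectory needs at least one step")
    if how == "min":
        # A chain is only as trustworthy as its weakest step.
        return min(step_scores)
    if how == "prod":
        result = 1.0
        for score in step_scores:
            result *= score
        return result
    raise ValueError(f"unknown aggregation: {how}")


def best_of_n(candidates: List[List[str]], score_step: StepScorer,
              how: str = "min") -> List[str]:
    """Return the candidate reasoning chain with the highest aggregated step score."""
    def trajectory_score(steps: List[str]) -> float:
        scores = [score_step(steps[:i], step) for i, step in enumerate(steps)]
        return aggregate_step_scores(scores, how)

    return max(candidates, key=trajectory_score)


def toy_prm(prefix: List[str], step: str) -> float:
    """Hypothetical scorer for illustration: distrusts any step marked as a guess."""
    return 0.2 if "guess" in step else 0.9


candidates = [
    ["2 + 3 = 5", "guess: 5 * 4 = 25"],
    ["2 + 3 = 5", "5 * 4 = 20"],
]
best = best_of_n(candidates, toy_prm)
```

In this toy run `best` is the second chain, because the first contains a low-scoring step. Spending more test-time compute here simply means sampling more candidates; search methods that expand or prune partial chains step by step follow the same scoring idea.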

This digest was generated by AI and reviewed by human editors.