QUICK REVIEW

[Paper Review] Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models

Fengli Xu, Qiang Hao | arXiv.org | Jan 16, 2025

Topic Modeling | Citations: 14

One-line summary

This survey examines reinforcement-based reasoning in LLMs, covering data construction, RL-based training, and test-time scaling toward large reasoning models, with reference to OpenAI's o1 and open-source efforts.

ABSTRACT

Language has long been conceived as an essential tool for human reasoning. The breakthrough of Large Language Models (LLMs) has sparked significant research interest in leveraging these models to tackle complex reasoning tasks. Researchers have moved beyond simple autoregressive token generation by introducing the concept of "thought" -- a sequence of tokens representing intermediate steps in the reasoning process. This innovative paradigm enables LLMs to mimic complex human reasoning processes, such as tree search and reflective thinking. Recently, an emerging trend of learning to reason has applied reinforcement learning (RL) to train LLMs to master reasoning processes. This approach enables the automatic generation of high-quality reasoning trajectories through trial-and-error search algorithms, significantly expanding LLMs' reasoning capacity by providing substantially more training data. Furthermore, recent studies demonstrate that encouraging LLMs to "think" with more tokens during test-time inference can further boost reasoning accuracy. Therefore, train-time and test-time scaling combine to reveal a new research frontier -- a path toward Large Reasoning Models. The introduction of OpenAI's o1 series marks a significant milestone in this research direction. In this survey, we present a comprehensive review of recent progress in LLM reasoning. We begin by introducing the foundational background of LLMs and then explore the key technical components driving the development of large reasoning models, with a focus on automated data construction, learning-to-reason techniques, and test-time scaling. We also analyze popular open-source projects for building large reasoning models, and conclude with open challenges and future research directions.

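Motivation and Objectives

Motivate the need for human-like reasoning and encourage the pursuit of scalable reasoning models.
Survey data construction approaches that reduce reliance on human annotation through LLM-driven automation.
Review learning-to-reason techniques, including RL, PRMs, and alignment methods.
Examine test-time scaling and prompting strategies that improve reasoning accuracy and robustness.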

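Approach

Discuss automated data construction through LLM-driven search and self-improvement.
Analyze reinforcement learning frameworks for LLM reasoning, including RLHF, RLAIF, and Direct Preference Optimization (DPO).
Explain the role of process reward models (PRMs) in guiding reasoning.
Explore test-time scaling through deliberate reasoning and PRM-guided search.
Describe prompting strategies (CoT, Tree of Thoughts/Graph of Thoughts, ReAct, decomposition methods) and agentic workflows.
Review open-source projects and OpenAI's o1 series as benchmarks for large reasoning models.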
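
Direct Preference Optimization is named above without its objective. The following is a minimal sketch of the per-pair DPO loss, assuming the standard formulation from the original DPO work rather than anything specific to this survey. The function name `dpo_loss`, its parameter names, and the default `beta=0.1` are illustrative assumptions; the inputs are summed sequence log-probabilities that a real training loop would obtain from the policy and a frozen reference model.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio)).

    All names and the default beta are illustrative assumptions.
    """
    # Log-ratio of policy to reference for each response.
    chosen_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_ratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(m)) equals softplus(-m); branch on sign for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities: the policy favors the chosen response
# more than the reference does, so the loss falls below log(2).
loss = dpo_loss(-5.0, -9.0, -6.0, -8.0)
```

When the policy and the reference agree exactly, the margin is zero and the loss is log(2); it shrinks as the policy raises the chosen response relative to the rejected one, which is why no separate reward model is needed in this formulation.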

Figure 1 : Illustrating different paradigms for annotating LLM reasoning data.

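Results

Research Questions

RQ1: Which learning signals and data construction methods are best suited to scaling LLM reasoning through train-time reinforcement?
RQ2: How do test-time strategies and PRMs affect the accuracy and reliability of reasoning?
RQ3: What can be learned from OpenAI's o1 and open-source efforts to advance large reasoning models?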

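Key Findings

Reinforcement learning and AI-guided data construction substantially extend LLM reasoning capabilities beyond what supervised fine-tuning achieves.
Process reward models enable dense, step-level feedback that improves reasoning during training.
Test-time scaling guided by PRMs further improves reasoning accuracy by allowing more deliberate intermediate thinking.
Prompting strategies (CoT, Tree of Thoughts/Graph of Thoughts, ReAct) and agentic workflows improve problem solving and the coverage of reasoning.
OpenAI's o1 and several open-source projects demonstrate practical progress toward scalable large reasoning models.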
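
To make the idea of step-level feedback and PRM-guided test-time selection concrete, here is a toy sketch of best-of-N selection over candidate reasoning chains. It is not the survey's method: `toy_step_score` is a hypothetical rule-based stand-in for a learned process reward model (it only checks simple integer arithmetic), and aggregating step scores with `min` is one assumed choice among several.

```python
import re

# Matches steps of the form "a op b = c" with integer operands.
STEP_PATTERN = re.compile(r"^\s*(-?\d+)\s*([+\-*])\s*(-?\d+)\s*=\s*(-?\d+)\s*$")


def toy_step_score(step):
    """Hypothetical stand-in for a learned PRM: 1.0 if the arithmetic step is correct, else 0.0."""
    match = STEP_PATTERN.match(step)
    if match is None:
        return 0.0
    a, op, b, c = int(match.group(1)), match.group(2), int(match.group(3)), int(match.group(4))
    result = {"+": a + b, "-": a - b, "*": a * b}[op]
    return 1.0 if result == c else 0.0


def chain_score(steps, step_scorer=toy_step_score):
    """Aggregate step scores with min, so a single bad step sinks the chain (an assumed choice)."""
    if not steps:
        return 0.0
    return min(step_scorer(step) for step in steps)


def best_of_n(candidates, step_scorer=toy_step_score):
    """Return the candidate chain with the highest aggregated process score."""
    return max(candidates, key=lambda steps: chain_score(steps, step_scorer))


candidates = [
    ["2 + 3 = 6", "6 * 4 = 24"],  # wrong first step, later step is locally correct
    ["2 + 3 = 5", "5 * 4 = 20"],  # every step checks out
    ["2 + 3 = 5", "5 * 4 = 25"],  # wrong final step
]
best = best_of_n(candidates)
```

Because each step is scored separately, the first candidate is rejected for its faulty first step even though its second step is arithmetically valid, which an outcome-only check on the final line could not localize. Sampling more candidates and keeping the best-scored chain is one simple way extra test-time compute can be spent.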

Figure 2 : Reward models for Train-time Reinforcement of LLM Reasoning.

This review was generated by AI and checked by a human editor.