QUICK REVIEW

[Paper Review] A Comparative Study on Reasoning Patterns of OpenAI's o1 Model

Siwei Wu, Zhongyuan Peng|arXiv (Cornell University)|Oct 17, 2024

Business Process Modeling and Analysis | Cited by 10

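One-Line Summary

This paper compares OpenAI's o1 model with several test-time compute methods across math, coding, and commonsense tasks, analyzes six o1 reasoning patterns, and releases the code and data.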

ABSTRACT

Enabling Large Language Models (LLMs) to handle a wider range of complex tasks (e.g., coding, math) has drawn great attention from many researchers. As LLMs continue to evolve, merely increasing the number of model parameters yields diminishing performance improvements and heavy computational costs. Recently, OpenAI's o1 model has shown that inference strategies (i.e., Test-time Compute methods) can also significantly enhance the reasoning capabilities of LLMs. However, the mechanisms behind these methods are still unexplored. In our work, to investigate the reasoning patterns of o1, we compare o1 with existing Test-time Compute methods (BoN, Step-wise BoN, Agent Workflow, and Self-Refine) by using OpenAI's GPT-4o as a backbone on general reasoning benchmarks in three domains (i.e., math, coding, commonsense reasoning). Specifically, first, our experiments show that the o1 model has achieved the best performance on most datasets. Second, as for the methods of searching diverse responses (e.g., BoN), we find the reward models' capability and the search space both limit the upper boundary of these methods. Third, as for the methods that break the problem into many sub-problems, the Agent Workflow has achieved better performance than Step-wise BoN due to the domain-specific system prompt for planning better reasoning processes. Fourth, it is worth mentioning that we have summarized six reasoning patterns of o1, and provided a detailed analysis on several reasoning benchmarks.

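Motivation and Objectives

Understand how inference-time strategies affect the reasoning performance of LLMs.
Evaluate o1 against existing test-time compute methods on diverse reasoning benchmarks.
Characterize the reasoning patterns (types) that o1 exhibits across tasks.
Provide insights to guide the development of future foundation models and reasoning strategies.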

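Proposed Method

Compare BoN, Step-wise BoN, Agent Workflow, and Self-Refine on four benchmarks, using GPT-4o as the backbone.
Evaluate four baselines and four test-time compute methods on HotpotQA, Collie, USACO, and AIME.
Filter the commonsense data with a voting approach across multiple LLMs (Llama3, Qwen, Claude, Yi).
Analyze six o1 reasoning patterns: Systematic Analysis, Method Reuse, Divide and Conquer, Self-Refinement, Context Identification, and Emphasizing Constraints.
Compute and compare reasoning accuracy/scores per task, and examine the token usage of Step-wise BoN.
Release the code and datasets in the referenced GitHub repository.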
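
As a rough illustration of the simplest method in the comparison, Best-of-N (BoN) samples N candidate responses and keeps the one a reward model scores highest. The sketch below is not the authors' implementation: `best_of_n`, `toy_generate`, and `toy_reward` are hypothetical names, and the toy generator and reward stand in for a GPT-4o call and a real reward model.

```python
import itertools
from typing import Callable, Tuple


def best_of_n(
    prompt: str,
    generate: Callable[[str], str],
    reward: Callable[[str, str], float],
    n: int,
) -> Tuple[str, float]:
    """Sample n candidate responses and return the highest-reward one with its score."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(prompt) for _ in range(n)]
    scored = [(candidate, reward(prompt, candidate)) for candidate in candidates]
    # max() keeps the first candidate on ties, so the result is deterministic
    return max(scored, key=lambda pair: pair[1])


# Hypothetical stand-ins (assumptions for illustration only):
# a "model" that cycles through fixed answers, and a "reward model"
# that prefers answers numerically close to 7.
_pool = itertools.cycle(["3", "7", "5", "9"])


def toy_generate(prompt: str) -> str:
    return next(_pool)


def toy_reward(prompt: str, response: str) -> float:
    return -abs(int(response) - 7)


answer, score = best_of_n("pick a digit", toy_generate, toy_reward, n=4)
print(answer, score)
```

The selected answer can only be as good as the best of the N sampled candidates, and only if the reward function ranks that candidate first.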

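Experimental Results

Research Questions

RQ1: How does o1 perform on math, coding, and commonsense tasks compared with BoN, Step-wise BoN, Agent Workflow, and Self-Refine?
RQ2: Which inference-time strategies maximize o1's advantages, and what are their limitations?
RQ3: What are the six reasoning patterns that o1 exhibits, and how do they differ across tasks?
RQ4: How does reasoning token usage vary across tasks and strategies?
RQ5: What factors constrain the effectiveness of the reward model and the search space in BoN-type methods?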

Key Findings

Setting	Baseline	N	Overall	HotpotQA	Collie	USACO	AIME
Direct	o1-preview	-	34.32	14.59	34.07	44.60	44.00
Direct	o1-mini	-	35.77	15.32	53.53	12.23	62.00
Direct	GPT-4o	-	18.44	13.14	43.36	5.04	12.22
BoN	-	4	17.65	13.50	39.82	5.04	12.22
BoN	-	8	19.04	16.42	38.50	7.91	13.33
Step-wise BoN	-	1	6.09	13.50	5.31	0.00	5.56
Step-wise BoN	-	4	9.79	15.69	19.55	0.00	7.78
Self-Refine	-	3	5.62	13.25	0.00	0.00	9.23
Agent Workflow	-	-	24.70	14.96	46.07	22.22	15.56
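
To make the comparison with the GPT-4o backbone easier to read, the Overall column can be expressed as a gain or loss relative to direct GPT-4o (18.44). The snippet below is a small worked calculation on the numbers in the table above; the labels assume that the number listed with each method is its N, and the function name is illustrative.

```python
# Overall scores copied from the table above (test-time compute rows only).
OVERALL = {
    "BoN (N=4)": 17.65,
    "BoN (N=8)": 19.04,
    "Step-wise BoN (N=1)": 6.09,
    "Step-wise BoN (N=4)": 9.79,
    "Self-Refine (N=3)": 5.62,
    "Agent Workflow": 24.70,
}
GPT4O_DIRECT = 18.44  # Overall score of direct GPT-4o in the same table


def gains_over_baseline(scores: dict, baseline: float) -> dict:
    """Return each method's Overall score minus the baseline, rounded to 2 decimals."""
    return {name: round(score - baseline, 2) for name, score in scores.items()}


gains = gains_over_baseline(OVERALL, GPT4O_DIRECT)
# Print methods from largest gain to largest loss
for name, delta in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<22}{delta:+.2f}")
```

On these numbers only Agent Workflow (+6.26) and BoN with N=8 (+0.60) score above direct GPT-4o overall; the other settings fall below it.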

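OpenAI's o1 generally achieves the best results on the benchmarks, with especially strong performance on math and coding tasks.
Agent Workflow substantially improves performance across all benchmarks and moves closer to o1 level, whereas BoN, Step-wise BoN, and Self-Refine yield limited gains.
Six o1 reasoning patterns are identified: Systematic Analysis, Method Reuse, Divide and Conquer (DC), Self-Refinement (SR), Context Identification, and Emphasizing Constraints; DC and SR are the most prominent overall.
Step-wise BoN produces long reasoning chains and many tokens, and its effectiveness depends on task requirements and context length.
The reward model and the size of the search space strongly constrain BoN-type methods; using a human as the reward yields a notable improvement on HotpotQA.
Reasoning token length often varies by task and does not correlate simply with input prompt length alone. Harder tasks (coding/math) tend to require more reasoning tokens.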
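
One way to see why both the search space and the reward model cap BoN-type methods is a simple probability sketch. It rests on an assumption that is not from the paper: each of the N samples is independently correct with probability p. Under that assumption, even a perfect selector cannot exceed the chance that at least one sample is correct, 1 - (1 - p)^N. The value p = 0.2 below is purely hypothetical.

```python
def oracle_ceiling(p_correct: float, n: int) -> float:
    """Probability that at least one of n independent samples is correct.

    This is the best accuracy any selector (reward model) could reach
    when choosing among the n samples, under the independence assumption.
    """
    if not 0.0 <= p_correct <= 1.0:
        raise ValueError("p_correct must be in [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 - (1.0 - p_correct) ** n


# Hypothetical per-sample accuracy, chosen only for illustration
P = 0.2
for n in (1, 4, 8):
    print(n, round(oracle_ceiling(P, n), 4))
```

Increasing N raises this ceiling, but a reward model that cannot recognize the correct sample gains nothing from it: a selector that picks uniformly at random stays at p regardless of N.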

This review was generated by AI and checked by a human editor.