QUICK REVIEW

[Paper Review] Chain-of-Thought Hub: A Continuous Effort to Measure Large Language Models' Reasoning Performance

Yao Fu, Litu Ou|arXiv (Cornell University)|May 26, 2023

Topic Modeling | Citations: 15

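One-line summary

Chain-of-Thought Hub is an open-source evaluation suite that tracks the multi-step reasoning ability of many LLMs and shows how scale and RLHF affect reasoning performance. It compares leading proprietary models (GPT, Claude, PaLM) against open-source models on several benchmarks using chain-of-thought (CoT) prompting.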

ABSTRACT

As large language models (LLMs) are continuously being developed, their evaluation becomes increasingly important yet challenging. This work proposes Chain-of-Thought Hub, an open-source evaluation suite on the multi-step reasoning capabilities of large language models. We are interested in this setting for two reasons: (1) from the behavior of GPT and PaLM model family, we observe that complex reasoning is likely to be a key differentiator between weaker and stronger LLMs; (2) we envisage large language models to become the next-generation computational platform and foster an ecosystem of LLM-based new applications, this naturally requires the foundation models to perform complex tasks that often involve the composition of linguistic and logical operations. Our approach is to compile a suite of challenging reasoning benchmarks to track the progress of LLMs. Our current results show that: (1) model scale clearly correlates with reasoning capabilities; (2) As of May 2023, Claude-v1.3 and PaLM-2 are the only two models that are comparable with GPT-4, while open-sourced models still lag behind; (3) LLaMA-65B performs closely to code-davinci-002, indicating that with successful further development such as reinforcement learning from human feedback (RLHF), it has great potential to be close to GPT-3.5-Turbo. Our results also suggest that for the open-source efforts to catch up, the community may focus more on building better base models and exploring RLHF.

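Research motivation and objectives

Motivate the need to evaluate complex reasoning as LLMs scale and evolve.
Build a suite of high-quality, diverse reasoning benchmarks aligned with real-world use.
Provide a continuously updated, open-source framework for tracking progress and encouraging open model development.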

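Proposed method

Select datasets related to math, knowledge, and reasoning (GSM8k, MATH, MMLU, BigBench Hard, HumanEval, C-Eval).
Evaluate LLMs with few-shot chain-of-thought prompting rather than answer-only prompting.
Collect and compare the performance of 19 models (GPT, Claude, PaLM, LLaMA, and T5 families) on the six benchmarks.
Rank models primarily by final-answer accuracy on GSM8k, and report per-benchmark results alongside.
Distinguish open-source from closed-source models, and analyze the effects of model scale and RLHF.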
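
The two core mechanics of the method, few-shot chain-of-thought prompting and final-answer accuracy, can be illustrated with a minimal Python sketch. This is not the Chain-of-Thought Hub code: the exemplars, the "The answer is" convention, and the function names `build_cot_prompt`, `extract_final_answer`, and `final_answer_accuracy` are all assumptions made for illustration, and the model call itself is left out.

```python
import re

# Hypothetical exemplars, invented for this sketch. A real suite would use
# curated demonstrations for each benchmark.
EXEMPLARS = [
    {
        "question": "Tom has 3 boxes with 4 pens each. How many pens does he have?",
        "rationale": "Each box has 4 pens and there are 3 boxes, so 3 * 4 = 12.",
        "answer": "12",
    },
    {
        "question": "A train travels 60 km per hour for 2 hours. How far does it go?",
        "rationale": "Distance is speed times time, so 60 * 2 = 120.",
        "answer": "120",
    },
]


def build_cot_prompt(question, exemplars=EXEMPLARS):
    """Prepend worked examples (question, reasoning, answer) to a new question."""
    parts = []
    for ex in exemplars:
        parts.append(
            f"Question: {ex['question']}\n"
            f"Let's think step by step. {ex['rationale']}\n"
            f"The answer is {ex['answer']}."
        )
    # The new question ends with the reasoning cue so the model continues it.
    parts.append(f"Question: {question}\nLet's think step by step.")
    return "\n\n".join(parts)


def extract_final_answer(completion):
    """Return the number after 'The answer is', else the last number, else None."""
    text = completion.replace(",", "")  # drop thousands separators such as 1,200
    match = re.search(r"The answer is\s*\$?(-?\d+(?:\.\d+)?)", text)
    if match:
        return match.group(1)
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    return numbers[-1] if numbers else None


def final_answer_accuracy(completions, gold_answers):
    """Percentage of completions whose extracted number equals the gold answer."""
    correct = 0
    for completion, gold in zip(completions, gold_answers):
        predicted = extract_final_answer(completion)
        if predicted is not None and float(predicted) == float(gold):
            correct += 1
    return 100.0 * correct / len(gold_answers)
```

Only the final number is scored; the intermediate reasoning is not checked. Scoring this way means a completion with flawed reasoning still counts as correct if it lands on the right number.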

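Experimental results

Research questions

RQ1: How does model scale correlate with multi-step reasoning performance across the major LLM families?
RQ2: How do open-source models compare with closed-source models on chain-of-thought reasoning tasks?
RQ3: What effect does RLHF (reinforcement learning from human feedback) have on reasoning ability?
RQ4: Which datasets or task types most clearly separate stronger models from weaker ones?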

Key findings

Model	#Params	Type	GSM8k	MATH	MMLU	BBH	HumanEval	C-Eval
GPT-4	?	RLHF	92.0	42.5	86.4	-	67.0	68.7*
claude-v1.3	?	RLHF	81.8*	-	74.8*	67.3*	-	54.2*
PaLM-2	?	Base	80.7	34.3	78.3	78.1	-	-
gpt-3.5-turbo	?	RLHF	74.9*	-	67.3*	70.1*	48.1	54.4*
claude-instant-v1.0	?	RLHF	70.8*	-	-	66.9*	-	54.9*
text-davinci-003	?	RLHF	-	-	64.6	70.7	-	-
code-davinci-002	?	Base	66.6	19.1	64.5	73.7	47.0	-
Minerva	540B	SIFT	58.8	33.6	-	-	-	-
Flan-PaLM	540B	SIFT	-	-	70.9	66.3	-	-
Flan-U-PaLM	540B	SIFT	-	-	69.8	64.9	-	-
PaLM	540B	Base	56.9	8.8	62.9	62.0	26.2	-
text-davinci-002	?	SIFT	55.4	-	60.0	67.2	-	-
PaLM	64B	Base	52.4	4.4	49.0	42.3	-	-
LLaMA	65B	Base	50.9	10.6	63.4	-	23.7	38.8*
LLaMA	33B	Base	35.6	7.1	57.8	-	21.7	-
LLaMA	13B	Base	17.8	3.9	46.9	-	15.8	-
Flan-T5	11B	SIFT	16.1*	-	48.6	41.4	-	-
LLaMA	7B	Base	11.0	2.9	35.1	-	10.5	-
Flan-T5	3B	SIFT	13.5*	-	45.5	35.2	-	-

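Model scale generally correlates with reasoning performance, following a roughly log-linear trend.
The leading closed-source models (GPT-4, Claude, and PaLM-2) dominate the CoT Hub benchmarks. Most of them are RLHF-tuned, although PaLM-2 is listed as a base model. Open-source models lag behind unless they are scaled up substantially.
LLaMA-65B performs close to code-davinci-002 on several tasks, suggesting that RLHF and further refinement could close the gap to GPT-3.5-level performance.
As of May 2023, Claude-v1.3 and PaLM-2 are the only two models comparable with GPT-4, and open-source models still fall short of them.
Open-source performance currently tops out around LLaMA-65B, which suggests that larger scale and RLHF may help close the gap between open and closed models.
CoT Hub identifies two clear directions for improving open-source LLMs: building better base models and exploring RLHF.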
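
The log-linear claim can be checked roughly against the four LLaMA rows of the table above. The sketch below fits GSM8k accuracy against log10 of the parameter count by ordinary least squares. It is only an illustration on four points from a single model family, not an analysis taken from the paper, and the function name `fit_log_linear` is an assumption.

```python
import math

# (parameters in billions, GSM8k accuracy) copied from the LLaMA rows of the table.
LLAMA_GSM8K = [(7, 11.0), (13, 17.8), (33, 35.6), (65, 50.9)]


def fit_log_linear(points):
    """Least-squares fit of accuracy = slope * log10(params) + intercept."""
    xs = [math.log10(params) for params, _ in points]
    ys = [accuracy for _, accuracy in points]
    n = len(points)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)  # squared Pearson correlation
    return slope, intercept, r_squared


slope, intercept, r_squared = fit_log_linear(LLAMA_GSM8K)
print(f"accuracy gain per 10x parameters: {slope:.1f} points, R^2 = {r_squared:.3f}")
```

A high R^2 here only says that these four points lie close to a straight line in log-parameter space. It does not show that the trend continues to larger models or holds for other model families.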

This review was generated by AI and checked by a human editor.