QUICK REVIEW

[Paper Review] Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

Mirac Süzgün, Nathan Scales|arXiv (Cornell University)|Oct 17, 2022

Topic Modeling | Citations: 42

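One-Sentence Summary

This paper introduces BIG-Bench Hard (BBH), a subset of 23 BIG-Bench tasks on which models trail average human-rater performance, and shows that chain-of-thought (CoT) prompting lets large models such as Codex and PaLM exceed the human average on many of those tasks, with the benefit of CoT emerging as model scale increases.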

ABSTRACT

BIG-Bench (Srivastava et al., 2022) is a diverse evaluation suite that focuses on tasks believed to be beyond the capabilities of current language models. Language models have already made good progress on this benchmark, with the best model in the BIG-Bench paper outperforming average reported human-rater results on 65% of the BIG-Bench tasks via few-shot prompting. But on what tasks do language models fall short of average human-rater performance, and are those tasks actually unsolvable by current language models? In this work, we focus on a suite of 23 challenging BIG-Bench tasks which we call BIG-Bench Hard (BBH). These are the tasks for which prior language model evaluations did not outperform the average human-rater. We find that applying chain-of-thought (CoT) prompting to BBH tasks enables PaLM to surpass the average human-rater performance on 10 of the 23 tasks, and Codex (code-davinci-002) to surpass the average human-rater performance on 17 of the 23 tasks. Since many tasks in BBH require multi-step reasoning, few-shot prompting without CoT, as done in the BIG-Bench evaluations (Srivastava et al., 2022), substantially underestimates the best performance and capabilities of language models, which is better captured via CoT prompting. As further analysis, we explore the interaction between CoT and model scale on BBH, finding that CoT enables emergent task performance on several BBH tasks with otherwise flat scaling curves.

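Motivation and Objectives

Identify the BIG-Bench tasks that remain especially difficult for current language models.
Evaluate whether chain-of-thought prompting improves performance beyond standard few-shot prompting.
Explore how CoT prompting interacts with model scale across several model families.
Release the BBH benchmark and its prompts publicly to support further research.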

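Proposed Method

Define BBH by filtering BIG-Bench tasks on data quality, task type, and human-baseline criteria, which yields 23 tasks.
Compare standard answer-only few-shot prompting with chain-of-thought (CoT) prompting across several model families, including Codex, InstructGPT, and PaLM.
For CoT prompting, use three exemplars whose reasoning chains include the prompt phrase "let's think step-by-step".
Evaluate the multiple-choice and exact-match tasks with greedy decoding and exact-match accuracy.
Analyze how performance scales with model size and identify tasks where new capabilities emerge under CoT prompting.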
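
The comparison described in the list above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the `Exemplar` class, the `build_prompt`, `extract_answer`, and `exact_match_accuracy` helpers, the `Q:`/`A:` layout, and the "So the answer is" marker used to locate the final answer are all assumptions introduced here. The model call itself is omitted because it depends on the API in use.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Exemplar:
    """One few-shot example (hypothetical structure, not taken from the paper)."""
    question: str
    rationale: str  # step-by-step reasoning, used only in CoT mode
    answer: str


COT_TRIGGER = "Let's think step by step."
ANSWER_MARKER = "So the answer is"  # assumed phrasing for locating the final answer


def build_prompt(exemplars: Sequence[Exemplar], question: str, use_cot: bool) -> str:
    """Build an answer-only or chain-of-thought few-shot prompt."""
    blocks: List[str] = []
    for ex in exemplars:
        if use_cot:
            target = f"{COT_TRIGGER} {ex.rationale} {ANSWER_MARKER} {ex.answer}."
        else:
            target = ex.answer
        blocks.append(f"Q: {ex.question}\nA: {target}")
    # The test question ends with an open answer slot for the model to complete.
    suffix = f" {COT_TRIGGER}" if use_cot else ""
    blocks.append(f"Q: {question}\nA:{suffix}")
    return "\n\n".join(blocks)


def extract_answer(completion: str) -> str:
    """Return the text after the last answer marker, or the whole completion if absent."""
    text = completion.strip()
    if ANSWER_MARKER in text:
        text = text.rsplit(ANSWER_MARKER, 1)[1]
    return text.strip().rstrip(".").strip()


def exact_match_accuracy(completions: Sequence[str], targets: Sequence[str]) -> float:
    """Fraction of completions whose extracted answer equals the target exactly."""
    if len(completions) != len(targets):
        raise ValueError("completions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(extract_answer(c) == t.strip() for c, t in zip(completions, targets))
    return hits / len(targets)
```

With three exemplars per task, `build_prompt(exemplars, question, use_cot=False)` gives the answer-only baseline and `use_cot=True` gives the CoT variant. Greedy decoding means the model always emits its single most likely next token, so each prompt yields one deterministic completion to score.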

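Experimental Results

Research Questions

RQ1: Which BBH tasks remain below average human-rater performance under standard prompting?
RQ2: Does CoT prompting improve performance on BBH tasks, and does the improvement depend on scale?
RQ3: On which tasks does emergent performance appear under CoT prompting as models grow larger?
RQ4: Are there tasks where CoT prompting fails to beat answer-only prompting even with larger models?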

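Key Findings

With CoT prompting, Codex surpasses average human-rater performance on 17 of the 23 BBH tasks (versus 5 of 23 with answer-only prompting).
PaLM 540B with CoT prompting also shows notable gains, surpassing average human-rater performance on 10 of the 23 tasks.
The gains from CoT depend strongly on model scale: emergent improvements appear only in sufficiently large models.
On some tasks CoT brings no improvement and can even fall below answer-only prompting, highlighting task-dependent limits of CoT.
Several tasks that show flat scaling under answer-only prompting become solvable with CoT as models grow, indicating an emergent capability.
CoT prompting tends to help multi-step reasoning and algorithmic tasks more than certain world-knowledge tasks, so the gains are mixed across tasks.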
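
Counts such as "17 of the 23 tasks" come from a per-task comparison against the human baseline, which might be tallied along the lines below. The function names, the dictionary layout, and every score in the usage example are hypothetical placeholders chosen for illustration; none of the numbers are results from the paper.

```python
from typing import Dict, List


def tasks_above_baseline(model_acc: Dict[str, float], human_acc: Dict[str, float]) -> List[str]:
    """Return the task names on which the model strictly exceeds the human baseline."""
    missing = set(human_acc) - set(model_acc)
    if missing:
        raise KeyError(f"no model score for tasks: {sorted(missing)}")
    return sorted(t for t in human_acc if model_acc[t] > human_acc[t])


def cot_gain(answer_only: Dict[str, float], cot: Dict[str, float]) -> Dict[str, float]:
    """Per-task accuracy change from answer-only to CoT prompting (negative means CoT hurts)."""
    return {t: cot[t] - answer_only[t] for t in answer_only if t in cot}


# Toy values for illustration only; they are NOT results from the paper.
human = {"task_a": 70.0, "task_b": 80.0, "task_c": 60.0}
answer_only = {"task_a": 55.0, "task_b": 82.0, "task_c": 58.0}
with_cot = {"task_a": 75.0, "task_b": 78.0, "task_c": 59.0}

print(tasks_above_baseline(answer_only, human))
print(tasks_above_baseline(with_cot, human))
print(cot_gain(answer_only, with_cot))
```

A negative entry in `cot_gain` corresponds to the case noted above where CoT falls below answer-only prompting on a given task.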


This review was generated by AI and checked by a human editor.