QUICK REVIEW

[Paper Review] Measuring and Narrowing the Compositionality Gap in Language Models

Ofir Press, Muru Zhang|arXiv (Cornell University)|Oct 7, 2022

Topic Modeling|Citations: 22

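One-line summary

This paper defines and measures the compositionality gap in language models, shows that the gap does not shrink with scale, and narrows it with elicitive prompting (chain of thought and self-ask) and with self-ask combined with a search engine, improving multi-hop QA performance.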

ABSTRACT

We investigate the ability of language models to perform compositional reasoning tasks where the overall solution depends on correctly composing the answers to sub-problems. We measure how often models can correctly answer all sub-problems but not generate the overall solution, a ratio we call the compositionality gap. We evaluate this ratio by asking multi-hop questions with answers that require composing multiple facts unlikely to have been observed together during pretraining. In the GPT-3 family of models, as model size increases we show that the single-hop question answering performance improves faster than the multi-hop performance does, therefore the compositionality gap does not decrease. This surprising result suggests that while more powerful models memorize and recall more factual knowledge, they show no corresponding improvement in their ability to perform this kind of compositional reasoning. We then demonstrate how elicitive prompting (such as chain of thought) narrows the compositionality gap by reasoning explicitly. We present a new method, self-ask, that further improves on chain of thought. In our method, the model explicitly asks itself (and answers) follow-up questions before answering the initial question. We finally show that self-ask's structured prompting lets us easily plug in a search engine to answer the follow-up questions, which additionally improves accuracy.

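Motivation and objectives

Quantify how often an LM answers every sub-question correctly yet gets the overall compositional question wrong (the compositionality gap).
Examine how model size and scale affect compositional reasoning performance.
Develop elicitive prompting methods that narrow the gap and improve multi-hop question answering.
Provide practical prompting and retrieval strategies for strengthening compositional QA.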
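
The gap described in the first objective can be written as a simple ratio. The following is a minimal sketch, not the authors' evaluation code. It assumes each evaluated question is stored with two hypothetical boolean fields (`sub_correct`, `full_correct`) and that the denominator is the set of questions whose sub-questions were all answered correctly.

```python
from dataclasses import dataclass


@dataclass
class Record:
    # Hypothetical field names, chosen for illustration only.
    sub_correct: bool   # True if every sub-question was answered correctly
    full_correct: bool  # True if the composed 2-hop question was answered correctly


def compositionality_gap(records):
    """Share of questions with all sub-answers right but the composed answer wrong."""
    # Only questions whose sub-questions were all answered correctly count.
    eligible = [r for r in records if r.sub_correct]
    if not eligible:
        raise ValueError("no question had all sub-questions answered correctly")
    failures = sum(1 for r in eligible if not r.full_correct)
    return failures / len(eligible)


# Toy data, not results from the paper.
toy = [
    Record(True, True),
    Record(True, False),
    Record(True, False),
    Record(False, False),  # excluded: a sub-question was missed
    Record(True, True),
    Record(True, True),
]
print(compositionality_gap(toy))
```

With this toy data, five questions are eligible and two of them fail at the composition step, so the ratio is 2/5.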

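Proposed method

Build a 2-hop compositional dataset, Compositional Celebrities (CC), for measuring the compositionality gap.
Evaluate GPT-3 family models on CC to see how the gap varies with model size and prompting style.
Apply elicitive prompting (chain of thought) and introduce self-ask, a new prompting method that decomposes a question into follow-up sub-questions.
Extend self-ask with a search engine (Self-ask + Search) so that sub-questions are answered by retrieval.
Compare against direct prompting, chain of thought, and a simple search baseline across multiple datasets (CC, 2WikiMultiHopQA, Musique, Bamboogle).
Where applicable, report accuracy and efficiency (tokens per answer).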
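
The Self-ask + Search step above can be pictured as a loop in which the model writes a follow-up question and a search engine supplies the intermediate answer. This is a rough sketch under stated assumptions, not the authors' implementation: `generate` and `search` are hypothetical callables standing in for an LM API and a search API, and the marker strings are an assumption about the wording of the prompt scaffold.

```python
from typing import Callable

# Assumed scaffold markers; check them against the paper's actual prompt.
FOLLOW_UP = "Follow up:"
INTERMEDIATE = "Intermediate answer:"
FINAL = "So the final answer is:"


def self_ask(question: str,
             generate: Callable[[str], str],
             search: Callable[[str], str],
             max_hops: int = 4) -> str:
    """Alternate between model-written follow-ups and search-provided answers.

    `generate(prompt)` is a hypothetical LM call that returns the continuation
    up to (not including) the next intermediate-answer marker.
    `search(query)` is a hypothetical retrieval call that returns a short answer.
    """
    prompt = f"Question: {question}\nAre follow up questions needed here:"
    for _ in range(max_hops):
        continuation = generate(prompt)
        prompt += continuation
        if FINAL in continuation:
            # The model has committed to an answer; return the text after the marker.
            return continuation.split(FINAL, 1)[1].strip()
        if FOLLOW_UP not in continuation:
            break  # Neither a follow-up nor a final answer: give up.
        sub_question = continuation.rsplit(FOLLOW_UP, 1)[1].strip()
        # The search result, not the model, fills in the intermediate answer.
        prompt += f"\n{INTERMEDIATE} {search(sub_question)}\n"
    return ""


# Toy stand-ins so the sketch runs without any external service.
def toy_generate(prompt: str) -> str:
    hops = prompt.count(INTERMEDIATE)
    if hops == 0:
        return " Yes.\nFollow up: Who directed Jaws?"
    if hops == 1:
        return "Follow up: Where was Steven Spielberg born?"
    return "So the final answer is: Cincinnati"


TOY_FACTS = {
    "Who directed Jaws?": "Steven Spielberg",
    "Where was Steven Spielberg born?": "Cincinnati",
}


def toy_search(query: str) -> str:
    return TOY_FACTS.get(query, "unknown")


print(self_ask("Where was the director of Jaws born?", toy_generate, toy_search))
```

Because the scaffold has fixed markers, swapping the model's own intermediate answer for a retrieved one only requires stopping generation at the marker, which is the property the abstract credits for making the search engine easy to plug in.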

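Experimental results

Research questions

RQ1: For 2-hop compositional questions, does the compositionality gap shrink as language model size increases?
RQ2: Can elicitive prompting narrow the compositionality gap compared with direct prompting and standard chain of thought?
RQ3: Does combining self-ask with a search engine further improve compositional question answering?
RQ4: How does the model's confidence in its sub-answers (perplexity) relate to compositional success?
RQ5: How do the proposed methods perform on compositional QA datasets beyond CC?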

Key findings

Main results
Method	Bamboogle	2WikiMultiHopQA	Musique
Direct prompting	17.6	25.4	5.6
Chain of Thought	46.4	29.8	12.6
Search	0.0	2.2	1.5
Search + postproc.	-	26.3	6.5
Self-ask	57.6	30.0	13.8
Self-ask + Search	60.0	40.1	15.2

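The compositionality gap stays roughly constant at around 40% across GPT-3 model sizes and prompting variants; it does not decrease with scale.
Sub-questions are answered with high accuracy while the final composed answer lags behind, which points to reliance on memorization rather than robust composition.
Elicitive prompting (chain of thought) improves performance on compositional questions over direct prompting, and self-ask improves it further by decomposing the question explicitly.
Self-ask achieves larger gains on more varied datasets (e.g., Bamboogle), and Self-ask + Search, which adds a search engine, raises accuracy further.
Self-ask and Self-ask + Search are faster than alternatives such as Least-to-Most while matching or exceeding their accuracy.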
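Across datasets, Self-ask + Search consistently improves accuracy over self-ask alone. Going by the main results table, the largest gain is on 2WikiMultiHopQA (about 10 points absolute, 30.0 to 40.1), while Bamboogle improves by 2.4 points (57.6 to 60.0).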
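
As a quick check on the size of the search-engine gain, the absolute differences can be recomputed from the main results table. This is a small sketch that only restates the table's numbers; the function name `absolute_gain` is illustrative.

```python
# Accuracy (%) copied from the main results table in this review.
RESULTS = {
    "Self-ask": {"Bamboogle": 57.6, "2WikiMultiHopQA": 30.0, "Musique": 13.8},
    "Self-ask + Search": {"Bamboogle": 60.0, "2WikiMultiHopQA": 40.1, "Musique": 15.2},
}


def absolute_gain(baseline, method):
    """Per-dataset difference in accuracy points, rounded to one decimal."""
    return {
        dataset: round(RESULTS[method][dataset] - RESULTS[baseline][dataset], 1)
        for dataset in RESULTS[baseline]
    }


gains = absolute_gain("Self-ask", "Self-ask + Search")
for dataset, gain in gains.items():
    print(f"{dataset}: +{gain} points")
```

The differences come out to 2.4 points on Bamboogle, 10.1 on 2WikiMultiHopQA, and 1.4 on Musique.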

This review was generated by AI and checked by a human editor.