QUICK REVIEW

[Paper Review] Measuring and Narrowing the Compositionality Gap in Language Models

Ofir Press, Muru Zhang|arXiv (Cornell University)|Oct 7, 2022

Topic Modeling | Cited by 22

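One-Sentence Summary

The paper defines and measures the compositionality gap in language models, shows that the gap does not narrow as model scale increases, shows that elicitive prompting (such as chain of thought) narrows it, and introduces self-ask, optionally combined with a search engine, to narrow the gap further and improve multi-hop QA performance.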

ABSTRACT

We investigate the ability of language models to perform compositional reasoning tasks where the overall solution depends on correctly composing the answers to sub-problems. We measure how often models can correctly answer all sub-problems but not generate the overall solution, a ratio we call the compositionality gap. We evaluate this ratio by asking multi-hop questions with answers that require composing multiple facts unlikely to have been observed together during pretraining. In the GPT-3 family of models, as model size increases we show that the single-hop question answering performance improves faster than the multi-hop performance does, therefore the compositionality gap does not decrease. This surprising result suggests that while more powerful models memorize and recall more factual knowledge, they show no corresponding improvement in their ability to perform this kind of compositional reasoning. We then demonstrate how elicitive prompting (such as chain of thought) narrows the compositionality gap by reasoning explicitly. We present a new method, self-ask, that further improves on chain of thought. In our method, the model explicitly asks itself (and answers) follow-up questions before answering the initial question. We finally show that self-ask's structured prompting lets us easily plug in a search engine to answer the follow-up questions, which additionally improves accuracy.

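Motivation and Objectives

Quantify how often a language model answers every sub-question correctly yet still fails the composed question (the compositionality gap).
Examine how model scale affects compositional reasoning performance.
Develop elicitive prompting methods that narrow the gap and improve multi-hop question answering.
Provide practical prompting and retrieval strategies that improve compositional QA.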
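
The compositionality gap can be made concrete with a small sketch. It follows the abstract's definition: among questions whose sub-questions are all answered correctly, the fraction whose composed answer is still wrong. The `Record` layout and the toy data are assumptions made here for illustration, not the paper's data format or results.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """Per-question correctness flags (hypothetical layout for illustration)."""
    sub1_correct: bool
    sub2_correct: bool
    comp_correct: bool


def compositionality_gap(records):
    """Share of questions whose composed answer is wrong, measured only over
    the questions whose two sub-questions were both answered correctly."""
    both_right = [r for r in records if r.sub1_correct and r.sub2_correct]
    if not both_right:
        raise ValueError("no question has both sub-questions answered correctly")
    failed = sum(1 for r in both_right if not r.comp_correct)
    return failed / len(both_right)


# Toy data, made up for this example
records = [
    Record(True, True, True),
    Record(True, True, False),
    Record(True, True, False),
    Record(True, False, False),  # excluded: one sub-question was wrong
    Record(True, True, True),
    Record(True, True, True),
]
gap = compositionality_gap(records)
print(f"{gap:.0%}")  # 2 of the 5 eligible questions fail
```

Questions with a wrong sub-answer are left out of the denominator, so the ratio isolates failures of composition from failures of recall.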

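Proposed Method

Build a 2-hop compositional dataset, Compositional Celebrities (CC), to measure the compositionality gap.
Evaluate GPT-3 family models on CC to see how the gap changes with model size and prompting style.
Apply elicitive prompting (chain of thought) and introduce self-ask, a new prompting method in which the model asks and answers follow-up sub-questions before answering the original question.
Combine self-ask with a search engine (Self-ask + Search) so that follow-up questions are answered by retrieval.
Compare against direct prompting, chain of thought, and simple search baselines on several datasets (CC, 2WikiMultiHopQA, Musique, Bamboogle).
Report accuracy and, where available, efficiency (tokens generated per answer).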
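
The self-ask loop, with or without a search engine, might be sketched as follows. This is a minimal illustration, not the authors' implementation: `generate` and `search` are hypothetical callables supplied by the caller, the few-shot demonstrations that a real prompt would need are omitted, and the scripted `fake_generate` and `fake_search` stand in for a language model and a search engine.

```python
FOLLOW_UP = "Follow up:"
INTERMEDIATE = "Intermediate answer:"
FINAL = "So the final answer is:"


def self_ask(question, generate, search=None, max_hops=4):
    """Run a self-ask loop and return the final answer, or None if none is found.

    generate(prompt, stop=...) -> str : hypothetical LM call (assumption).
    search(query) -> str              : hypothetical search call (assumption);
                                        if None, the LM answers its own follow-ups.
    """
    prompt = f"Question: {question}\nAre follow up questions needed here:"
    for _ in range(max_hops):
        # With search enabled, ask the LM to stop before answering its follow-up.
        out = generate(prompt, stop=INTERMEDIATE if search else None)
        prompt += out
        if FINAL in out:
            return out.split(FINAL, 1)[1].strip()
        if search is None or FOLLOW_UP not in out:
            break  # nothing left to look up
        query = out.rsplit(FOLLOW_UP, 1)[1].strip()
        prompt += f"{INTERMEDIATE} {search(query)}\n"
    return None


def fake_generate(prompt, stop=None):
    """Scripted stand-in for an LM, keyed on how many hops are already answered."""
    script = [
        " Yes.\nFollow up: When was superconductivity discovered?\n",
        "Follow up: Who was president of the U.S. in 1911?\n",
        "So the final answer is: William Howard Taft\n",
    ]
    return script[prompt.count(INTERMEDIATE)]


def fake_search(query):
    """Canned lookup standing in for a search engine."""
    return {
        "When was superconductivity discovered?": "1911",
        "Who was president of the U.S. in 1911?": "William Howard Taft",
    }[query]


answer = self_ask(
    "Who was president of the U.S. when superconductivity was discovered?",
    fake_generate,
    fake_search,
)
print(answer)
```

Because the follow-up questions appear after a fixed marker, the search call can be slotted in without changing the prompt format, which is the property the method list above attributes to self-ask's structure.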

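Experimental Results

Research Questions

RQ1: Does the compositionality gap on 2-hop compositional questions shrink as language model size increases?
RQ2: Can elicitive prompts such as self-ask reduce the compositionality gap compared with direct prompting or standard chain of thought?
RQ3: Does integrating a search engine into self-ask further improve compositional QA?
RQ4: How does the model's confidence in its sub-answers (measured by perplexity) relate to compositional success?
RQ5: How do the proposed methods perform on compositional QA datasets beyond CC?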

Key Findings

Main results (accuracy, %)
Method	Bamboogle	2WikiMultiHopQA	Musique
Direct prompting	17.6	25.4	5.6
Chain of Thought	46.4	29.8	12.6
Search	0.0	2.2	1.5
Search + postproc.	-	26.3	6.5
Self-ask	57.6	30.0	13.8
Self-ask + Search	60.0	40.1	15.2
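
As a quick check on the table, the gain from adding search to self-ask can be computed per dataset. The figures below are copied from the Self-ask and Self-ask + Search rows; the dictionary layout is only an illustrative choice.

```python
# Accuracy (%) copied from the Self-ask and Self-ask + Search rows above
results = {
    "Bamboogle": {"Self-ask": 57.6, "Self-ask + Search": 60.0},
    "2WikiMultiHopQA": {"Self-ask": 30.0, "Self-ask + Search": 40.1},
    "Musique": {"Self-ask": 13.8, "Self-ask + Search": 15.2},
}

# Absolute gain in percentage points, rounded to one decimal place
gains = {
    name: round(row["Self-ask + Search"] - row["Self-ask"], 1)
    for name, row in results.items()
}
for name, gain in gains.items():
    print(f"{name}: +{gain} points")

largest = max(gains, key=gains.get)
print(f"Largest gain: {largest}")
```

On these figures the largest gain is on 2WikiMultiHopQA (10.1 points), with smaller gains on Bamboogle (2.4) and Musique (1.4).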

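The compositionality gap stays at roughly 40% across GPT-3 model sizes and prompting variations; it does not decrease with scale.
Sub-questions can be answered with high accuracy while the final composed answer lags behind, which suggests the models rely more on memorization than on robust composition.
Elicitive prompting (chain of thought) improves performance on compositional questions over direct prompting, and self-ask improves results further by decomposing the question explicitly.
Self-ask yields larger gains on more varied datasets such as Bamboogle (57.6 vs. 46.4 for chain of thought), and accuracy rises further when it is combined with a search engine (Self-ask + Search).
Self-ask and Self-ask + Search run faster than some alternatives (such as Least-to-Most) while giving comparable or better accuracy.
Across the reported datasets, Self-ask + Search improves on Self-ask alone. The gain is largest on 2WikiMultiHopQA, at about 10 absolute points (30.0 to 40.1), and smaller on Bamboogle (57.6 to 60.0) and Musique (13.8 to 15.2).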

This review was generated by AI and reviewed by human editors.