QUICK REVIEW

[Paper Review] Large Language Models Can Be Easily Distracted by Irrelevant Context

Freda Shi, Xinyun Chen|arXiv (Cornell University)|Jan 31, 2023

Topic Modeling | Cited by 103

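One-Sentence Summary

The paper introduces GSM-IC, an arithmetic reasoning benchmark with irrelevant context, and shows that irrelevant context significantly degrades the prompting techniques studied; self-consistency and instructed prompting mitigate the problem but do not eliminate it.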

ABSTRACT

Large language models have achieved impressive performance on various natural language processing tasks. However, so far they have been evaluated primarily on benchmarks where all information in the input context is relevant for solving the task. In this work, we investigate the distractibility of large language models, i.e., how the model problem-solving accuracy can be influenced by irrelevant context. In particular, we introduce Grade-School Math with Irrelevant Context (GSM-IC), an arithmetic reasoning dataset with irrelevant information in the problem description. We use this benchmark to measure the distractibility of cutting-edge prompting techniques for large language models, and find that the model performance is dramatically decreased when irrelevant information is included. We also identify several approaches for mitigating this deficiency, such as decoding with self-consistency and adding to the prompt an instruction that tells the language model to ignore the irrelevant information.

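Motivation and Goals

Motivate evaluating large language models under realistic, distraction-prone inputs in which not all information is relevant.
Build GSM-IC, a dataset derived from GSM8K with irrelevant sentences inserted, to measure model sensitivity to irrelevant context.
Evaluate state-of-the-art prompting techniques on GSM-IC and quantify distractibility across models.
Identify mitigation strategies that improve robustness to irrelevant context (e.g., self-consistency, exemplars containing distractors, and instructions to ignore irrelevant context).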

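Proposed Method

Create GSM-IC by adding an irrelevant sentence to base GSM8K problems without changing the correct solution.
Evaluate prompting techniques (chain-of-thought (CoT), zero-shot chain-of-thought (0-CoT), least-to-most (LtM), and program prompting (Program)) on GSM-IC with code-davinci-002 and text-davinci-003, with and without self-consistency.
Analyze prompt design, including exemplars that contain distractors and instructions that tell the model to ignore irrelevant context.
Measure micro accuracy, macro accuracy, and normalized accuracy to quantify distractibility and robustness.
Run a breakdown analysis to identify factors of the irrelevant context (topic overlap, role name overlap, number range) and their effects.
Extend the evaluation to DROP, using football examples, to test robustness in longer contexts.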
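
The three accuracy measures named above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' evaluation code: micro accuracy is taken as the fraction of individual problems solved, macro accuracy as the fraction of base problems for which every distractor variant is solved, and normalized accuracy as an accuracy divided by the same method's accuracy on the unmodified base problems. The record layout (`base_id`, `correct`) is hypothetical.

```python
from collections import defaultdict

def micro_accuracy(records):
    """Fraction of individual problems answered correctly."""
    return sum(r["correct"] for r in records) / len(records)

def macro_accuracy(records):
    """Fraction of base problems whose variants are ALL answered correctly."""
    by_base = defaultdict(list)
    for r in records:
        by_base[r["base_id"]].append(r["correct"])
    return sum(all(flags) for flags in by_base.values()) / len(by_base)

def normalized_accuracy(accuracy, base_accuracy):
    """Accuracy relative to the method's accuracy on the base problems."""
    return accuracy / base_accuracy

# Hypothetical results: two base problems, two distractor variants each.
records = [
    {"base_id": "a", "correct": True},
    {"base_id": "a", "correct": True},
    {"base_id": "b", "correct": True},
    {"base_id": "b", "correct": False},
]

micro = micro_accuracy(records)
macro = macro_accuracy(records)
print(micro, macro, normalized_accuracy(micro, 1.0))
```

Under these definitions macro accuracy is the stricter measure: a single failed variant marks the whole base problem as unsolved, which is why it falls much faster than micro accuracy when distractors are added.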

Figure 1: Illustration of the considered factors when creating the GSM-IC dataset. Best viewed in color.
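
As a rough sketch of how one GSM-IC style example might be assembled, the snippet below inserts a single irrelevant sentence into a base problem. It assumes the distractor is placed just before the final question sentence and that sentences can be split on end punctuation; the function name, the problem, and the distractor are hypothetical and are not taken from the dataset.

```python
import re

def add_irrelevant_sentence(problem: str, distractor: str) -> str:
    """Insert `distractor` before the last sentence, assumed to be the question."""
    # Split after '.', '?' or '!' when followed by whitespace.
    sentences = re.split(r"(?<=[.?!])\s+", problem.strip())
    return " ".join(sentences[:-1] + [distractor] + sentences[-1:])

# Hypothetical base problem; the correct answer (8) must not change.
problem = "Anna has 5 apples. She buys 3 more apples. How many apples does Anna have?"
# Hypothetical distractor that reuses a role name but is off-topic.
distractor = "Anna's brother is 7 years old."

print(add_irrelevant_sentence(problem, distractor))
```

Varying the distractor along the factors in Figure 1 (for example, whether it shares the topic or a role name with the problem, or whether its number is in the same range) would give the different conditions compared in the breakdown analysis.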

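Experimental Results

Research Questions

RQ1: How does including irrelevant context affect the accuracy of current prompting techniques on arithmetic reasoning tasks?
RQ2: Can prompting strategies (self-consistency, exemplars with distractors, and instructions to ignore irrelevant context) mitigate the distraction caused by irrelevant information?
RQ3: Which factors of the irrelevant context affect model performance most, and how do model architecture and prompting style modulate this sensitivity?
RQ4: Do the robustness improvements on GSM-IC transfer to other datasets or tasks (such as DROP) and to different model families?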

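Key Findings

All prompting techniques studied are sensitive to irrelevant information, and macro accuracy drops sharply (fewer than 30% of problems are consistently solved).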
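Self-consistency substantially improves robustness; for some prompts, with 20 samples per problem, the correct answer appears among the samples for 99.7% of problems.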
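Exemplars with distractors and ignore-context instructions consistently improve robustness across prompts and models.
LtM is generally the most robust to irrelevant context in micro accuracy, but its macro accuracy gains vary with the model and prompt setup.
The breakdown analysis shows that topic overlap and in-topic distractors hurt macro accuracy the most; the numbers themselves affect macro accuracy less than lexical overlap with the original problem.
Instructed prompting (e.g., telling the model to ignore irrelevant information) yields significant gains, and the type of instruction matters: an explicit instruction to ignore irrelevant context is essential.
On DROP, LtM and its instructed variant bring improvements, suggesting the approach applies beyond GSM-IC.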
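
The self-consistency finding can be made concrete with a small sketch: several answers are sampled for the same problem, the most frequent one is kept, and a separate check reports whether the correct answer appears in any sample. The function names and the sampled answers below are hypothetical, and answer extraction from model output is assumed to have happened already.

```python
from collections import Counter

def majority_vote(sampled_answers):
    """Return the most frequent non-empty answer, or None if there is none."""
    counts = Counter(a for a in sampled_answers if a is not None)
    if not counts:
        return None
    return counts.most_common(1)[0][0]

def answer_in_samples(sampled_answers, gold):
    """True if at least one sampled answer matches the gold answer."""
    return any(a == gold for a in sampled_answers)

# Hypothetical final answers from five samples; None marks a failed extraction.
samples = ["8", "8", "15", "8", None]

print(majority_vote(samples), answer_in_samples(samples, "8"))
```

The gap between the two functions is what the finding points at: the correct answer can be present among the samples even when the majority vote picks a distracted answer.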

Figure 2: Prompt formats for the investigated techniques on the right, which are constructed from building blocks on the left (best viewed in color). The [Problem with Irrelevant Context] is obtained by adding an irrelevant sentence (italic and underlined) to the original problem description and i
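
A prompt assembled from such building blocks, with an optional instruction to ignore irrelevant information, might look like the sketch below. The instruction wording, the `Q:`/`A:` layout, and the exemplar are illustrative assumptions and are not quoted from the paper.

```python
# Hypothetical instruction wording; the paper's exact text may differ.
IGNORE_INSTRUCTION = (
    "Solve the math problem. "
    "Feel free to ignore irrelevant information in the question."
)

def build_prompt(exemplars, problem, instruction=None):
    """Join an optional instruction, solved exemplars, and the test problem."""
    parts = []
    if instruction:
        parts.append(instruction)
    for question, solution in exemplars:
        parts.append(f"Q: {question}\nA: {solution}")
    # The model is expected to continue after the final "A:".
    parts.append(f"Q: {problem}\nA:")
    return "\n\n".join(parts)

# Hypothetical exemplar that itself contains a distractor sentence.
exemplars = [(
    "Tom has 2 pens. Tom's sister is 9 years old. He buys 4 more pens. "
    "How many pens does Tom have?",
    "Tom starts with 2 pens and buys 4 more, so 2 + 4 = 6. The answer is 6.",
)]
problem = "Anna has 5 apples. Anna's brother is 7 years old. She buys 3 more apples. How many apples does Anna have?"

print(build_prompt(exemplars, problem, instruction=IGNORE_INSTRUCTION))
```

Passing `instruction=None` gives the plain few-shot prompt, so the same helper covers both the instructed and uninstructed conditions.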

This review was generated by AI and checked by human editors.