QUICK REVIEW

[Paper Review] Using Counterfactual Tasks to Evaluate the Generality of Analogical Reasoning in Large Language Models

Martha Lewis, Melanie Mitchell|arXiv (Cornell University)|Feb 14, 2024

Topic Modeling | Cited by 10

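One-sentence summary

The paper constructs counterfactual letter-string analogy problems to test whether large language models rely on general abstract reasoning or on similarity to their training data. Humans solve the problems robustly, while the GPT models' performance declines on the counterfactual variants.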

ABSTRACT

Large language models (LLMs) have performed well on several reasoning benchmarks, including ones that test analogical reasoning abilities. However, it has been debated whether they are actually performing humanlike abstract reasoning or instead employing less general processes that rely on similarity to what has been seen in their training data. Here we investigate the generality of analogy-making abilities previously claimed for LLMs (Webb, Holyoak, & Lu, 2023). We take one set of analogy problems used to evaluate LLMs and create a set of "counterfactual" variants: versions that test the same abstract reasoning abilities but that are likely dissimilar from any pre-training data. We test humans and three GPT models on both the original and counterfactual problems, and show that, while the performance of humans remains high for all the problems, the GPT models' performance declines sharply on the counterfactual set. This work provides evidence that, despite previously reported successes of LLMs on analogical reasoning, these models lack the robustness and generality of human analogy-making.

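Research motivation and goals

Assess whether LLMs exhibit humanlike, general abstract analogical reasoning that goes beyond similarity to their training data.
Test the robustness of LLM analogy solving using counterfactual alphabets and non-letter symbols.
Compare human performance with GPT-3, GPT-3.5, and GPT-4 on both the original and the counterfactual problems.
Provide a dataset and a methodology for evaluating the generality of LLM analogy-making.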

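Proposed method

Generate counterfactual analogy problems by permuting n letters of the alphabet, with n in {0, 2, 5, 10, 20}, and by adding an alphabet of non-letter symbols.
Use the six transformation types proposed by Webb et al., together with two generalization variants, to produce 420 problems for each permutation size, plus the unpermuted case.
Evaluate humans (136 participants) and three GPT models (GPT-3, GPT-3.5, GPT-4), with the models prompted zero-shot at a fixed temperature.
Include counterfactual comprehension checks to verify that the models understand the successor and predecessor relations in each alphabet.
Analyze accuracy and error types, comparing performance across alphabet types and problem types.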
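The alphabet permutation and problem construction described above can be sketched roughly as follows. This is a minimal illustration, not the authors' generation code: the function names, the seed handling, the prompt layout, and the example symbol list are all assumptions made for this sketch, and only the successor transformation is shown.

```python
import random
import string


def permuted_alphabet(n, seed=0):
    """Return the 26 lowercase letters with n randomly chosen letters
    shuffled among themselves (hypothetical helper; n = 0 leaves the
    alphabet unchanged). A shuffled letter may land back on its own
    position, so at most n positions differ from the standard order."""
    letters = list(string.ascii_lowercase)
    rng = random.Random(seed)
    positions = rng.sample(range(len(letters)), n)
    chosen = [letters[i] for i in positions]
    rng.shuffle(chosen)
    for pos, letter in zip(positions, chosen):
        letters[pos] = letter
    return letters


def successor(alphabet, item):
    """Comprehension-check style query: the item that follows `item`
    in the given ordering, or None if it is the last one."""
    i = alphabet.index(item)
    return alphabet[i + 1] if i + 1 < len(alphabet) else None


def successor_problem(alphabet, start, length=4):
    """Build one source/target pair where the last item of the source
    string is replaced by its successor in the given ordering."""
    if start + length >= len(alphabet):
        raise ValueError("string plus successor must fit in the alphabet")
    source = alphabet[start:start + length]
    target = source[:-1] + [alphabet[start + length]]
    return source, target


def make_analogy(alphabet, src_start, qry_start, length=4):
    """Format an analogy prompt (assumed layout) and return it with
    the expected answer for the query string."""
    src, src_out = successor_problem(alphabet, src_start, length)
    qry, answer = successor_problem(alphabet, qry_start, length)
    prompt = (
        f"[{' '.join(src)}] [{' '.join(src_out)}]\n"
        f"[{' '.join(qry)}] [ ? ]"
    )
    return prompt, answer


# An arbitrary symbol ordering, chosen for illustration only.
symbol_alphabet = list("!@#$%^&*")
```

Because every helper takes the ordering as a plain list, the same code covers the standard alphabet (n = 0), the permuted alphabets, and a symbol alphabet; for example, `make_analogy(permuted_alphabet(10), 0, 8)` returns a prompt and the answer expected under that permuted ordering.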

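Experimental results

Research questions

RQ1: Does the GPT models' performance on counterfactual letter-string analogies hold up as human performance does?
RQ2: How do alphabet permutation and symbol substitution affect the GPT models' analogical reasoning?
RQ3: Is the GPT models' analogy-making robust and general, or does it depend largely on similarity to training data?

Key findings

Humans maintain high performance on both the original and the counterfactual problems, across all alphabet types.
The GPT models are accurate on the original problems but decline on the counterfactual ones, with GPT-3.5 and GPT-4 performing markedly worse than humans.
GPT performance drops when moving from the standard alphabet to permuted alphabets, and then to the symbol alphabet, indicating limited generality.
The GPT models' error patterns differ from those of humans: their wrong answers more often come from literal responses or incorrect rules than from creative alternative rules.
Overall, the results challenge the claim that GPT models solve analogies through general abstract reasoning on par with humans.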


This review was generated by AI and reviewed by a human editor.