QUICK REVIEW

[Paper Review] Humans in Humans Out: On GPT Converging Toward Common Sense in both Success and Failure

Philipp Koralus, Vincent Wang-Maścianica|arXiv (Cornell University)|Mar 30, 2023

Explainable Artificial Intelligence (XAI)|Cited by 11

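One-Sentence Summary

Using the Erotetic Theory of Reason (ETR), the paper evaluates GPT-3, GPT-3.5, and GPT-4 on ETR61, a benchmark of 61 reasoning and judgment problems, and finds that larger models match human common-sense patterns more closely (fallacies included) and that ETR-based prompting can reduce some of those fallacies.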

ABSTRACT

Increase in computational scale and fine-tuning has seen a dramatic improvement in the quality of outputs of large language models (LLMs) like GPT. Given that both GPT-3 and GPT-4 were trained on large quantities of human-generated text, we might ask to what extent their outputs reflect patterns of human thinking, both for correct and incorrect cases. The Erotetic Theory of Reason (ETR) provides a symbolic generative model of both human success and failure in thinking, across propositional, quantified, and probabilistic reasoning, as well as decision-making. We presented GPT-3, GPT-3.5, and GPT-4 with 61 central inference and judgment problems from a recent book-length presentation of ETR, consisting of experimentally verified data-points on human judgment and extrapolated data-points predicted by ETR, with correct inference patterns as well as fallacies and framing effects (the ETR61 benchmark). ETR61 includes classics like Wason's card task, illusory inferences, the decoy effect, and opportunity-cost neglect, among others. GPT-3 showed evidence of ETR-predicted outputs for 59% of these examples, rising to 77% in GPT-3.5 and 75% in GPT-4. Remarkably, the production of human-like fallacious judgments increased from 18% in GPT-3 to 33% in GPT-3.5 and 34% in GPT-4. This suggests that larger and more advanced LLMs may develop a tendency toward more human-like mistakes, as relevant thought patterns are inherent in human-produced training data. According to ETR, the same fundamental patterns are involved both in successful and unsuccessful ordinary reasoning, so that the "bad" cases could paradoxically be learned from the "good" cases. We further present preliminary evidence that ETR-inspired prompt engineering could reduce instances of these mistakes.

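Motivation and Objectives

Investigate whether GPT models, when solving common-sense reasoning tasks, exhibit the human-like reasoning patterns predicted by the Erotetic Theory of Reason (ETR).
Evaluate how performance and susceptibility to fallacies change across GPT-3, GPT-3.5, and GPT-4 on the ETR61 benchmark.
Test whether ETR-based prompt design can reduce fallacious judgments in large language models.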

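Proposed Method

Use the ETR61 benchmark: 61 reasoning and judgment problems spanning propositional, quantified, and probabilistic reasoning as well as decision-making.
Prompt GPT-3, GPT-3.5, and GPT-4 under production and query conditions to assess correctness and endorsement of ETR-predicted conclusions.
Record correctness, and classify each output as correct production, correct endorsement, both, or fallacy.
Apply a statistical test (the Wilcoxon signed-rank test) to compare performance across model generations.
Compare production with endorsement, and check agreement with the common-sense judgments and fallacies that ETR predicts.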
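
The cross-generation comparison named in the method steps can be sketched as a paired test over per-item scores. The block below is a minimal, standard-library sketch of a two-sided Wilcoxon signed-rank test using the normal approximation. The function name `wilcoxon_signed_rank`, the 0 to 2 scoring scheme, and the toy scores are assumptions made for illustration; they are not the paper's code or data.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test (normal approximation).

    Returns (W, p) with W = min(W+, W-). Zero differences are dropped,
    tied absolute differences receive average ranks, and the variance
    includes the usual tie correction. No continuity correction is used.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences, averaging ranks within tie groups.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        t = j - i + 1
        tie_term += t**3 - t
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)

    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w - mean) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return w, p


# Hypothetical per-item scores (0 = neither, 1 = production or endorsement,
# 2 = both) for two models on the same eight items. Toy values only.
model_a_scores = [0, 1, 0, 2, 1, 0, 1, 0]
model_b_scores = [1, 2, 0, 2, 2, 1, 1, 2]

w_stat, p_value = wilcoxon_signed_rank(model_a_scores, model_b_scores)
print(f"W = {w_stat}, p = {p_value:.4f}")
```

The normal approximation is rough for so few items. In practice a library routine such as `scipy.stats.wilcoxon` would be the usual choice; the hand-rolled version only makes the ranking and tie handling explicit.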

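Experimental Results

Research Questions

RQ1: Are the outputs of GPT-3, GPT-3.5, and GPT-4 consistent with ETR's predictions for common-sense reasoning?
RQ2: How do correctness, endorsement, and consistency on ETR61 evolve across GPT generations?
RQ3: Do larger models exhibit more ETR-predicted fallacies than earlier models?
RQ4: Can simple prompt design reduce ETR-predicted fallacies in GPT models?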

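Key Findings

GPT-3.5 gave fewer correct answers than either GPT-3 or GPT-4; GPT-4 showed significantly higher correctness and consistency.
GPT-4 and GPT-3.5 produced or endorsed ETR-predicted common-sense answers more often than GPT-3.
Fallacy production rose across model generations, from 18% (GPT-3) to 33% (GPT-3.5) and 34% (GPT-4); fallacy endorsement stayed low (18% to 20%).
Overall, GPT-4 was more fallacy-prone than GPT-3, and it produced more fallacies than it endorsed.
ETR-based prompt design reduced fallacies, with GPT-3.5 showing a statistically significant reduction relative to the control prompt; the effect varied by model.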
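
The findings report fallacy production and fallacy endorsement as separate rates. A minimal sketch of how two such rates could be computed from per-item records follows. The `ItemResult` type, its field names, and the ten toy records are hypothetical, and the denominator the paper actually uses is not stated in this summary.

```python
from dataclasses import dataclass


@dataclass
class ItemResult:
    """Hypothetical per-item record for one model."""
    item_id: int
    produced_fallacy: bool   # model generated the fallacious conclusion itself
    endorsed_fallacy: bool   # model agreed with it when asked


def fallacy_rates(results):
    """Return (production_rate, endorsement_rate) as fractions of all items."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to summarise")
    produced = sum(r.produced_fallacy for r in results)
    endorsed = sum(r.endorsed_fallacy for r in results)
    return produced / n, endorsed / n


# Toy records for ten items; illustrative values, not the paper's data.
toy_results = [
    ItemResult(i, produced_fallacy=(i in {1, 4, 7}), endorsed_fallacy=(i == 4))
    for i in range(10)
]

production_rate, endorsement_rate = fallacy_rates(toy_results)
print(f"production: {production_rate:.0%}, endorsement: {endorsement_rate:.0%}")
```

Keeping the two flags separate per item is what lets a model score higher on production than on endorsement, which is the pattern the findings describe for GPT-4.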

This review was generated by AI and checked by human editors.