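[Paper Digest] Alice in Wonderland: Simple Tasks Showing Complete Reasoning Breakdown in State-Of-the-Art Large Language Models
tldr: The paper shows that state-of-the-art large language models (LLMs) frequently fail on a simple common-sense problem, the AIW problem. Many models give wrong answers with high confidence and back them with confabulated reasoning, which undermines claims about their robust reasoning capabilities.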
Large Language Models (LLMs) are often described as being instances of foundation models - that is, models that transfer strongly across various tasks and conditions in a few-shot or zero-shot manner, while exhibiting scaling laws that predict function improvement when increasing the pre-training scale. These claims of excelling in different functions and tasks rely on measurements taken across various sets of standardized benchmarks showing high scores for such models. We demonstrate here a dramatic breakdown of function and reasoning capabilities of state-of-the-art models trained at the largest available scales which claim strong function, using a simple, short, conventional common sense problem (AIW problem) formulated in concise natural language, easily solvable by humans. The breakdown is dramatic, as models show strong fluctuations across even slight problem variations that should not affect problem solving, also expressing strong overconfidence in the wrong solutions, often backed up by plausible sounding explanation-like confabulations. Various standard interventions in an attempt to get the right solution, like various types of enhanced prompting, or urging the models to reconsider the wrong solutions again by multi step re-evaluation, fail. We take these initial observations to the scientific and technological community to stimulate urgent re-assessment of the claimed capabilities of the current generation of LLMs. Such re-assessment also requires common action to create standardized benchmarks that would allow proper detection of such basic reasoning deficits that obviously manage to remain undiscovered by current state-of-the-art evaluation procedures and benchmarks. Code for reproducing experiments in the paper and raw experimental data can be found at https://github.com/LAION-AI/AIW
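Research Motivation and Objectives
- Show that a simple common-sense problem can cause a breakdown in current SOTA LLMs even though they score highly on benchmarks.
- Assess how different prompt types affect model performance on the AIW problem.
- Compare a broad range of closed-weight and open-weight LLMs to assess their claimed reasoning capabilities.
Proposed Method
- Introduce the AIW problem: "Alice has N brothers and she also has M sisters. How many sisters does Alice's brother have?", along with four AIW variations.
- Evaluate model responses by requiring a parsable final-answer format and a stated model confidence; treat each response as a Bernoulli trial and estimate p (the probability of a correct response) with a Beta-Binomial distribution.
- Use the STANDARD, THINKING, and RESTRICTED prompt types to study response quality and reasoning behavior across models.
- Test a broad range of SOTA models, from small to large scale and with both open and closed weights, in API-hosted and locally hosted setups; collect at least 30 trials per variation and prompt type.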
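
The scoring step described in the list above can be sketched as follows. This is a minimal illustration, not the authors' code: the `### Answer:` marker, the function names, and the uniform Beta(1, 1) prior are all assumptions made for the example. The reference answer follows from the problem statement itself: each brother has Alice's M sisters plus Alice, so the answer is M + 1 regardless of N.

```python
import re
from typing import Optional, Tuple


def aiw_answer(n_brothers: int, m_sisters: int) -> int:
    """Reference answer: each brother has Alice's M sisters plus Alice herself."""
    # n_brothers does not affect the answer; it is kept to mirror the problem.
    return m_sisters + 1


def parse_final_answer(response: str) -> Optional[int]:
    """Extract the integer after an assumed '### Answer:' marker, if present."""
    match = re.search(r"###\s*Answer:\s*(-?\d+)", response)
    return int(match.group(1)) if match else None


def beta_posterior(
    correct: int, trials: int, a0: float = 1.0, b0: float = 1.0
) -> Tuple[float, float, float, float]:
    """Posterior Beta(a, b) over p after `trials` Bernoulli trials.

    Assumes a Beta(a0, b0) prior (uniform by default), which is a choice
    made for this sketch. Returns (a, b, posterior mean, posterior variance).
    """
    a = a0 + correct
    b = b0 + (trials - correct)
    mean = a / (a + b)
    var = (a * b) / ((a + b) ** 2 * (a + b + 1))
    return a, b, mean, var


def score_responses(responses, n_brothers: int, m_sisters: int):
    """Count parsable, correct responses and summarize p."""
    target = aiw_answer(n_brothers, m_sisters)
    # Unparsable responses are counted as incorrect in this sketch.
    correct = sum(parse_final_answer(r) == target for r in responses)
    return beta_posterior(correct, len(responses))


if __name__ == "__main__":
    # Made-up responses for illustration only, not data from the paper.
    demo = ["She has 6 sisters. ### Answer: 6", "### Answer: 7", "no idea"]
    print(score_responses(demo, n_brothers=3, m_sisters=6))
```

Counting unparsable responses as incorrect is one possible convention; a stricter analysis might track them separately from wrong answers.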

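Experimental Results
Research Questions
- RQ1: Can current SOTA LLMs reliably solve the different AIW variations when prompted in concise natural language?
- RQ2: How do the prompt types (STANDARD, THINKING, RESTRICTED) affect the correct-response rate and the perceived quality of reasoning?
- RQ3: Is there a mismatch between performance on standardized benchmarks (e.g., MMLU) and performance on simple reasoning tasks such as AIW?
- RQ4: Are larger-scale models more robust than smaller models on AIW and the AIW+ variation?
- RQ5: Do AIW+ or reformulations (SQL, parameterized forms) reveal deeper weaknesses in reasoning and generalization?
Key Findings
- Most SOTA LLMs break down markedly on the AIW problem; many models have a correct-response rate no higher than 0.2.
- GPT-4 and Claude 3 Opus are notable exceptions with higher correct-response rates, yet they still fail in a number of trials.
- AIW+ causes a further collapse in performance for all tested models, including GPT-4 and Claude 3 Opus.
- High scores on standardized benchmarks (e.g., MMLU) are strongly mismatched with AIW performance, which weakens comparisons of models based on those benchmarks.
- Wrong answers come with overconfidence and confabulated reasoning, and models rarely correct a wrong solution across multi-turn prompting.
- Smaller-scale models are especially fragile on AIW; some larger models occasionally reason correctly but are not robust across variations.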

This digest was generated by AI and reviewed by a human editor.