QUICK REVIEW

[Paper Explainer] Selection-Inference: Exploiting Large Language Models for Interpretable Logical Reasoning

Antonia Creswell, Murray Shanahan|arXiv (Cornell University)|May 19, 2022

Topic Modeling | Cited by 110

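One-Sentence Summary

The authors propose Selection-Inference (SI), a modular framework that uses pre-trained large language models (LLMs) in a two-step loop of selection and inference to produce causal, interpretable multi-step logical reasoning; without fine-tuning and with only 5-shot prompting, it substantially outperforms vanilla and Chain-of-Thought baselines and even beats a much larger model.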

ABSTRACT

Large language models (LLMs) have been shown to be capable of impressive few-shot generalisation to new tasks. However, they still tend to perform poorly on multi-step logical reasoning problems. Here we carry out a comprehensive evaluation of LLMs on 50 tasks that probe different aspects of logical reasoning. We show that language models tend to perform fairly well at single step inference or entailment tasks, but struggle to chain together multiple reasoning steps to solve more complex problems. In light of this, we propose a Selection-Inference (SI) framework that exploits pre-trained LLMs as general processing modules, and alternates between selection and inference to generate a series of interpretable, causal reasoning steps leading to the final answer. We show that a 7B parameter LLM used within the SI framework in a 5-shot generalisation setting, with no fine-tuning, yields a performance improvement of over 100% compared to an equivalent vanilla baseline on a suite of 10 logical reasoning tasks. The same model in the same setting even outperforms a significantly larger 280B parameter baseline on the same suite of tasks. Moreover, answers produced by the SI framework are accompanied by a causal natural-language-based reasoning trace, which has important implications for the safety and trustworthiness of the system.

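Motivation and Objectives

Evaluate LLMs on a broad set of logical reasoning tasks and identify where multi-step reasoning breaks down.
Propose a modular Selection-Inference framework that improves reasoning by producing causal reasoning traces.
Demonstrate the framework's effectiveness with a 7B LLM under 5-shot prompting, compared against baselines that include a 280B model.
Show that SI yields interpretable, causal reasoning traces that help with safety, debugging, and trust.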

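Proposed Method

Decompose reasoning into iterated selection and inference steps, each implemented by a pre-trained, frozen LLM from the Gopher family.
Prompt-engineer the selection module to score facts from the context and pick a subset for a single inference step.
Use a separate inference module that generates a new fact from the selected subset without access to the question.
Chain multiple (selection, inference) steps, adding each newly inferred fact to the context, to form a causal reasoning trace.
Compare SI against a vanilla LLM, Chain-of-Thought (COT), and a larger 280B model on 10 logical reasoning tasks.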
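
The selection and inference loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `selection_inference`, `toy_select`, and `toy_infer` are hypothetical, the fixed step count `n_steps` is an assumption, and the two modules are replaced by hand-written rules so that the example runs without any model. In the paper both modules are frozen, few-shot-prompted LLMs.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces standing in for the two prompted LLM modules.
Selector = Callable[[List[str], str], List[str]]  # (facts, question) -> selected facts
Inferer = Callable[[List[str]], str]              # selected facts -> new fact (no question)
Step = Tuple[List[str], str]


def selection_inference(context: List[str], question: str,
                        select: Selector, infer: Inferer,
                        n_steps: int) -> List[Step]:
    """Alternate selection and inference; return the reasoning trace."""
    facts = list(context)          # working context, grows with inferred facts
    trace: List[Step] = []
    for _ in range(n_steps):
        selected = select(facts, question)   # selection sees context + question
        new_fact = infer(selected)           # inference sees only the selection
        trace.append((selected, new_fact))
        facts.append(new_fact)               # later steps can build on this fact
    return trace


# Toy rule-based stand-ins for the two modules (illustration only).
def toy_select(facts: List[str], question: str) -> List[str]:
    subject = question.split()[2]  # "what is gertrude afraid of" -> "gertrude"
    member = next(f for f in facts if f.startswith(subject + " is a "))
    species = member.split()[-1]
    rule = next(f for f in facts if f.startswith("a " + species + " is afraid of "))
    return [rule, member]


def toy_infer(selected: List[str]) -> str:
    rule, member = selected
    return member.split()[0] + " is afraid of " + rule.split()[-1]


context = [
    "a wolf is afraid of mice",
    "a sheep is afraid of wolves",
    "gertrude is a wolf",
    "winona is a sheep",
]
trace = selection_inference(context, "what is gertrude afraid of",
                            toy_select, toy_infer, n_steps=1)
for selected, new_fact in trace:
    print(selected, "->", new_fact)
```

Keeping the question out of `infer` mirrors the design point in the list above: each new fact can depend only on the facts that were selected, which is what makes the resulting trace causal rather than a post-hoc rationalisation.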

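Experimental Results

Research Questions

RQ1: How do LLMs perform on simple entailment tasks versus multi-step logical reasoning tasks?
RQ2: Can a modular selection-inference framework improve reasoning accuracy without fine-tuning?
RQ3: Do the reasoning traces produced by SI provide causal, interpretable justifications and allow recovery from errors?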

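Key Findings

Within the SI framework, the 7B LLM reaches 58.75% generative accuracy, compared with 2.94% for the same model used vanilla and 41.32% with COT (p<0.01 for both).
The 7B SI model also generally outperforms the 280B baseline in both the vanilla (31.19%) and COT (44.03%) settings (p<0.01 for both).
Under the easier multiple-choice evaluation, the vanilla 7B model outperforms the 280B model (57.31% vs. 51.45%), yet SI, evaluated in the harder generative setting, still surpasses it.
SI solves the bAbI 15 deduction problems with 100% accuracy using only five prompt examples.
SI shows strong performance on the ProofWriter Depth 0 and Depth 1 tasks, with statistically significant p-values.
SI produces causal, natural-language reasoning traces and can recover from errors by adding newly inferred facts.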
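
To make the size of these gaps concrete, the snippet below computes the absolute and relative differences between SI and each baseline. It uses only the generative-accuracy figures quoted in the findings above; the derived numbers are plain arithmetic on those figures rather than additional results from the paper, and the names `reported` and `gap` are illustrative.

```python
# Generative accuracy (%) as quoted in the findings above.
reported = {
    "SI 7B": 58.75,
    "vanilla 7B": 2.94,
    "COT 7B": 41.32,
    "vanilla 280B": 31.19,
    "COT 280B": 44.03,
}


def gap(baseline: str) -> tuple:
    """Return (absolute gap in percentage points, SI / baseline ratio)."""
    si = reported["SI 7B"]
    base = reported[baseline]
    return round(si - base, 2), round(si / base, 2)


for name in reported:
    if name != "SI 7B":
        points, ratio = gap(name)
        print(f"SI 7B vs {name}: +{points} points, {ratio}x")
```

On these figures SI leads the strongest baseline listed, 280B with COT, by 14.72 percentage points.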


This explainer was generated by AI and reviewed by human editors.