QUICK REVIEW

[Paper Review] Measuring Systematic Generalization in Neural Proof Generation with Transformers

Nicolas Gontier, Koustuv Sinha|PolyPublie (École Polytechnique de Montréal)|Sep 30, 2020

Topic Modeling | References: 26 | Cited by: 28

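One-Sentence Summary

This paper studies how well Transformer language models (TLMs) generalize on logical reasoning tasks by fine-tuning them to generate natural language proofs for first-order logic problems. TLMs perform well on proof lengths seen during training but generalize poorly to longer ones. Performance improves markedly when they are trained on longer, exhaustive proofs, especially with a backward-chaining strategy, and models that generate the answer directly generalize better than models that generate proofs.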

ABSTRACT

We are interested in understanding how well Transformer language models (TLMs) can perform reasoning tasks when trained on knowledge encoded in the form of natural language. We investigate their systematic generalization abilities on a logical reasoning task in natural language, which involves reasoning over relationships between entities grounded in first-order logical proofs. Specifically, we perform soft theorem-proving by leveraging TLMs to generate natural language proofs. We test the generated proofs for logical consistency, along with the accuracy of the final inference. We observe length-generalization issues when evaluated on longer-than-trained sequences. However, we observe TLMs improve their generalization performance after being exposed to longer, exhaustive proofs. In addition, we discover that TLMs are able to generalize better using backward-chaining proofs compared to their forward-chaining counterparts, while they find it easier to generate forward chaining proofs. We observe that models that are not trained to generate proofs are better at generalizing to problems based on longer proofs. This suggests that Transformers have efficient internal reasoning strategies that are harder to interpret. These results highlight the systematic generalization behavior of TLMs in the context of logical reasoning, and we believe this work motivates deeper inspection of their underlying reasoning strategies.

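Research Motivation and Objectives

Evaluate the systematic generalization ability of TLMs on logical reasoning tasks expressed in natural language.
Investigate how training on different proof structures (forward chaining vs. backward chaining) affects generalization.
Assess how proof length and the training objective (generating a proof vs. generating the answer directly) affect generalization performance.
Determine whether TLMs learn reusable reasoning strategies or rely only on surface patterns in the training data.
Explore whether interpretable, logically consistent proofs can be generated reliably for complex reasoning tasks.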

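Proposed Method

Fine-tune TLMs on the CLUTRR benchmark, which provides natural language statements together with first-order logic proofs.
Train the models with a language modeling objective to generate complete natural language proofs.
Evaluate the models on the logical consistency of the generated proofs and on the accuracy of the final inference.
Compare generalization across reasoning strategies: forward chaining, backward chaining, and no proof (generating the answer directly).
Run controlled experiments that vary proof length to test extrapolation beyond the training distribution.
Analyze attention patterns and positional dependence to understand model behavior.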
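
The forward-chaining and backward-chaining proof formats compared above can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the `RULES` table, the fact encoding, the names, and the rendered sentence template are hypothetical and are not the paper's data format or code. Forward chaining is modeled as combining facts step by step until the target relation is reached. Backward chaining is simplified here to the same steps read goal-first. The `is_consistent` check mirrors the idea of testing a generated proof for logical consistency.

```python
from typing import List, Tuple

Fact = Tuple[str, str, str]      # (a, rel, b) reads "a's rel is b"
Step = Tuple[Fact, Fact, Fact]   # (premise, premise, conclusion)

# Hypothetical kinship composition rules (not taken from the paper):
# a's r1 is b, and b's r2 is c  =>  a's RULES[(r1, r2)] is c
RULES = {
    ("father", "father"): "grandfather",
    ("father", "mother"): "grandmother",
    ("mother", "father"): "grandfather",
    ("mother", "mother"): "grandmother",
    ("grandfather", "father"): "great-grandfather",
    ("grandmother", "father"): "great-grandfather",
    ("grandfather", "mother"): "great-grandmother",
    ("grandmother", "mother"): "great-grandmother",
}


def compose(f1: Fact, f2: Fact) -> Fact:
    """Apply one composition rule to two facts that share a middle entity."""
    a, r1, b = f1
    b2, r2, c = f2
    if b != b2 or (r1, r2) not in RULES:
        raise ValueError(f"cannot compose {f1} with {f2}")
    return (a, RULES[(r1, r2)], c)


def forward_proof(chain: List[Fact]) -> List[Step]:
    """Start from the first facts and combine until the target is derived."""
    steps: List[Step] = []
    current = chain[0]
    for fact in chain[1:]:
        conclusion = compose(current, fact)
        steps.append((current, fact, conclusion))
        current = conclusion
    return steps


def backward_proof(chain: List[Fact]) -> List[Step]:
    """Simplification: the same steps, listed goal-first."""
    return list(reversed(forward_proof(chain)))


def is_consistent(steps: List[Step], facts: List[Fact]) -> bool:
    """Each premise must be a given fact or a conclusion stated somewhere in
    the proof, and each conclusion must follow from RULES. The check is
    order-insensitive, so it accepts both directions; it does not detect
    circular justification."""
    known = set(facts) | {conclusion for _, _, conclusion in steps}
    for p1, p2, conclusion in steps:
        if p1 not in known or p2 not in known:
            return False
        try:
            if compose(p1, p2) != conclusion:
                return False
        except ValueError:
            return False
    return True


def render(steps: List[Step]) -> str:
    """Turn proof steps into a hypothetical natural language template."""
    lines = []
    for (a, r1, b), (_, r2, c), (_, r3, _) in steps:
        lines.append(f"since {a}'s {r1} is {b} and {b}'s {r2} is {c}, {a}'s {r3} is {c}")
    return ". ".join(lines) + "."


# Example chain with made-up names (not from the dataset).
chain = [("Anna", "father", "Bob"), ("Bob", "mother", "Carol"), ("Carol", "father", "Dan")]
fwd = forward_proof(chain)
bwd = backward_proof(chain)
print(render(fwd))
print(render(bwd))
```

In this sketch the final answer sits in the last step of the forward proof and in the first step of the backward proof, which is one way to picture why the position of the answer differs between the two strategies.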

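Experimental Results

Research Questions

RQ1: Can TLMs generalize systematically to proof sequences longer than those seen during training?
RQ2: Does training on longer, exhaustive proofs improve generalization compared with training on short proofs?
RQ3: Is backward chaining more effective for generalization than forward chaining?
RQ4: Does training to generate proofs lead to better generalization than predicting the answer directly?
RQ5: How does the position of the answer within the proof sequence affect the model's generalization and the reliability of its reasoning?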

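Key Findings

TLMs show severe length-generalization failures and struggle to generalize to proof sequences longer than those seen during training.
Models fine-tuned on longer, exhaustive proofs generalize markedly better than models trained on short proofs.
Backward-chaining proofs are harder to generate, yet they yield better generalization than forward-chaining proofs.
Models that generate the answer directly (without a proof) generalize better than models that generate proofs, suggesting that reasoning and explanation are decoupled.
The position of the answer within the proof sequence significantly affects performance; backward chaining places the answer at the beginning, and the models handle this structure more reliably.
The logical consistency of generated proofs is frequently violated, indicating that the models can produce reasoning chains that look plausible but are invalid.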
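
The findings above rest on two measurements reported per proof length: accuracy of the final answer and validity of the generated proof, with lengths beyond the training range treated as the generalization test. A minimal sketch of such a report follows. The `Prediction` record, its field names, and `score_by_length` are hypothetical and do not reflect the authors' evaluation code.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Set


@dataclass
class Prediction:
    """One evaluated example (hypothetical record layout)."""
    proof_length: int       # number of reasoning steps in the gold proof
    predicted_answer: str
    gold_answer: str
    proof_is_valid: bool    # outcome of a separate consistency check


def score_by_length(preds: Iterable[Prediction],
                    train_lengths: Set[int]) -> Dict[int, dict]:
    """Group predictions by proof length and compute, for each length,
    answer accuracy and the share of logically valid proofs."""
    buckets = defaultdict(list)
    for p in preds:
        buckets[p.proof_length].append(p)

    report: Dict[int, dict] = {}
    for length in sorted(buckets):
        items = buckets[length]
        n = len(items)
        report[length] = {
            "n": n,
            # False marks a length the model never saw during training.
            "seen_in_training": length in train_lengths,
            "answer_accuracy": sum(p.predicted_answer == p.gold_answer for p in items) / n,
            "proof_validity": sum(p.proof_is_valid for p in items) / n,
        }
    return report
```

Keeping answer accuracy and proof validity as separate columns makes it possible to see cases where the answer is right while the proof is invalid, which is the kind of gap between reasoning and explanation the findings describe.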

This review was generated by AI and reviewed by human editors.