QUICK REVIEW

[Paper Review] Analysing Mathematical Reasoning Abilities of Neural Models

David Saxton, Edward Grefenstette|arXiv (Cornell University)|Apr 2, 2019

Topic Modeling | References: 28 | Cited by: 87

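Summary in Brief

The paper introduces a large, procedurally generated dataset for evaluating how well neural sequence-to-sequence models handle algebraic and symbolic reasoning, compares recurrent and Transformer architectures, and analyses how they generalize. It finds that Transformer models generally outperform recurrent models, while extrapolation, intermediate computation, and genuine algorithmic reasoning remain challenging for current models.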

ABSTRACT

Mathematical reasoning---a core ability within human intelligence---presents some unique challenges as a domain: we do not come to understand and solve mathematical problems primarily on the back of experience and evidence, but on the basis of inferring, learning, and exploiting laws, axioms, and symbol manipulation rules. In this paper, we present a new challenge for the evaluation (and eventually the design) of neural architectures and similar systems, developing a task suite of mathematics problems involving sequential questions and answers in a free-form textual input/output format. The structured nature of the mathematics domain, covering arithmetic, algebra, probability and calculus, enables the construction of training and test splits designed to clearly illuminate the capabilities and failure-modes of different architectures, as well as evaluate their ability to compose and relate knowledge and learned processes. Having described the data generation process and its potential future expansions, we conduct a comprehensive analysis of models from two broad classes of the most powerful sequence-to-sequence architectures and find notable differences in their ability to resolve mathematical problems and generalize their knowledge.

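Motivation and Objectives

Create a scalable dataset of free-form, text-based mathematics problems for probing neural reasoning and symbol manipulation.
Evaluate how well state-of-the-art sequence models generalize across problem types and in harder extrapolation settings.
Identify the strengths, weaknesses, and failure modes of these models with respect to algebraic generalization and the composition of subroutines.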

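Proposed Method

Procedurally generate diverse mathematics problems organized into modules (algebra, arithmetic, calculus, probability, and others).
Represent questions and answers as free-form character sequences, so that a wide range of problems can be expressed.
Evaluate two broad classes of models (recurrent architectures and the Transformer) on mapping an input question to an output answer.
Implement encoder-decoder structures (an attention-based mechanism for the LSTM, and a full Transformer) with autoregressive character-level decoding.
Compare architectures under a fixed computational budget (thinking steps) together with a hyperparameter search.
Score each answer by exact string match (0 or 1) per question, on both interpolation and extrapolation test sets.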
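
The generation and scoring steps listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' generator: the function names (`make_linear_equation`, `make_split`, `to_char_ids`, `exact_match_score`, `accuracy`), the question template, and the difficulty bounds (10 for interpolation, 100 for extrapolation) are all hypothetical choices made for this example.

```python
import random


def make_linear_equation(rng, max_abs):
    """Build one (question, answer) pair as plain character strings.

    Hypothetical template: the solution x is chosen first and c is derived
    from it, so every question has exactly one integer answer.
    """
    a = rng.choice([n for n in range(-max_abs, max_abs + 1) if n != 0])
    x = rng.randint(-max_abs, max_abs)
    b = rng.randint(0, max_abs)
    c = a * x + b
    return f"Solve {a}*x + {b} = {c} for x.", str(x)


def make_split(seed, size, max_abs):
    """Generate a reproducible list of (question, answer) pairs."""
    rng = random.Random(seed)
    return [make_linear_equation(rng, max_abs) for _ in range(size)]


def to_char_ids(text):
    """Character-level encoding: one integer (code point) per character."""
    return [ord(ch) for ch in text]


def exact_match_score(predicted, target):
    """Return 1 only if the predicted string equals the target exactly."""
    return 1 if predicted == target else 0


def accuracy(model, split):
    """Mean exact-match score of `model` (a question -> answer callable)."""
    return sum(exact_match_score(model(q), a) for q, a in split) / len(split)


# Assumed difficulty bounds: the extrapolation split uses larger numbers
# than anything a model trained on the interpolation range would have seen.
interpolation_set = make_split(seed=0, size=100, max_abs=10)
extrapolation_set = make_split(seed=1, size=100, max_abs=100)
```

Because scoring is a strict string comparison, `exact_match_score("4.0", "4")` returns 0: an answer that is numerically equal but formatted differently earns no credit.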

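Experimental Results

Research Questions

RQ1: Given free-form input and output, can neural sequence models learn mathematical reasoning across multiple topics and generalize it?
RQ2: What are the relative strengths and failure modes of recurrent models and Transformers on symbolic mathematics?
RQ3: To what extent do models generalize (extrapolate) to harder or larger problems not seen during training?
RQ4: When solving compositional problems, do models rely on shallow heuristics, or do they show something resembling algebraic generalization?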

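Key Findings

Transformers achieve higher average accuracy than recurrent models on many modules, particularly once enough thinking steps are allowed.
Relational Memory Cores did not outperform LSTMs and may be less data-efficient.
LSTMs with attention improve on simple LSTMs, though the gain varies by task; adding thinking steps helps some models.
Polynomial manipulation and mixed arithmetic are markedly harder, and Transformers show an advantage on some polynomial tasks.
Extrapolation performance is limited, indicating that the models have little capacity for genuine algebraic generalization beyond the training distribution.
On real exam questions, the Transformer model scored 14/40, roughly an E grade, underscoring the gap between the benchmark tasks and real-world mathematics tests.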

This review was generated by AI and reviewed by a human editor.