QUICK REVIEW

[Paper Review] Mathematics, word problems, common sense, and artificial intelligence

Ernest Davis|arXiv (Cornell University)|Jan 23, 2023

Mathematics, Computing, and Information Processing | Cited by 8

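One-sentence summary

This paper analyzes the capabilities and limitations of current AI, particularly large language models, in solving mathematical word problems that require common sense and world knowledge, and reviews the approaches, benchmarks, and experimental findings in this area.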

ABSTRACT

The paper discusses the capacities and limitations of current artificial intelligence (AI) technology to solve word problems that combine elementary knowledge with commonsense reasoning. No existing AI systems can solve these reliably. We review three approaches that have been developed, using AI natural language technology: outputting the answer directly, outputting a computer program that solves the problem, and outputting a formalized representation that can be input to an automated theorem verifier. We review some benchmarks that have been developed to evaluate these systems and some experimental studies. We discuss the limitations of the existing technology at solving these kinds of problems. We argue that it is not clear whether these kinds of limitations will be important in developing AI technology for pure mathematical research, but that they will be important in applications of mathematics, and may well be important in developing programs capable of reading and understanding mathematical content written by humans.

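Research motivation and goals

Clarify which types of mathematical problems combine elementary mathematics with world knowledge and common sense.
Evaluate how current AI approaches (language models, code generation, formalization) perform on these problems.
Review benchmarks and experiments that assess AI performance on commonsense word problems.
Discuss what these limitations imply for AI-driven mathematics education and for programs that read and understand mathematical content written by humans.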

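Approach

Classify mathematical problems into categories: symbolic problems, word problems, real-world word problems, commonsense word problems (CSW), and elementary CSW.
Describe three AI approaches to solving word problems: outputting the answer directly, translating the problem into code that computes the answer, and autoformalization, which produces a formal specification that is passed to a verifier.
Summarize the characteristics of large language models (LLMs), including training, prompting, and limitations such as hallucination.
Review benchmarks (SVAMP, Līla) and their shortcomings, including data-quality concerns and capabilities that remain untested.
Present results from the empirical literature on LLM performance across problem categories, and compare in-distribution (IID) and out-of-distribution (OOD) settings.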
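
The second of the three approaches listed above, translating the problem into code, can be illustrated with a small hand-written sketch. The word problem, the function name `solve_apples`, and the helper `run_generated` are all hypothetical and are not taken from the paper; in an actual system the program text would be produced by a language model rather than written by hand.

```python
# Hypothetical word problem (an illustration, not from the paper):
#   "Ann has 3 bags with 4 apples in each. She eats 2 apples.
#    How many apples does she have left?"

# Approach 1 (direct answer) would emit only the final number as text.
# Approach 2 (program output) emits a program whose execution yields the answer.
# The string below stands in for model output; it was written by hand.
generated_program = """
def solve_apples():
    bags = 3
    apples_per_bag = 4
    eaten = 2
    return bags * apples_per_bag - eaten
"""


def run_generated(program_text, entry_point):
    """Execute the program text and call the named function it defines."""
    namespace = {}
    exec(program_text, namespace)  # only safe for trusted text, as here
    return namespace[entry_point]()


answer = run_generated(generated_program, "solve_apples")
print(answer)
```

The arithmetic is delegated to the interpreter, so errors in this approach would come from a wrong translation of the problem rather than from miscalculation.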

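Experimental results

Research questions

RQ1: What are the capabilities and limitations of current AI technology in solving word problems that combine mathematics with commonsense reasoning?
RQ2: How do the three AI approaches (direct answer, code generation, formalization) perform on commonsense mathematical word problems?
RQ3: Which benchmarks exist for evaluating AI on mathematical word problems, and what do they reveal about current capabilities and gaps?
RQ4: How do these limitations affect applications in mathematics education and the understanding of mathematical content written by humans?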

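Key findings

LLMs perform well on language tasks but are not reliable on commonsense word problems, which require integrating real-world knowledge with mathematics.
Code-generation approaches (such as Codex) can translate word problems into executable code, but they may rely on patterns from the training set and perform poorly on unusual cases or when the wording of a problem is varied.
Autoformalization, which translates text into Isabelle formalizations, succeeds only to a limited extent (roughly 25% perfect translations on the test cases).
Benchmark results vary considerably across categories; performance on basic mathematics may be higher than on broader categories such as geometry and calculus, and the difference is especially marked in out-of-distribution settings.
Current AI shows gaps in combining mathematical ability, commonsense reasoning, and reliable handling of formal mathematical content, which affects educational applications and the understanding of mathematical content.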
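
The fragility under varied wording noted above can be illustrated with a deliberately naive toy. The solver below is a made-up heuristic that adds every number it sees; it is not a model of how Codex or any LLM works, and both word problems are hypothetical rather than drawn from SVAMP or Līla. It only shows how a method keyed to surface patterns can pass an original problem and fail a small variation of it.

```python
import re


def shallow_solver(problem):
    """Toy heuristic (assumption for illustration): add every number in the text."""
    numbers = [int(n) for n in re.findall(r"\d+", problem)]
    return sum(numbers)


# Hypothetical problem and a reworded variant that changes the operation.
original = "Tom has 5 marbles and finds 3 more. How many marbles does he have?"
varied = "Tom has 5 marbles and loses 3. How many marbles does he have?"

cases = [(original, 8), (varied, 2)]

# Compare the heuristic's output with the intended answer for each case.
results = [shallow_solver(problem) == expected for problem, expected in cases]
print(results)
```

The heuristic matches the intended answer on the original wording and misses it on the variant, because the operation depends on the meaning of "loses" rather than on the numbers present.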

This review was generated by AI and reviewed by a human editor.