QUICK REVIEW

[Paper Review] Measuring Mathematical Problem Solving With the MATH Dataset

Dan Hendrycks, Collin Burns, et al. | arXiv (Cornell University) | Mar 5, 2021

Topic Modeling | References: 40 | Cited by: 272

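One-sentence summary

This paper introduces MATH, a benchmark of 12,500 problems with step-by-step solutions for measuring machine mathematical problem solving, and AMPS, a large pretraining corpus intended to improve mathematical reasoning. The results show that current models struggle on MATH and that scaling alone is not enough to solve it.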

ABSTRACT

Many intellectual endeavors require mathematical problem solving, but this skill remains beyond the capabilities of computers. To measure this ability in machine learning models, we introduce MATH, a new dataset of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution which can be used to teach models to generate answer derivations and explanations. To facilitate future research and increase accuracy on MATH, we also contribute a large auxiliary pretraining dataset which helps teach models the fundamentals of mathematics. Even though we are able to increase accuracy on MATH, our results show that accuracy remains relatively low, even with enormous Transformer models. Moreover, we find that simply increasing budgets and model parameter counts will be impractical for achieving strong mathematical reasoning if scaling trends continue. While scaling Transformers is automatically solving most other text-based tasks, scaling is not currently solving MATH. To have more traction on mathematical problem solving we will likely need new algorithmic advancements from the broader research community.

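Motivation and goals

Evaluate how well machine learning models solve a diverse set of competition-style mathematics problems.
Provide a large-scale, interpretable dataset with full step-by-step solutions to support both learning and evaluation.
Introduce AMPS, a pretraining corpus for teaching the fundamentals across a broad range of mathematical topics.
Assess how model size, pretraining, and step-by-step solutions affect performance on MATH.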

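Proposed method

Build MATH from AMC, AIME, and related competitions, covering seven subjects and difficulty levels 1–5; score by exact match on the final boxed answer.
Provide a full step-by-step solution for every problem to support learning and interpretability.
Build the AMPS pretraining corpus from Khan Academy problems and Mathematica-generated problems, with solutions formatted in LaTeX.
Pretrain autoregressive models on AMPS, then fine-tune on MATH with a mix of final-answer and full-solution objectives.
Evaluate models (GPT-2/3) under different settings: with or without AMPS pretraining, with or without step-by-step solutions as scratch space, and with or without partial solution hints.
Analyze how including step-by-step solutions at training and inference time affects performance and confidence/error detection.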
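The exact-match scoring on the final boxed answer can be sketched roughly as follows. This is a minimal illustration, not the authors' evaluation code: the function names `extract_boxed`, `normalize`, and `is_exact_match` are hypothetical, and the whitespace-only normalization is an assumption (the real benchmark may normalize answers differently).

```python
from typing import Optional


def extract_boxed(solution: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in a solution, or None."""
    marker = "\\boxed{"
    start = solution.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1  # we are already inside the opening brace of \boxed{
    chars = []
    while i < len(solution):
        ch = solution[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                # Matching close brace found; do not include it in the answer.
                return "".join(chars)
        chars.append(ch)
        i += 1
    return None  # unbalanced braces: treat as no answer


def normalize(answer: str) -> str:
    """Assumed normalization: drop all whitespace and nothing else."""
    return "".join(answer.split())


def is_exact_match(predicted_solution: str, reference_solution: str) -> bool:
    """Compare the final boxed answers of two solution strings."""
    pred = extract_boxed(predicted_solution)
    ref = extract_boxed(reference_solution)
    if pred is None or ref is None:
        return False
    return normalize(pred) == normalize(ref)
```

Because only the boxed answer is compared, a model output with flawed intermediate steps but the right final answer still counts as correct under this kind of scoring, while a correct derivation with a differently formatted answer may not.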

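Experimental results

Research questions

RQ1: How well do current language models solve high-school competition-style math problems?
RQ2: Compared with scaling alone, does AMPS pretraining significantly improve mathematical problem solving?
RQ3: Can step-by-step solutions serve as useful scratch space, and under what conditions?
RQ4: How does providing partial or complete step-by-step solutions affect model accuracy?
RQ5: Is scaling Transformers enough to reach high accuracy on MATH, or are algorithmic innovations needed?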

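Key findings

Accuracy on MATH remains low even for large Transformers such as GPT-3 175B (5.2% on average).
AMPS pretraining lets a 0.1B-parameter model match the performance of a fine-tuned 13B-parameter model, a roughly 130× difference in size.
In this setting, AMPS pretraining outperforms pretraining on Math StackExchange data.
Having the model generate a step-by-step solution at inference time can lower accuracy, suggesting that self-generated scratch work can get in the way of reaching the answer.
Training on step-by-step solutions, or supplying part of the ground-truth solution, improves performance compared with using only the problem and final answer.
Even with aggressive scaling, reaching 40% accuracy would require an impractical number of parameters, which points to the need for new algorithms.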
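The scaling argument in the last finding can be read as a log-linear extrapolation. The sketch below assumes accuracy grows linearly in the logarithm of the parameter count; the symbols $a$, $b$, $N$, and $N^{*}$ are introduced here for illustration and are not taken from the summary above.

```latex
\begin{align}
\mathrm{acc}(N) &\approx a + b \log_{10} N, \qquad b > 0, \\
\mathrm{acc}(N^{*}) = 0.40 \;&\Longrightarrow\; N^{*} = 10^{(0.40 - a)/b}.
\end{align}
```

Under this assumed form, a small slope $b$ (only a few accuracy points gained per tenfold increase in parameters) makes the exponent $(0.40 - a)/b$ large, so the required $N^{*}$ quickly becomes impractical.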


This review was generated by AI and reviewed by a human editor.