QUICK REVIEW

[Paper Review] ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving

Zhibin Gou, Shao Zhi-hong|arXiv (Cornell University)|Sep 29, 2023

Topic Modeling | Cited by 20

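ONE-SENTENCE SUMMARY

ToRA combines natural language reasoning with tool-based computation to solve mathematical problems, achieving state-of-the-art results among open-source models on 10 datasets and matching or even surpassing some closed-source models on key benchmarks.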

ABSTRACT

Large language models have made significant progress in various language tasks, yet they still struggle with complex mathematics. In this paper, we propose ToRA, a series of Tool-integrated Reasoning Agents designed to solve challenging mathematical problems by seamlessly integrating natural language reasoning with the utilization of external tools (e.g., computation libraries and symbolic solvers), thereby amalgamating the analytical prowess of language and the computational efficiency of tools. To train ToRA, we curate interactive tool-use trajectories on mathematical datasets, apply imitation learning on the annotations, and propose output space shaping to further refine models' reasoning behavior. As a result, ToRA models significantly outperform open-source models on 10 mathematical reasoning datasets across all scales with 13%-19% absolute improvements on average. Notably, ToRA-7B reaches 44.6% on the competition-level dataset MATH, surpassing the best open-source model WizardMath-70B by 22% absolute. ToRA-Code-34B is also the first open-source model that achieves an accuracy exceeding 50% on MATH, which significantly outperforms GPT-4's CoT result, and is competitive with GPT-4 solving problems with programs. Additionally, we conduct a comprehensive analysis of the benefits and remaining challenges of tool interaction for mathematical reasoning, providing valuable insights for future research.

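RESEARCH MOTIVATION AND GOALS

Advance the ability of open-source models to perform advanced mathematical reasoning by coupling reasoning with external tools.
Curate interactive tool-use trajectories and train models on them through imitation learning and output space shaping.
Demonstrate that interleaving the reasoning process with programmatic tool use significantly outperforms prior approaches.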

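PROPOSED METHOD

Design a reasoning format that interleaves natural language reasoning with program-based tool use (alternating rationales r and programs a, each program followed by its tool output o).
Use GPT-4 to collect interactive tool-use trajectories on GSM8k and MATH, creating ToRA-Corpus.
Train models on ToRA-Corpus via imitation learning to predict the next rationale/program/output given a problem.
Apply output space shaping by sampling trajectories and correcting them with a teacher model, which diversifies and corrects tool-use behavior.
Fine-tune LLaMA-2 and CodeLLaMA models in the 7B–70B parameter range, yielding the ToRA and ToRA-Code series.
Evaluate on 10 mathematical reasoning datasets (GSM8k, MATH, GSM-Hard, SVAMP, TabMWP, ASDiv, SingleEQ, SingleOP, AddSub, MultiArith).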
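
The interleaved format described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `<program>` and `<output>` delimiters, the `generate` callable, and the `scripted_model` stub standing in for a fine-tuned model are all assumptions made for illustration, and the program is executed without any sandboxing.

```python
import contextlib
import io
import re

# Hypothetical delimiters; the real ToRA prompt format may differ.
PROGRAM = re.compile(r"<program>\n(.*?)</program>", re.DOTALL)


def run_program(source: str) -> str:
    """Execute a program and return what it printed (the tool output o)."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(source, {})  # unsandboxed: acceptable only in a toy example
    except Exception as exc:
        # Errors are returned as text so the model could react to them.
        return f"{type(exc).__name__}: {exc}"
    return buffer.getvalue().strip()


def solve(question, generate, max_rounds=3):
    """Alternate model steps (rationale r + program a) with tool outputs o."""
    trajectory = question
    for _ in range(max_rounds):
        step = generate(trajectory)
        trajectory += "\n" + step
        match = PROGRAM.search(step)
        if match is None:  # no program in this step: treat it as the final answer
            break
        output = run_program(match.group(1))
        trajectory += "\n<output>\n" + output + "\n</output>"
    return trajectory


def scripted_model(trajectory):
    """Stub standing in for a fine-tuned model (assumption, not ToRA itself)."""
    if "<output>" not in trajectory:
        return ("Sum the integers with a short program.\n"
                "<program>\nprint(sum(range(1, 11)))\n</program>")
    return "The answer is 55."


result = solve("What is 1 + 2 + ... + 10?", scripted_model)
print(result)
```

Here the stub needs a single tool call before answering, which is the common case the review reports (about one interaction round per problem on average).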

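EXPERIMENTAL RESULTS

Research Questions

RQ1: Does interleaving natural language reasoning with program-based tool use improve the mathematical reasoning of open-source large language models?
RQ2: Can imitation learning plus output space shaping narrow the gap with closed-source models on standard mathematical benchmarks?
RQ3: How does tool integration affect performance across model scales (7B–70B) and problem domains?
RQ4: What are the main failure modes and challenges in tool-based mathematical reasoning?

Key Findings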

Model	Size	Tools	Zero-shot	GSM8k	MATH	GSM-Hard	SVAMP	TabMWP	ASDiv	MAWPS	AVG
GPT-4	-	✗	✗	92.0	42.5	64.7	93.1	67.1	91.3	97.6	78.3
GPT-4 (PAL)	-	✓	✗	94.2	51.8	77.6	94.8	95.9	92.6	97.7	86.4
ChatGPT	-	✗	✗	80.8	35.5	55.9	83.0	69.1	87.3	94.6	72.3
ChatGPT (PAL)	-	✓	✗	78.6	38.7	67.6	77.8	79.9	81.0	89.4	73.3
WizardMath	7B	✗	✗	54.9	10.7	20.6	57.3	38.1	59.1	73.7	44.9
ToRA	7B	✓	✓	68.8	40.1	54.6	68.2	42.4	73.9	88.8	62.4
ToRA-Code	7B	✓	✓	72.6	44.6	56.0	70.4	51.6	78.7	91.3	66.5 (+19)
LLaMA-2	13B	✗	✗	24.3	6.3	13.6	43.1	39.5	56.3	70.4	36.2
ToRA	13B	✓	✓	72.7	43.0	57.3	72.9	47.2	77.2	91.3	65.9
ToRA-Code	13B	✓	✓	75.8	48.1	60.5	75.7	65.4	81.4	92.5	71.3 (+5.4)
ToRA-Code	34B	✓	✓	80.7	50.8	63.7	80.5	70.5	84.2	93.3	74.8 (+14)
ToRA	70B	✓	✓	84.3	49.7	67.2	82.7	74.0	86.8	93.8	76.9 (+13)
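
The AVG column can be sanity-checked from the per-benchmark scores. The sketch below copies three rows from the table and assumes that AVG is the unweighted mean of the seven score columns, with MAWPS presumably aggregating the four subsets listed earlier (SingleEQ, SingleOP, AddSub, MultiArith); neither assumption is stated in this review.

```python
# Seven benchmark scores per model, copied from the table above, in the order
# GSM8k, MATH, GSM-Hard, SVAMP, TabMWP, ASDiv, MAWPS.
scores = {
    "GPT-4": [92.0, 42.5, 64.7, 93.1, 67.1, 91.3, 97.6],
    "ToRA 7B": [68.8, 40.1, 54.6, 68.2, 42.4, 73.9, 88.8],
    "ToRA-Code 7B": [72.6, 44.6, 56.0, 70.4, 51.6, 78.7, 91.3],
}
reported_avg = {"GPT-4": 78.3, "ToRA 7B": 62.4, "ToRA-Code 7B": 66.5}


def mean(values):
    """Unweighted arithmetic mean (assumed to be how AVG was computed)."""
    return sum(values) / len(values)


for model, values in scores.items():
    print(f"{model}: computed {mean(values):.1f}, reported {reported_avg[model]}")
```

For these three rows the computed means agree with the reported AVG values to one decimal place.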

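ToRA and ToRA-Code consistently outperform prior open-source models across the 10 mathematical datasets at every scale, with average absolute gains of 13%-19%.
At the 7B scale, ToRA reaches 44.6% on MATH, surpassing the best prior open-source model, WizardMath-70B, by 22% absolute.
ToRA-Code-34B exceeds 50% accuracy on MATH, surpassing GPT-4's CoT result and competing with GPT-4 solving problems with programs.
Output space shaping (sampling and correction) yields notable gains, especially for smaller models, improving MATH accuracy by up to 4.5% absolute.
The interleaved format (rationale + program + tool output) consistently outperforms rationale-only and program-only baselines, with notable gains in subdomains such as Algebra and Precalculus.
ToRA performs fast zero-shot inference, averaging 1.02 tool-interaction rounds per problem.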
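
The sampling-and-correction idea behind output space shaping can be sketched roughly as follows. This is only an illustration under stated assumptions: `sample_fn`, `is_correct`, and `teacher_fix` are hypothetical callables standing in for the student model, an answer checker, and the teacher model, and the paper's actual procedure may differ in its details.

```python
def shape_output_space(problems, sample_fn, is_correct, teacher_fix, k=4):
    """Collect extra training trajectories by sampling and correcting.

    Hypothetical callables (assumptions, not a real API):
      sample_fn(question)               -> one trajectory from the student model
      is_correct(trajectory, answer)    -> True if the final answer matches
      teacher_fix(question, trajectory) -> corrected trajectory, or None
    """
    kept, seen = [], set()

    def keep(question, trajectory):
        # Skip exact duplicates so the added data stays diverse.
        if (question, trajectory) not in seen:
            seen.add((question, trajectory))
            kept.append((question, trajectory))

    for question, answer in problems:
        for _ in range(k):
            trajectory = sample_fn(question)
            if is_correct(trajectory, answer):
                keep(question, trajectory)  # valid sample: keep as is
                continue
            fixed = teacher_fix(question, trajectory)  # invalid: ask the teacher
            if fixed is not None and is_correct(fixed, answer):
                keep(question, fixed)
    return kept


# Toy usage with scripted stand-ins for the student and teacher.
samples = iter(["answer: 5", "answer: 6", "answer: 5", "answer: 4"])
extra = shape_output_space(
    problems=[("2 + 3", "5")],
    sample_fn=lambda question: next(samples),
    is_correct=lambda trajectory, answer: trajectory.endswith(answer),
    teacher_fix=lambda question, trajectory: (
        "checked again, answer: 5" if trajectory.endswith("6") else None
    ),
)
print(extra)
```

In the toy run, one sampled trajectory is kept directly, one is repaired by the teacher stub, one is dropped as a duplicate, and one is discarded because no correction is available.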

This review was generated by AI and reviewed by a human editor.