QUICK REVIEW

[Paper Review] Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification

Aojun Zhou, Ke Wang|arXiv (Cornell University)|Aug 15, 2023

Topic Modeling | Cited by 18

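One-Sentence Summary

This paper analyzes how GPT-4 Code Interpreter generates and executes code, and proposes explicit code-based self-verification (CSV) together with verification-guided weighted majority voting to improve math problem solving; combining CSV with voting reaches 84.32% on MATH.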

ABSTRACT

Recent progress in large language models (LLMs) like GPT-4 and PaLM-2 has brought significant advancements in addressing math reasoning problems. In particular, OpenAI's latest version of GPT-4, known as GPT-4 Code Interpreter, shows remarkable performance on challenging math datasets. In this paper, we explore the effect of code on enhancing LLMs' reasoning capability by introducing different constraints on the Code Usage Frequency of GPT-4 Code Interpreter. We found that its success can be largely attributed to its powerful skills in generating and executing code, evaluating the output of code execution, and rectifying its solution when receiving unreasonable outputs. Based on this insight, we propose a novel and effective prompting method, explicit code-based self-verification (CSV), to further boost the mathematical reasoning potential of GPT-4 Code Interpreter. This method employs a zero-shot prompt on GPT-4 Code Interpreter to encourage it to use code to self-verify its answers. In instances where the verification state registers as "False", the model shall automatically amend its solution, analogous to our approach of rectifying errors during a mathematics examination. Furthermore, we recognize that the states of the verification result indicate the confidence of a solution, which can improve the effectiveness of majority voting. With GPT-4 Code Interpreter and CSV, we achieve an impressive zero-shot accuracy on MATH dataset (53.9% → 84.3%).

Research Motivation and Objectives

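Assess how code generation, code execution, and self-debugging contribute to GPT-4 Code Interpreter's math problem solving.
Investigate whether an explicit code-based self-verification (CSV) prompt improves correctness and robustness.
Develop a verification-guided weighted majority voting scheme that uses each solution's verification state during aggregation.
Provide new instruction-following datasets (MATH-code, MMLU-Math-code) to support fine-tuning of open-source models.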

Proposed Method

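Systematically analyze GPT-4 Code Interpreter's code usage under constrained prompts (no code, code allowed once, or unrestricted).
Introduce an explicit code-based self-verification (CSV) prompt that has the model generate a solution, verify it with code, and revise its reasoning when verification fails.
Implement verification-guided weighted majority voting, which assigns weights to the True/Uncertain/False verification states to improve selection of the final answer.
Evaluate on the MATH, GSM8K, and MMLU-Math datasets, showing performance gains, with ablations comparing code-based verification against natural-language verification and comparing different code usage frequencies.
Release the experimental data to support reproducibility and fine-tuning of open-source models.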
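The verification-guided weighted majority voting step can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `weighted_majority_vote`, the `(answer, state)` input format, and the weight values in `WEIGHTS` are assumptions chosen for illustration. The summary above only states that each of the True/Uncertain/False states receives a weight.

```python
from collections import defaultdict

# Illustrative weights (an assumption, not values taken from this summary):
# a solution whose verification returned True counts more than one that
# returned Uncertain, which counts more than one that returned False.
WEIGHTS = {"True": 1.0, "Uncertain": 0.5, "False": 0.2}


def weighted_majority_vote(samples, weights=WEIGHTS):
    """Pick a final answer from sampled solution paths.

    samples: list of (answer, verification_state) pairs, one per path.
    Each answer accumulates the weight of its verification state, and
    the answer with the highest total score is returned.
    """
    scores = defaultdict(float)
    for answer, state in samples:
        scores[answer] += weights[state]
    return max(scores, key=scores.get)


# Five hypothetical sampled paths for a single problem.
samples = [
    ("42", "False"),
    ("42", "False"),
    ("42", "Uncertain"),
    ("41", "True"),
    ("41", "True"),
]

# Plain majority voting is the special case where every state weighs the same.
uniform = {"True": 1.0, "Uncertain": 1.0, "False": 1.0}

print(weighted_majority_vote(samples))           # verification-weighted choice
print(weighted_majority_vote(samples, uniform))  # plain majority choice
```

In this toy input, plain majority voting selects "42" (three votes against two), while the weighted vote selects "41" because both of its paths passed verification. Ties resolve to the answer seen first, which is a simplification of this sketch.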

Experimental Results

Research Questions

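RQ1: How does GPT-4 Code Interpreter's code generation and execution contribute to solving complex math problems?
RQ2: Does explicit code-based self-verification (CSV) improve the accuracy and reliability of answers?
RQ3: Can verification-guided weighted majority voting further improve final-answer accuracy by exploiting verification states?
RQ4: How does code usage frequency affect model performance across difficulty levels and datasets?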

Key Findings

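GPT-4 Code Interpreter with code use substantially outperforms the baseline on MATH (69.69% vs. 53.90%).
Adding explicit code-based self-verification (CSV) raises MATH accuracy to 73.54%.
Combining CSV with verification-guided weighted majority voting reaches 84.32% on MATH (k=16 sampled paths).
Code usage frequency correlates positively with accuracy, especially on harder problems.
In the ablations, code-based verification outperforms natural-language verification on almost all subtopics.
With CSV and voting combined, the method also achieves state-of-the-art results on GSM8K and MMLU-Math.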
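To make the size of each step concrete, the short Python sketch below computes the gain in percentage points between consecutive configurations, using only the MATH accuracies quoted above. The stage labels and the helper name `incremental_gains` are illustrative assumptions.

```python
# MATH accuracies (%) as quoted in the findings above; labels are illustrative.
stages = [
    ("baseline", 53.90),
    ("code use", 69.69),
    ("code use + CSV", 73.54),
    ("CSV + weighted voting (k=16)", 84.32),
]


def incremental_gains(stages):
    """Return (label, gain in percentage points) for each step over the previous one."""
    return [
        (name, round(acc - prev_acc, 2))
        for (_, prev_acc), (name, acc) in zip(stages, stages[1:])
    ]


for name, gain in incremental_gains(stages):
    print(f"{name}: +{gain:.2f} points")

total = round(stages[-1][1] - stages[0][1], 2)
print(f"total: +{total:.2f} points")
```

On these figures the gains are 15.79 points from code use, 3.85 from CSV, and 10.78 from weighted voting, for 30.42 points overall.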


This review was generated by AI and reviewed by a human editor.