QUICK REVIEW

[Paper Review] WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct

Haipeng Luo, Qingfeng Sun|arXiv (Cornell University)|Aug 18, 2023

Topic Modeling | Citations: 27

One-Line Summary

WizardMath is an open-source Llama-2–based model trained with Reinforcement Learning from Evol-Instruct Feedback (RLEIF) to achieve state-of-the-art mathematical reasoning on GSM8k and MATH, surpassing many open-source models and some closed-source ones on these benchmarks.

ABSTRACT

Large language models (LLMs), such as GPT-4, have shown remarkable performance in natural language processing (NLP) tasks, including challenging mathematical reasoning. However, most existing open-source models are only pre-trained on large-scale internet data and without math-related optimization. In this paper, we present WizardMath, which enhances the mathematical CoT reasoning abilities of LLMs without using external python tools, by applying our proposed Reinforcement Learning from Evol-Instruct Feedback (RLEIF) method to the domain of math. Through extensive experiments on two mathematical reasoning benchmarks, namely GSM8k and MATH, we reveal the extraordinary capabilities of our model. Remarkably, WizardMath-Mistral 7B surpasses top-tier open-source LLMs by a substantial margin with higher data efficiency. Furthermore, WizardMath 70B even outperforms GPT-3.5-Turbo, Claude 2, Gemini Pro and GPT-4-early-version. Additionally, our preliminary exploration highlights the pivotal role of instruction evolution and process supervision in achieving exceptional math performance. For more details refer to https://github.com/nlpxucan/WizardLM

Motivation and Objectives

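Motivate the need for stronger mathematical reasoning in open-source LLMs.
Propose a new training framework, RLEIF, that combines Evol-Instruct, instruction reward modeling, and process-supervised rewards.
Demonstrate state-of-the-art performance on GSM8k and MATH relative to open-source models and some closed-source models.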

Proposed Method

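Fine-tune Llama-2 on supervised instruction-following data that includes step-by-step math solutions.
Develop a math-specific Evol-Instruct that generates diverse instructions of progressively lower or higher difficulty (downward and upward evolution).
Train two reward models: an Instruction Reward Model (IRM) that scores instruction quality, and a Process-supervised Reward Model (PRM) that gives feedback on each solution step.
Apply Proximal Policy Optimization (PPO) with the final reward r = rI × rA, where rI is the instruction reward from the IRM and rA is the answer reward from the PRM, running reinforcement learning on the evolved data.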
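The reward combination in the last step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `Sample` class, the score ranges, and the choice to aggregate per-step PRM scores with `min` are all assumptions made here, since the summary only gives the product form r = rI × rA.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Sample:
    """One (instruction, solution) pair with hypothetical reward-model outputs."""
    instruction_score: float      # rI: assumed IRM output, e.g. in [0, 1]
    step_scores: Sequence[float]  # assumed PRM outputs, one per solution step


def answer_reward(step_scores: Sequence[float]) -> float:
    """Collapse per-step PRM scores into a single rA.

    Assumption: use the minimum, so one weak step caps the whole solution.
    The summary above does not say how steps are aggregated.
    """
    if not step_scores:
        raise ValueError("need at least one step score")
    return min(step_scores)


def final_reward(sample: Sample) -> float:
    """Final scalar reward r = rI * rA, the product form stated above."""
    return sample.instruction_score * answer_reward(sample.step_scores)


sample = Sample(instruction_score=0.8, step_scores=[0.9, 0.5, 1.0])
print(final_reward(sample))
```

Because the two terms are multiplied, a low-quality instruction or a single poor reasoning step pulls the whole reward down. In a PPO loop this scalar would be the per-episode reward passed to the policy update.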

Experimental Results

Research Questions

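RQ1: Can RLEIF and Evol-Instruct-based data augmentation improve the mathematical reasoning of open-source LLMs beyond baseline open-source models?
RQ2: How does WizardMath perform on GSM8k and MATH compared with closed-source and other open-source models?
RQ3: How does scaling model size (7B, 13B, 70B) under the RLEIF framework affect performance on GSM8k and MATH?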

Key Findings

Model	Parameters	GSM8k pass@1 (gain)	MATH pass@1 (gain)
WizardMath	7B	54.9 (+3.3)	10.7 (+7.7)
WizardMath	13B	63.9 (+35.2)	14.0 (+10.1)
WizardMath	70B	81.6 (+24.8)	22.7 (+9.2)

WizardMath 70B reaches 81.6 pass@1 on GSM8k, a +24.8 improvement over the baseline's 56.8.
WizardMath 70B reaches 22.7 pass@1 on MATH, a +9.2 improvement over the baseline's 13.5.
WizardMath 13B reaches 63.9 pass@1 on GSM8k, a +35.2 improvement over the baseline's 28.7.
WizardMath 13B reaches 14.0 pass@1 on MATH, a +10.1 improvement over the baseline's 3.9.
WizardMath 7B reaches 54.9 pass@1 on GSM8k, a +3.3 improvement over the baseline's 51.6.
WizardMath 7B reaches 10.7 pass@1 on MATH, reported as a +7.7 improvement over the baseline's 2.9 (the rounded figures as listed differ by 7.8).
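The gains listed above are plain differences between each score and its baseline, so they can be rechecked with a short script. The sketch below copies the scores, baselines, and reported gains from this review as given; the `RESULTS` layout and helper names are illustrative choices, not something from the paper.

```python
# (model size, benchmark) -> (score, baseline, reported gain), as quoted in this review
RESULTS = {
    ("7B", "GSM8k"): (54.9, 51.6, 3.3),
    ("7B", "MATH"): (10.7, 2.9, 7.7),
    ("13B", "GSM8k"): (63.9, 28.7, 35.2),
    ("13B", "MATH"): (14.0, 3.9, 10.1),
    ("70B", "GSM8k"): (81.6, 56.8, 24.8),
    ("70B", "MATH"): (22.7, 13.5, 9.2),
}


def recomputed_gain(score: float, baseline: float) -> float:
    """Gain in pass@1 points, rounded to one decimal like the table."""
    return round(score - baseline, 1)


def mismatches(results):
    """Return entries whose reported gain differs from score - baseline."""
    flagged = []
    for key, (score, baseline, reported) in results.items():
        gain = recomputed_gain(score, baseline)
        if abs(gain - reported) > 1e-9:
            flagged.append((key, reported, gain))
    return flagged


for key, reported, gain in mismatches(RESULTS):
    print(key, "reported", reported, "recomputed", gain)
```

With the numbers as listed, five of the six gains reproduce exactly. Only the 7B MATH entry is flagged, which points to rounding in either the quoted baseline or the quoted gain.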


This review was generated by AI and checked by a human editor.