QUICK REVIEW

[Paper Review] WizardCoder: Empowering Code Large Language Models with Evol-Instruct

Ziyang Luo, Can Xu|arXiv (Cornell University)|Jun 14, 2023

Topic Modeling | Cited by 82

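One-Sentence Summary

WizardCoder fine-tunes a Code LLM (StarCoder) with a code-oriented version of Evol-Instruct, achieving state-of-the-art results among open-source Code LLMs and surpassing some closed-source models on HumanEval, HumanEval+, and MBPP.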

ABSTRACT

Code Large Language Models (Code LLMs), such as StarCoder, have demonstrated exceptional performance in code-related tasks. However, most existing models are solely pre-trained on extensive raw code data without instruction fine-tuning. In this paper, we introduce WizardCoder, which empowers Code LLMs with complex instruction fine-tuning, by adapting the Evol-Instruct method to the domain of code. Through comprehensive experiments on four prominent code generation benchmarks, namely HumanEval, HumanEval+, MBPP, and DS-1000, we unveil the exceptional capabilities of our model. It surpasses all other open-source Code LLMs by a substantial margin. Moreover, our model even outperforms the largest closed LLMs, Anthropic's Claude and Google's Bard, on HumanEval and HumanEval+. Our code, model weights, and data are public at https://github.com/nlpxucan/WizardLM

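Motivation and Goals

Improve the capabilities of Code LLMs through fine-grained instruction fine-tuning targeted at code tasks.
Use Evol-Instruct to generate more complex, more diverse, code-focused instruction data.
Show that this enhanced instruction fine-tuning outperforms baselines on code generation benchmarks.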

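Proposed Method

Adapt Evol-Instruct to the code domain by refining the evolution prompts and adding code-oriented constraints (debugging, time/space complexity).
Take StarCoder 15B as the base model and evolve the Code Alpaca data into roughly 78k samples.
Fine-tune StarCoder on the evolved data for 200 training steps with batch size 512, sequence length 2048, learning rate 2e-5, and fp16 precision.
Evaluate on HumanEval, HumanEval+, MBPP, and DS-1000, using greedy decoding and standard prompts.
Compare against open-source and closed-source baselines on pass@1 and DS-1000 scores.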
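
The data-evolution step can be pictured with a minimal sketch. Everything named here is an assumption made for illustration: the heuristic wording, `build_evolve_prompt`, `evolve_dataset`, and the `stub_llm` stand-in are hypothetical, and keeping the seed data plus every round's output in the final pool is also assumed rather than taken from this summary.

```python
import random
from typing import Callable, List

# Hypothetical heuristics, paraphrased for illustration. The summary only names
# debugging and time/space complexity as code-oriented constraints.
HEURISTICS: List[str] = [
    "Add one new constraint or requirement to the task.",
    "Include a piece of erroneous code that must be debugged.",
    "Require a lower time or space complexity.",
]


def build_evolve_prompt(instruction: str, heuristic: str) -> str:
    """Wrap an instruction in a (hypothetical) evolution prompt."""
    return (
        "Increase the difficulty of the programming task below.\n"
        f"Method: {heuristic}\n\n"
        f"Task: {instruction}"
    )


def evolve_dataset(
    seed: List[str],
    llm: Callable[[str], str],
    rounds: int,
    rng: random.Random,
) -> List[str]:
    """Evolve every instruction once per round and collect all rounds."""
    pool = list(seed)      # assumption: the seed data stays in the final set
    current = list(seed)
    for _ in range(rounds):
        # Each round rewrites the previous round's output with a random heuristic.
        current = [
            llm(build_evolve_prompt(inst, rng.choice(HEURISTICS)))
            for inst in current
        ]
        pool.extend(current)
    return pool


def stub_llm(prompt: str) -> str:
    """Stand-in for a real model call: returns the task with a marker appended."""
    task = prompt.rsplit("Task: ", 1)[1]
    return task + " [evolved]"


seed = ["Write a function that reverses a string.", "Sum a list of integers."]
data = evolve_dataset(seed, stub_llm, rounds=3, rng=random.Random(0))
print(len(data))  # 2 seeds * (1 + 3 rounds) = 8
```

Under these assumptions the pool grows linearly with the number of rounds. A real pipeline would likely also filter out failed or degenerate evolutions, which this sketch omits.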

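Experimental Results

Research Questions

RQ1: How does code-oriented Evol-Instruct affect Code LLM performance on standard benchmarks?
RQ2: Does WizardCoder narrow the gap to closed-source models on code generation tasks?
RQ3: How does the number of data evolution rounds affect pass@1 performance?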
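
Since the questions above are framed in terms of pass@1, a short sketch of how the metric is commonly computed may help. It uses the unbiased pass@k estimator widely used with HumanEval; the function names and the toy results are illustrative assumptions, not data from the paper. With a single greedy sample per problem, pass@1 reduces to the fraction of problems solved.

```python
from math import comb
from typing import List


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: List[List[bool]], k: int) -> float:
    """Average pass@k over problems, as a percentage."""
    scores = [pass_at_k(len(r), sum(r), k) for r in results]
    return 100.0 * sum(scores) / len(scores)


# Toy data: four problems, one greedy sample each (True = all tests passed).
greedy_results = [[True], [False], [True], [True]]
print(benchmark_pass_at_k(greedy_results, k=1))  # 75.0
```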

Key Findings

Model	Params	HumanEval pass@1 (%)	MBPP pass@1 (%)
LaMDA	137B	14.0	-
AlphaCode	1.1B	17.1	-
PaLM	540B	26.2	36.8
PaLM-Coder	540B	36.0	47.0
PaLM 2-S	-	37.6	50.0
Codex	2.5B	21.4	-
Codex	12B	28.8	-
Code-Cushman-001	-	33.5	45.9
Code-Davinci-002	-	47.0	58.1
GPT-3.5	-	48.1	-
GPT-4	-	67.0	-
LLaMA	33B	21.7	30.2
LLaMA	65B	23.7	37.7
CodeGen-Multi	16B	18.3	20.9
CodeGen-Mono	16B	29.3	35.3
CodeGeeX	13B	22.9	24.4
StarCoder	15B	33.6	43.6 *
CodeT5+	16B	30.9	-
InstructCodeT5+	16B	35.0	-
WizardCoder	15B	57.3 (+22.3)	51.8 (+8.2)

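WizardCoder sets the state of the art among open-source Code LLMs on four benchmarks (HumanEval, HumanEval+, MBPP, DS-1000).
On HumanEval, pass@1 improves by 22.3 points over the strongest open-source baseline in the table, InstructCodeT5+ (57.3 vs. 35.0).
On MBPP, pass@1 improves by 8.2 points over the StarCoder baseline (51.8 vs. 43.6).
WizardCoder outperforms Claude and Bard on HumanEval and HumanEval+ despite being a smaller model.
Three rounds of Evol-Instruct data evolution yield the highest HumanEval pass@1, and this result guides data selection.
WizardCoder shows strong DS-1000 performance on most libraries.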
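
The headline gains can be re-derived from the table with a few lines of arithmetic. The scores below are copied from the table; which rows count as open-source baselines is an assumption made here, since the table itself does not label them, and `best_baseline` and `gain` are illustrative helper names.

```python
from typing import List, Optional, Tuple

# (model, HumanEval pass@1, MBPP pass@1 or None when not reported)
Row = Tuple[str, float, Optional[float]]

# Assumption: these rows are the open-source baselines.
OPEN_SOURCE_BASELINES: List[Row] = [
    ("LLaMA 33B", 21.7, 30.2),
    ("LLaMA 65B", 23.7, 37.7),
    ("CodeGen-Multi 16B", 18.3, 20.9),
    ("CodeGen-Mono 16B", 29.3, 35.3),
    ("CodeGeeX 13B", 22.9, 24.4),
    ("StarCoder 15B", 33.6, 43.6),
    ("CodeT5+ 16B", 30.9, None),
    ("InstructCodeT5+ 16B", 35.0, None),
]
WIZARDCODER: Row = ("WizardCoder 15B", 57.3, 51.8)


def best_baseline(rows: List[Row], column: int) -> Tuple[str, float]:
    """Return (name, score) of the highest reported score in a column."""
    scored = [(row[column], row[0]) for row in rows if row[column] is not None]
    score, name = max(scored)
    return name, score


def gain(column: int) -> float:
    """WizardCoder's margin over the best baseline, in points."""
    _, baseline_score = best_baseline(OPEN_SOURCE_BASELINES, column)
    return round(WIZARDCODER[column] - baseline_score, 1)


print(best_baseline(OPEN_SOURCE_BASELINES, 1), gain(1))  # HumanEval
print(best_baseline(OPEN_SOURCE_BASELINES, 2), gain(2))  # MBPP
```

Under that grouping, the strongest HumanEval baseline is InstructCodeT5+ and the strongest MBPP baseline is StarCoder, which reproduces the +22.3 and +8.2 margins shown in the WizardCoder row.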


This review was generated by AI and reviewed by human editors.