QUICK REVIEW

[论文解读] Magicoder: Empowering Code Generation with OSS-Instruct

Yuxiang Wei, Zhe Wang|arXiv (Cornell University)|Dec 4, 2023

Topic Modeling被引用 23

一句话总结

Magicoder 是一系列开源代码语言模型 (7B)，使用 OSS-Instruct（75k 数据）和 Evol-Instruct 进行训练，以提升在 Python、多语言和数据科学任务上的代码生成能力，在基准测试中与更大规模的模型甚至 ChatGPT 相抗衡，且有时超越它们。

ABSTRACT

We introduce Magicoder, a series of fully open-source (code, weights, and data) Large Language Models (LLMs) for code that significantly closes the gap with top code models while having no more than 7B parameters. Magicoder models are trained on 75K synthetic instruction data using OSS-Instruct, a novel approach to enlightening LLMs with open-source code snippets to generate diverse instruction data for code. Our main motivation is to mitigate the inherent bias of the synthetic data generated by LLMs through the wealth of open-source references for the production of more realistic and controllable data. The orthogonality of OSS-Instruct and other data generation methods like Evol-Instruct further enables us to build an enhanced MagicoderS. Both Magicoder and MagicoderS substantially outperform state-of-the-art code models with similar or even larger sizes on a wide range of coding benchmarks. Notably, MagicoderS-CL-7B based on CodeLlama even surpasses the prominent ChatGPT on HumanEval+ (66.5 vs. 65.9 in pass@1 ). Overall, OSS-Instruct opens a new direction for crafting diverse synthetic instruction data for code using abundant open-source references.

研究动机与目标

通过以开源代码作为指令来源，减轻用于代码生成的合成指令数据中的偏差。
从真实的开源代码片段中生成高质量、多样且可控的指令微调数据。
构建并评估一系列开源代码语言模型（Magicoder 和 Magicoder-S），在相似或更小规模下超越基线。
展示 OSS-Instruct 在跨语言和领域提高解析能力的有效性，并通过发布数据和权重来促进开放研究。

提出的方法

使用 OSS-Instruct 从种子开源代码片段生成 75K 道合成编码问题及解答。
净化 OSS-Instruct 数据，去除与标准基准的重叠。
对 CodeLlama-Python-7B 进行微调以创建 Magicoder-CL，并进一步使用 Evol-Instruct 微调以创建 Magicoder-S-CL。
在 HumanEval、MBPP、MultiPL-E 和 DS-1000 上进行评估，使用 EvalPlus 扩展（HumanEval+、MBPP+）。
在语言分布上进行消融研究，并将 OSS-Instruct 与对注释-函数对的直接微调进行对比。
开源模型权重、训练数据和代码。

Figure 1: Overview of OSS-Instruct and the pass@1 results of different LLMs on HumanEval (+)

实验结果

研究问题

RQ1OSS-Instruct 是否降低数据偏差并提升代码生成的指令微调质量？
RQ2Magicoder 与 Magicoder-S 变体在多样化基准上相对于最先进的开源模型和 ChatGPT 的表现如何？
RQ3训练数据语言分布对 Python 与多语言代码生成性能有何影响？
RQ4与语义相关但更嘈杂的数据（如注释-函数对）的直接微调相比，OSS-Instruct 是否更有效？

主要发现

Magicoder-CL 相较于基线 CodeLlama-Python-7B 有所提升，并且优于同等或更小规模的其他开源模型。
Magicoder-S-CL 进一步提升，在 HumanEval+ 上超越 ChatGPT，同时在 HumanEval 上达到或超过。
Magicoder-CL 与 Magicoder-S-CL 在 7B 参数模型中达到最先进水平，并在多语言及数据科学任务（DS-1000）上表现强劲。
Magicoder-DS 与 Magicoder-S-DS（基于 DeepSeek-Coder）达到较高的 pass@1 分数，Magicoder-S-DS 在 HumanEval 上达到 76.8，并在较少微调令牌的情况下优于一些更大基线。
OSS-Instruct 数据具有多样性且与 HumanEval 的相似度低于 Self-Instruct 或相关方法，表明改进并非由于泄漏。
对注释-函数对直接微调可能降低性能，而 OSS-Instruct 数据则带来显著收益。

Figure 2: The detailed prompt design for OSS-Instruct

更好的研究，从现在开始

从论文设计到论文写作，大幅缩短您的研究时间。

无需绑定信用卡

本解读由 AI 生成，并经人工编辑审核。