QUICK REVIEW

[Paper Review] The Stack: 3 TB of permissively licensed source code

Denis Kocetkov, Raymond Li|arXiv (Cornell University)|Nov 20, 2022

Topic Modeling | Cited by 38

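One-Sentence Summary

The Stack is a 3.1 TB dataset of permissively licensed source code spanning 30 programming languages. Near-deduplication improves code model performance, and deduplicated, decontaminated permissively licensed data can match or even exceed previously reported text-to-code results.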

ABSTRACT

Large Language Models (LLMs) play an ever-increasing role in the field of Artificial Intelligence (AI)--not only for natural language processing but also for code understanding and generation. To stimulate open and responsible research on LLMs for code, we introduce The Stack, a 3.1 TB dataset consisting of permissively licensed source code in 30 programming languages. We describe how we collect the full dataset, construct a permissively licensed subset, present a data governance plan, discuss limitations, and show promising results on text2code benchmarks by training 350M-parameter decoders on different Python subsets. We find that (1) near-deduplicating the data significantly boosts performance across all experiments, and (2) it is possible to match previously reported HumanEval and MBPP performance using only permissively licensed data. We make the dataset available at https://hf.co/BigCode, provide a tool called "Am I in The Stack" (https://hf.co/spaces/bigcode/in-the-stack) for developers to search The Stack for copies of their code, and provide a process for code to be removed from the dataset by following the instructions at https://www.bigcode-project.org/docs/about/the-stack/.

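Motivation and Objectives

Introduce The Stack, a large permissively licensed code dataset, to support open and reproducible research on LLMs for code.
Describe the data collection, license governance, and near-deduplication methods.
Show how near-deduplication affects model performance when training on Python subsets.
Show that models trained on permissively licensed data can match or exceed previously reported text-to-code benchmark results.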

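Proposed Method

Collect GitHub repository names from GHArchive and clone 137.36M repositories (92 TB uncompressed).
Classify licenses using GHArchive data and go-license-detector to build a permissively licensed subset.
Apply near-deduplication with MinHash and locality-sensitive hashing (LSH) to reduce duplicates.
Train 350M-parameter decoder-only transformers on Python subsets with a causal language modeling objective.
Evaluate on the HumanEval and MBPP benchmarks and compare against Codex, CodeGen, and CodeParrot.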
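
The near-deduplication step above can be illustrated with a small, standard-library-only sketch of MinHash plus LSH banding. This is not the authors' pipeline: the signature length, band count, shingle size, and Jaccard threshold below are all assumed values chosen for illustration, and the function names are hypothetical.

```python
import hashlib
import re
from collections import defaultdict
from itertools import combinations

# All four values are assumptions for this sketch, not settings from the paper.
NUM_PERM = 128      # number of hash functions in a signature
BANDS = 32          # LSH bands (NUM_PERM // BANDS rows per band)
SHINGLE_SIZE = 5    # token n-gram length
THRESHOLD = 0.7     # Jaccard similarity above which files count as near-duplicates


def shingles(text, k=SHINGLE_SIZE):
    """Return the set of k-token shingles of a document."""
    tokens = re.findall(r"\w+", text)
    if len(tokens) < k:
        return {" ".join(tokens)} if tokens else set()
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def _hash(seed, item):
    """64-bit hash of `item`; the seed is passed as a BLAKE2b salt."""
    digest = hashlib.blake2b(
        item.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items, num_perm=NUM_PERM):
    """MinHash signature: the minimum hash value under each seeded hash."""
    return tuple(min(_hash(seed, it) for it in items) for seed in range(num_perm))


def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def near_duplicate_pairs(docs, bands=BANDS, threshold=THRESHOLD):
    """Find near-duplicate pairs in `docs` (a dict of name -> text).

    LSH banding proposes candidate pairs; each candidate is then
    verified with the exact Jaccard similarity of its shingle sets.
    """
    rows = NUM_PERM // bands
    shingle_sets = {name: shingles(text) for name, text in docs.items()}
    buckets = defaultdict(set)
    for name, sh in shingle_sets.items():
        if not sh:
            continue  # skip empty documents
        sig = minhash_signature(sh)
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(name)
    candidates = set()
    for names in buckets.values():
        candidates.update(combinations(sorted(names), 2))
    return sorted(
        pair for pair in candidates
        if jaccard(shingle_sets[pair[0]], shingle_sets[pair[1]]) >= threshold
    )
```

Calling `near_duplicate_pairs({"a.py": text_a, "b.py": text_b})` returns the pairs whose shingle sets overlap beyond the threshold; a real pipeline would then keep one file per cluster of near-duplicates. At dataset scale a dedicated library would normally replace this pure-Python version.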

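Experimental Results

Research Questions

RQ1: What are the size and composition of The Stack, and how does restricting to permissive licenses affect the amount of available data?
RQ2: Does near-deduplication improve code generation performance on text-to-code tasks?
RQ3: Given deduplication, can permissively licensed data reproduce or exceed existing text-to-code benchmark results?
RQ4: What governance model lets developers opt out of inclusion in The Stack?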

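Key Findings

Near-deduplication significantly improved performance in all experiments, across every license configuration of the dataset.
A 350M-parameter model trained on near-deduplicated permissively licensed data matches Codex and CodeGen on the HumanEval and MBPP benchmarks; trained on near-deduplicated all-license data, it exceeds them.
The near-deduplicated all-license dataset scores higher on pass@100 than its non-deduplicated counterpart (HumanEval: 44.00% vs 36.67%; MBPP: 61.00% vs 53.59%).
Near-deduplicated permissively licensed data reaches a pass@100 of 37.00% on HumanEval and 54.69% on MBPP, close to or above the CodeGen results.
In this study, removing contaminated data had only a limited effect on performance.
Training on CodeParrot's data yielded worse results than training on The Stack on both benchmarks.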
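
For readers unfamiliar with the pass@100 figures above: pass@k is usually computed with the unbiased estimator introduced alongside HumanEval, pass@k = 1 - C(n-c, k) / C(n, k), where n samples are generated per problem and c of them pass the unit tests. The sketch below assumes that estimator; this summary does not state the sampling settings used, so the function names and the numbers in the usage note are purely illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for a single problem.

    n: samples generated for the problem
    c: samples that passed all unit tests
    k: sample budget being evaluated (k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any draw of k contains a passing one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k):
    """Average pass@k over problems; `results` is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For example, `mean_pass_at_k([(200, 3), (200, 0)], k=100)` averages the per-problem estimates for one problem with 3 passing samples out of 200 and one with none. The benchmark score is this mean over all problems, reported as a percentage.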

This review was generated by AI and reviewed by human editors.