QUICK REVIEW

[Paper Review] The Stack: 3 TB of permissively licensed source code

Denis Kocetkov, Raymond Li|arXiv (Cornell University)|Nov 20, 2022

Topic Modeling | Citations: 38

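One-sentence summary

The Stack is a 3.1 TB dataset of permissively licensed source code spanning 30 programming languages, and near-deduplicating it improves the performance of code models. When near-deduplicated and decontaminated, permissively licensed data can match, and in some cases exceed, previously reported text-to-code results.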

ABSTRACT

Large Language Models (LLMs) play an ever-increasing role in the field of Artificial Intelligence (AI)--not only for natural language processing but also for code understanding and generation. To stimulate open and responsible research on LLMs for code, we introduce The Stack, a 3.1 TB dataset consisting of permissively licensed source code in 30 programming languages. We describe how we collect the full dataset, construct a permissively licensed subset, present a data governance plan, discuss limitations, and show promising results on text2code benchmarks by training 350M-parameter decoders on different Python subsets. We find that (1) near-deduplicating the data significantly boosts performance across all experiments, and (2) it is possible to match previously reported HumanEval and MBPP performance using only permissively licensed data. We make the dataset available at https://hf.co/BigCode, provide a tool called "Am I in The Stack" (https://hf.co/spaces/bigcode/in-the-stack) for developers to search The Stack for copies of their code, and provide a process for code to be removed from the dataset by following the instructions at https://www.bigcode-project.org/docs/about/the-stack/.

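Motivation and objectives

Introduce The Stack, a large dataset of permissively licensed source code intended to promote open and reproducible research on code LLMs.
Describe how the data was collected, how licensing is governed, and how near-deduplication is performed.
Demonstrate the effect of near-deduplication on model performance using Python subsets.
Show that models trained on permissively licensed data can match or exceed previously reported results on text-to-code benchmarks.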

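Proposed method

Collect GitHub repositories from GHArchive and clone 137.36M repositories (92 TB of uncompressed data).
Classify licenses using GHArchive data and go-license-detector, and build a permissively licensed subset.
Apply near-deduplication based on MinHash and Locality Sensitive Hashing to reduce duplicates.
Train 350M-parameter decoder-only transformers with a causal language modeling objective on the Python subsets.
Evaluate on the HumanEval and MBPP benchmarks and compare against Codex, CodeGen, and CodeParrot.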
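
The near-deduplication step can be illustrated with a minimal sketch of MinHash signatures combined with banded Locality Sensitive Hashing. This is not the authors' pipeline: the signature length, band layout, shingle size, similarity threshold, and all function names below are illustrative assumptions, since the review does not state the values actually used.

```python
import hashlib
import re
from collections import defaultdict
from itertools import combinations

# All parameters below are assumptions chosen for illustration only.
NUM_PERM = 128             # number of hash functions per signature
BANDS = 32                 # LSH bands
ROWS = NUM_PERM // BANDS   # rows per band (4)
SHINGLE = 5                # tokens per shingle


def shingles(text, k=SHINGLE):
    """Split text into word-like tokens and return the set of k-token shingles."""
    tokens = re.findall(r"\w+", text)
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def _hash(seed, item):
    """Seeded 64-bit hash; each seed acts as one MinHash 'permutation'."""
    digest = hashlib.blake2b(
        item.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash(text):
    """MinHash signature: the minimum hash over all shingles, per seed."""
    items = shingles(text)
    return tuple(min(_hash(seed, s) for s in items) for seed in range(NUM_PERM))


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def near_duplicate_pairs(docs, threshold=0.85):
    """Return sorted pairs of document names whose estimated similarity >= threshold."""
    sigs = {name: minhash(text) for name, text in docs.items()}
    buckets = defaultdict(set)
    for name, sig in sigs.items():
        for b in range(BANDS):
            # Documents sharing an identical band land in the same bucket.
            buckets[(b, sig[b * ROWS:(b + 1) * ROWS])].add(name)
    candidates = set()
    for names in buckets.values():
        candidates.update(combinations(sorted(names), 2))
    return sorted(
        pair for pair in candidates
        if estimated_jaccard(sigs[pair[0]], sigs[pair[1]]) >= threshold
    )


docs = {
    "a.py": "def add(x, y):\n    return x + y\n",
    "b.py": "def add(x,y):\n\n    return x+y\n",   # same tokens, different spacing
    "c.py": "import os\nprint(os.getcwd())\n",
}
print(near_duplicate_pairs(docs))  # [('a.py', 'b.py')]
```

Banding keeps the comparison from being quadratic: only documents that collide in at least one band are compared, and one file from each near-duplicate pair or cluster would then be dropped.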

Research questions

What are the size and composition of The Stack, and how does permissive licensing affect the amount of usable data?
Does near-deduplication improve code generation performance on text-to-code tasks?
Can permissively licensed data, with deduplication, reproduce or exceed existing text-to-code benchmark results?
What governance model enables developers to opt out of inclusion in The Stack?

Key findings

Near-deduplication substantially boosts performance across all experiments for both all-license and permissive-license datasets.
A 350M-parameter model trained on near-deduplicated, permissively licensed data matches the reported Codex and CodeGen results on HumanEval and MBPP, and a model trained on the near-deduplicated all-license data surpasses them.
The all-license dataset with near-deduplication achieves higher pass@100 than its non-deduplicated counterpart (HumanEval: 44.00% vs 36.67%; MBPP: 61.00% vs 53.59%).
Permissive-license data with near-deduplication yields pass@100 of 37.00% on HumanEval and 54.69% on MBPP, approaching or exceeding CodeGen results.
Removing contaminated data had limited impact on performance in this study.
CodeParrot training data underperforms relative to The Stack on both benchmarks.
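
Decontamination can be pictured as dropping training files that contain benchmark text. The review does not describe the procedure that was used, so the sketch below is only one plausible approach, assuming whitespace-insensitive, case-insensitive substring matching; the function names, the `min_chars` cutoff, and the sample strings are all hypothetical.

```python
import re


def normalize(text):
    """Collapse whitespace and lowercase so formatting differences do not hide a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def decontaminate(files, benchmark_snippets, min_chars=20):
    """Return only the files that contain none of the benchmark snippets.

    Snippets shorter than min_chars (an assumed cutoff) are ignored to
    avoid discarding files over trivially short matches.
    """
    needles = [normalize(s) for s in benchmark_snippets]
    needles = [n for n in needles if len(n) >= min_chars]
    kept = {}
    for path, content in files.items():
        haystack = normalize(content)
        if not any(needle in haystack for needle in needles):
            kept[path] = content
    return kept


# Hypothetical benchmark prompt and training files, for illustration only.
benchmark_prompts = ["Return the sum of all even numbers in the list."]
files = {
    "leaky.py": (
        'def f(xs):\n    """Return the sum of all\n    even numbers in the list."""\n'
        "    return sum(x for x in xs if x % 2 == 0)\n"
    ),
    "clean.py": "def g(xs):\n    return max(xs)\n",
}
print(sorted(decontaminate(files, benchmark_prompts)))  # ['clean.py']
```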

Experimental results

HumanEval results, pass@k (%)
Model or training data	Filtering	pass@1	pass@10	pass@100
Codex (300M)	None	13.17	20.17	36.27
CodeGen (350M)	None	12.76	23.11	35.19
Python all-license	None	13.11	21.77	36.67
Python all-license	Near-dedup	16.60	27.82	44.00
Python all-license	Near-dedup + Decontamination	17.34	27.64	45.52
Python permissive-license	None	10.99	15.94	27.21
Python permissive-license	Near-dedup	13.94	22.36	37.00
Python permissive-license	Near-dedup + Decontamination	12.89	22.26	36.01
CodeParrot	Near-dedup	11.23	18.16	30.37
CodeParrot	Near-dedup + Decontamination	21.82	37.55	58.28
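
For readers unfamiliar with the metric, pass@k is commonly computed with the unbiased estimator introduced alongside HumanEval: generate n samples per problem, count the c samples that pass the unit tests, and estimate the probability that at least one of k drawn samples passes. The review does not say how the numbers above were computed, so the sketch below assumes that estimator; the function names and sample counts are illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem: 1 - C(n - c, k) / C(n, k).

    n: samples generated, c: samples that passed, k: evaluation budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical per-problem outcomes: (samples generated, samples passing).
results = [(10, 3), (10, 0), (10, 10)]
for k in (1, 5, 10):
    print(f"pass@{k} = {100 * mean_pass_at_k(results, k):.2f}%")
```

Multiplying the mean by 100 gives percentages on the same scale as the table.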

This review was generated by AI and checked by a human editor.