QUICK REVIEW

[Paper Review] DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation

Yuhang Lai, Chengxi Li|arXiv (Cornell University)|Nov 18, 2022

Software Engineering Research|Citations: 31

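One-line summary

DS-1000 is a benchmark of a thousand natural data science coding problems spanning seven Python libraries, with execution-based multi-criteria evaluation and a defense against memorization. It benchmarks Codex-002 and other models and shows that ample room for improvement remains.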

ABSTRACT

We introduce DS-1000, a code generation benchmark with a thousand data science problems spanning seven Python libraries, such as NumPy and Pandas. Compared to prior works, DS-1000 incorporates three core features. First, our problems reflect diverse, realistic, and practical use cases since we collected them from StackOverflow. Second, our automatic evaluation is highly specific (reliable) -- across all Codex-002-predicted solutions that our evaluation accept, only 1.8% of them are incorrect; we achieve this with multi-criteria metrics, checking both functional correctness by running test cases and surface-form constraints by restricting API usages or keywords. Finally, we proactively defend against memorization by slightly modifying our problems to be different from the original StackOverflow source; consequently, models cannot answer them correctly by memorizing the solutions from pre-training. The current best public system (Codex-002) achieves 43.3% accuracy, leaving ample room for improvement. We release our benchmark at https://ds1000-code-gen.github.io.

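Motivation and objectives

Introduce DS-1000, a thousand real-world data science coding problems sourced from StackOverflow.
Provide reliable execution-based evaluation that checks both functional correctness and surface form.
Defend against memorization by perturbing the problems and their reference solutions.
Evaluate state-of-the-art code models to establish baselines and identify areas for improvement.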

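Proposed method

Curate 1000 problems from StackOverflow spanning seven libraries (NumPy, Pandas, TensorFlow, PyTorch, SciPy, Scikit-learn, Matplotlib);
Rewrite the problems and reference solutions so that they are executable and unambiguous;
Implement multi-criteria evaluation that combines test cases for functional correctness with surface-form constraints;
Perturb the problems to counter memorization (surface, semantic, and difficult rewrites);
Assess and calibrate quality by measuring the false-positive and false-negative rates of the automatic metric;
Benchmark Codex-002, CodeGen, and InCoder using both the Left-to-right Completion and Insertion (infilling) formats.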
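The multi-criteria evaluation step above can be illustrated with a minimal sketch. It is not the DS-1000 harness: the function names, the "no explicit loops" rule, and the toy problem are all assumptions made for illustration. It only shows the idea that a prediction is accepted when it both passes the test cases and satisfies a surface-form constraint.

```python
import ast

# Hypothetical surface-form rule: the solution must not use explicit loops.
BANNED_NODES = (ast.For, ast.While)


def violates_surface_form(code: str) -> bool:
    """Return True if the code contains a banned syntax node or does not parse."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return True
    return any(isinstance(node, BANNED_NODES) for node in ast.walk(tree))


def passes_tests(code: str, test_cases: list) -> bool:
    """Run the code once per test case and compare `result` to the expected value."""
    for inputs, expected in test_cases:
        namespace = dict(inputs)  # fresh namespace for every test case
        try:
            exec(code, namespace)
        except Exception:
            return False
        if namespace.get("result") != expected:
            return False
    return True


def accept(code: str, test_cases: list) -> bool:
    """A prediction is accepted only if it meets both criteria."""
    return (not violates_surface_form(code)) and passes_tests(code, test_cases)


# Toy problem (not taken from DS-1000): sum the list `xs` without an explicit loop.
TEST_CASES = [({"xs": [1, 2, 3]}, 6), ({"xs": []}, 0)]
VECTORIZED = "result = sum(xs)"
LOOPED = "result = 0\nfor x in xs:\n    result += x"

print(accept(VECTORIZED, TEST_CASES))  # True: correct output, no loop
print(accept(LOOPED, TEST_CASES))      # False: correct output, but uses a loop
```

The looped candidate produces the right values and would pass a purely functional check; only the surface-form criterion rejects it. A real harness would also run untrusted predictions in a sandbox rather than calling `exec` directly.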

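Experimental results

Research questions

RQ1: Can a data science code generation benchmark be built that reflects the natural intent and context of real-world problems?
RQ2: Can multi-criteria, execution-based evaluation reliably measure code generation quality on data science tasks?
RQ3: How well do large code models (e.g., Codex-002) solve DS-1000 problems, and how does memorization affect their performance?
RQ4: Does the Insertion (infilling) format improve model performance on data science code generation tasks?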

Key findings

Format	Model	Pandas	NumPy	Matplotlib	Scikit-learn	SciPy	TensorFlow	PyTorch	Overall
Left-to-right Completion	Codex-002	26.5	43.1	57.0	44.8	31.8	39.3	41.8	39.2
Left-to-right Completion	Codex-001	9.4	26.6	41.8	18.5	15.0	17.2	9.7	20.2
Left-to-right Completion	Codex-Cushman	7.9	21.8	40.7	18.0	11.3	12.2	12.4	18.1
Insertion	Codex-002	30.1	46.5	57.0*	53.7	34.8	53.4	47.7	43.3
Insertion	InCoder-6B	2.9	4.6	28.3*	3.1	3.1	7.8	3.2	7.5
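
To make the format comparison in the table concrete, the short sketch below computes the per-library difference between the Insertion and Left-to-right Completion rows for Codex-002. The numbers are copied from the table; the variable names are illustrative only.

```python
# Codex-002 scores (%) copied from the table above.
LIBRARIES = ["Pandas", "NumPy", "Matplotlib", "Scikit-learn", "SciPy", "TensorFlow", "PyTorch"]
COMPLETION = [26.5, 43.1, 57.0, 44.8, 31.8, 39.3, 41.8]
INSERTION = [30.1, 46.5, 57.0, 53.7, 34.8, 53.4, 47.7]

# Gain of Insertion over Completion, in percentage points.
gains = {
    lib: round(ins - comp, 1)
    for lib, comp, ins in zip(LIBRARIES, COMPLETION, INSERTION)
}
overall_gain = round(43.3 - 39.2, 1)

for lib, gain in gains.items():
    print(f"{lib:<13} {gain:+.1f}")
print(f"{'Overall':<13} {overall_gain:+.1f}")
```

By this arithmetic the gain ranges from 0.0 points (Matplotlib) to 14.1 points (TensorFlow), with an overall gain of 4.1 points.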

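DS-1000 contains 1000 problems spanning seven libraries, derived from 451 underlying StackOverflow problems, with an average of 1.6 test cases per problem.
The automatic evaluation achieves a low sample-level false-positive rate of 1.8% among the predictions it accepts, which indicates that it is reliable.
Codex-002 with Insertion achieves the highest average pass@1 on DS-1000 at 43.3%, well ahead of the other models, yet still leaving ample room for improvement.
Perturbing the problems to counter memorization lowers performance, which suggests that part of the models' earlier success was influenced by memorization.
The Insertion format is generally more accurate than the Completion format, showing the benefit of infilling capability.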
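
The findings are reported as pass@1. As a hedged illustration, the sketch below uses the unbiased pass@k estimator that is widely used for code generation benchmarks; this review does not state exactly how DS-1000 computes its score, so treating it as this estimator is an assumption, and the per-problem counts are made up.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical (n, c) counts for three problems; not DS-1000 data.
PROBLEM_COUNTS = [(40, 10), (40, 40), (40, 0)]

per_problem = [pass_at_k(n, c, k=1) for n, c in PROBLEM_COUNTS]
benchmark_score = sum(per_problem) / len(per_problem)

print(per_problem)       # for k=1 each value equals c / n
print(benchmark_score)   # mean over problems
```

For k = 1 the estimator reduces to the fraction of correct samples per problem, and the benchmark-level figure is the mean of those fractions over all problems.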

This review was generated by AI and checked by a human editor.