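[Paper Explained] DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation
DS-1000 is a benchmark of 1,000 natural data science coding problems, collected from StackOverflow and spanning seven Python libraries, with multi-criteria, execution-based evaluation and defenses against memorization. It benchmarks Codex-002 and other models and shows that considerable room for improvement remains.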
We introduce DS-1000, a code generation benchmark with a thousand data science problems spanning seven Python libraries, such as NumPy and Pandas. Compared to prior works, DS-1000 incorporates three core features. First, our problems reflect diverse, realistic, and practical use cases since we collected them from StackOverflow. Second, our automatic evaluation is highly specific (reliable) -- across all Codex-002-predicted solutions that our evaluation accepts, only 1.8% of them are incorrect; we achieve this with multi-criteria metrics, checking both functional correctness by running test cases and surface-form constraints by restricting API usages or keywords. Finally, we proactively defend against memorization by slightly modifying our problems to be different from the original StackOverflow source; consequently, models cannot answer them correctly by memorizing the solutions from pre-training. The current best public system (Codex-002) achieves 43.3% accuracy, leaving ample room for improvement. We release our benchmark at https://ds1000-code-gen.github.io.
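Motivation and Goals
- Introduce DS-1000: one thousand real-world data science coding problems collected from StackOverflow.
- Provide a reliable, execution-based evaluation that covers both functional correctness and surface-form checks.
- Defend against memorization by perturbing the problems and their reference solutions.
- Evaluate state-of-the-art code models to establish baselines and identify areas for improvement.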
Proposed Method
- curate 1000 problems from StackOverflow across seven libraries (NumPy, Pandas, TensorFlow, PyTorch, SciPy, Scikit-learn, Matplotlib);
- rewrite problems and reference solutions to ensure executability and unambiguity;
- implement multi-criteria evaluation with test cases for functional correctness and surface-form constraints;
- perturb problems to defend against memorization (surface, semantic, and difficult rewrites);
- review problem quality and calibrate the automatic metric by measuring its false discovery and false omission rates;
- benchmark Codex-002, CodeGen, and InCoder using both Left-to-right Completion and Insertion (infilling) formats.
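The multi-criteria evaluation in the list above can be illustrated with a minimal sketch. It is not the benchmark's actual harness: the function names (`passes_tests`, `violates_surface_constraints`, `evaluate`), the `solution` entry point, and the toy problem are all assumptions made for illustration. The idea it shows is that a prediction is accepted only if it passes every test case and also respects the surface-form constraints, here modeled as a set of banned keywords.

```python
import io
import tokenize


def passes_tests(candidate_src, test_cases, func_name="solution"):
    """Functional check: run the candidate and compare its output on every test case."""
    namespace = {}
    try:
        # NOTE: a real harness should run untrusted code in a sandbox, not a bare exec.
        exec(candidate_src, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in test_cases)
    except Exception:
        return False


def violates_surface_constraints(candidate_src, banned_names):
    """Surface-form check: True if a banned keyword or name appears as a code token."""
    try:
        tokens = tokenize.generate_tokens(io.StringIO(candidate_src).readline)
        # Keywords such as `for` are reported as NAME tokens; comments and strings are not.
        return any(t.type == tokenize.NAME and t.string in banned_names for t in tokens)
    except (tokenize.TokenError, SyntaxError):
        return True  # code that cannot be tokenized is rejected


def evaluate(candidate_src, test_cases, banned_names=frozenset()):
    """Accept a prediction only if it satisfies both criteria."""
    if not passes_tests(candidate_src, test_cases):
        return False
    return not violates_surface_constraints(candidate_src, banned_names)


# Hypothetical problem: "sum a list without writing an explicit loop".
tests = [(([1, 2, 3],), 6), (([],), 0)]
banned = {"for", "while"}

with_loop = "def solution(xs):\n    total = 0\n    for x in xs:\n        total += x\n    return total\n"
without_loop = "def solution(xs):\n    return sum(xs)\n"
wrong_answer = "def solution(xs):\n    return 0\n"

results = {
    "with_loop": evaluate(with_loop, tests, banned),
    "without_loop": evaluate(without_loop, tests, banned),
    "wrong_answer": evaluate(wrong_answer, tests, banned),
}
```

In this toy setup the loop-based candidate computes the right answer but is still rejected, which is the kind of case that test cases alone would let through.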
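Experimental Results
Research Questions
- RQ1: How can a data science code generation benchmark capture the natural intents and contexts of real-world problems?
- RQ2: Can multi-criteria, execution-based evaluation reliably measure code generation quality on data science tasks?
- RQ3: How well do large code models such as Codex-002 perform on DS-1000 problems, and how does memorization affect their performance?
- RQ4: Does the insertion format improve model performance on data science code generation tasks?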
Key Findings
| Format | Model | Pandas | NumPy | Matplotlib | Scikit-learn | SciPy | TensorFlow | PyTorch | Overall |
|---|---|---|---|---|---|---|---|---|---|
| Left-to-right Completion | Codex-002 | 26.5 | 43.1 | 57.0 | 44.8 | 31.8 | 39.3 | 41.8 | 39.2 |
| Left-to-right Completion | Codex-001 | 9.4 | 26.6 | 41.8 | 18.5 | 15.0 | 17.2 | 9.7 | 20.2 |
| Left-to-right Completion | Codex-Cushman | 7.9 | 21.8 | 40.7 | 18.0 | 11.3 | 12.2 | 12.4 | 18.1 |
| Insertion | Codex-002 | 30.1 | 46.5 | 57.0* | 53.7 | 34.8 | 53.4 | 47.7 | 43.3 |
| Insertion | InCoder-6B | 2.9 | 4.6 | 28.3* | 3.1 | 3.1 | 7.8 | 3.2 | 7.5 |
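As a reading aid for the table above, the gap between the two formats for Codex-002 can be computed directly from its two rows. The snippet below only redoes arithmetic on the reported numbers; the variable names are illustrative and nothing here comes from the benchmark's code.

```python
# Codex-002 accuracies (%) copied from the table above.
columns = ["Pandas", "NumPy", "Matplotlib", "Scikit-learn",
           "SciPy", "TensorFlow", "PyTorch", "Overall"]
completion = [26.5, 43.1, 57.0, 44.8, 31.8, 39.3, 41.8, 39.2]
insertion = [30.1, 46.5, 57.0, 53.7, 34.8, 53.4, 47.7, 43.3]

# Gain of the Insertion format over Left-to-right Completion, in percentage points.
gains = {
    name: round(ins - comp, 1)
    for name, comp, ins in zip(columns, completion, insertion)
}

# Library with the largest gain (the "Overall" column is excluded).
per_library = {name: gain for name, gain in gains.items() if name != "Overall"}
largest = max(per_library, key=per_library.get)

for name, gain in gains.items():
    print(f"{name:>12}: {gain:+.1f}")
```

On these numbers the gain is largest for TensorFlow (14.1 points) and zero for Matplotlib, whose starred Insertion entry equals the Completion score.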
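- DS-1000 contains 1,000 problems across seven libraries, derived from 451 underlying StackOverflow problems, with an average of 1.6 test cases per problem.
- The automatic evaluation is reliable: among the Codex-002 predictions it accepts, only 1.8% are actually incorrect.
- Codex-002 in the Insertion format achieves the best overall accuracy on DS-1000, 43.3%, which still leaves ample room for improvement.
- Performance drops when problems are perturbed to defend against memorization, suggesting that part of the models' earlier success was due to memorization.
- The Insertion format generally yields higher accuracy than the Completion format, highlighting the benefit of infilling capability.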
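The difference between the two prompt formats can be sketched as follows. This is a simplified illustration, not DS-1000's actual prompt template: the helper names, the `[insert]` placeholder, and the toy problem are assumptions. The point is only that a left-to-right model conditions on the code before the gap, while an infilling model also sees the code after it.

```python
# Hypothetical problem context: code before and after the part the model must write.
PREFIX = "values = [3, 1, 2]\n"
SUFFIX = "print(result)\n"


def completion_prompt(prefix, suffix):
    """Left-to-right Completion: the model only conditions on the left context."""
    return prefix


def insertion_prompt(prefix, suffix, hole="[insert]"):
    """Insertion (infilling): the model sees both sides of the gap it must fill."""
    return prefix + hole + "\n" + suffix


def assemble(prefix, generated, suffix):
    """Splice the generated code between the contexts to get a runnable program."""
    return prefix + generated + suffix


# Stand-in for a model output; a real system would sample this from the model.
generated = "result = sorted(values)\n"
program = assemble(PREFIX, generated, SUFFIX)

namespace = {}
exec(program, namespace)  # runs the spliced program; it prints the sorted list
```

In the insertion prompt the trailing `print(result)` tells the model which variable the solution must define, information that a pure left-to-right prompt in this sketch does not carry.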
This summary was generated by AI and reviewed by human editors.