QUICK REVIEW

[Paper Review] Training and Evaluating a Jupyter Notebook Data Science Assistant

Shubham Chandel, Colin B. Clement|arXiv (Cornell University)|Jan 30, 2022

Software Engineering Research | Cited by 20

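One-Sentence Summary

This paper trains JuPyT5 on publicly available Jupyter notebooks and introduces DSP, an executable benchmark of data science problems. It shows that JuPyT5 solves up to 77.5% of DSP tasks within 100 attempts, and it analyzes training strategies, data composition, and how evaluation on DSP compares with HumanEval/MBPP.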

ABSTRACT

We study the feasibility of a Data Science assistant powered by a sequence-to-sequence transformer by training a new model JuPyT5 on all publicly available Jupyter Notebook GitHub repositories and developing a new metric: Data Science Problems (DSP). DSP is a collection of 1119 problems curated from 306 pedagogical notebooks with 92 dataset dependencies, natural language and Markdown problem descriptions, and assert-based unit tests. These notebooks were designed to test university students' mastery of various Python implementations of Math and Data Science, and we now leverage them to study the ability of JuPyT5 to understand and pass the tests. We analyze the content of DSP, validate its quality, and we find that given 100 sampling attempts JuPyT5 is able to solve 77.5% of the DSP problems. We further present various ablation and statistical analyses and compare DSP to other recent natural language to code benchmarks.

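Research Motivation and Goals

Motivate the need for an education-oriented code generation assistant for data science that is evaluated on task-based problems.
Create DSP, a large, executable, notebook-based benchmark of data science problems with unit tests.
Train JuPyT5 on diverse notebook data using a cell-infilling pretraining objective.
Evaluate how training data composition and context size affect problem-solving performance.
Compare DSP results with existing code-writing benchmarks (HumanEval/MBPP), and analyze limitations and deployment considerations.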

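Proposed Method

Train JuPyT5 (350M parameters) with a cell-infilling objective derived from Jupyter notebook cells.
Use control tokens to distinguish Markdown from code content and to mark target types (e.g., <markdown>, <code>, <function>, <class>, <import>).
Pretrain on a Markdown-focused subset versus the full dataset to study the effect of more readable, Markdown-rich code data.
Evaluate on DSP by replacing the solution cell of each notebook problem with a model sample and running the unit tests; measure pass@k over multiple attempts.
Compare JuPyT5's performance on DSP with its performance on the standard benchmarks HumanEval and MBPP.
Use ablation studies over training choices (context cells C=1 versus C=3, cell infilling that includes the subsequent grading cell) to identify what contributes to success.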
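
The cell-infilling setup above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the `Cell` class, the `make_infilling_example` helper, and the `<mask>` sentinel are hypothetical names, and the exact serialization JuPyT5 uses is not given in this summary. Only the `<markdown>` and `<code>` control tokens, the context size C, and the option of showing the following grading cell come from the text.

```python
from dataclasses import dataclass
from typing import List, Tuple

MASK = "<mask>"  # hypothetical sentinel; the real mask token is not given here


@dataclass
class Cell:
    kind: str    # "markdown" or "code"
    source: str


def render(cell: Cell) -> str:
    """Prefix a cell with a control token naming its type."""
    return f"<{cell.kind}> {cell.source}"


def make_infilling_example(cells: List[Cell], target: int, context: int = 3,
                           include_next: bool = True) -> Tuple[str, str]:
    """Build (model_input, model_target) for one masked cell.

    Up to `context` cells before the target are kept. When `include_next`
    is true, the cell after the target (e.g. a grading cell) is kept too.
    """
    start = max(0, target - context)
    before = [render(c) for c in cells[start:target]]
    has_next = include_next and target + 1 < len(cells)
    after = [render(cells[target + 1])] if has_next else []
    masked = f"<{cells[target].kind}> {MASK}"
    return "\n".join(before + [masked] + after), cells[target].source


# Toy notebook: a prompt cell, a solution cell, and an assert-based grading cell.
notebook = [
    Cell("markdown", "Write `square(x)` that returns x squared."),
    Cell("code", "def square(x):\n    return x * x"),
    Cell("code", "assert square(3) == 9"),
]
model_input, model_target = make_infilling_example(notebook, target=1, context=1)
```

Here the solution cell is masked, one preceding cell serves as context (C=1), and the grading cell stays visible. Passing `include_next=False` hides the grading cell, which corresponds to the setting without test visibility.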

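Experimental Results

Research Questions

RQ1: Can a data-science-oriented transformer effectively solve practical problems in Jupyter notebooks that include unit tests?
RQ2: Does the cell-infilling pretraining objective improve problem solving on DSP compared with baseline context modeling?
RQ3: How do data composition (Markdown-rich versus all notebooks) and look-ahead visibility of the tests affect performance on DSP?
RQ4: How does JuPyT5's performance on DSP compare with established code benchmarks such as HumanEval and MBPP?
RQ5: What deployment considerations arise when using DSP-style evaluation in educational or commercial settings?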

Key Findings

Model	pass@1	pass@10	pass@50	pass@100
C=1 Baseline	6.5%	16.5%	22.7%	25.3%
C=1 MD Focused	7.1%	17.3%	26.2%	27.8%
C=1 Cell Infilling	22.3%	53.5%	65.0%	67.9%
C=3 Baseline	11.2%	25.6%	34.4%	37.9%
C=3 MD Focused	11.2%	28.4%	40.6%	43.9%
C=3 Cell Infilling	33.4%	63.5%	73.9%	77.5%
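
The pass@k columns in the table can be read through the following sketch. It assumes the unbiased pass@k estimator introduced with HumanEval (sample n candidates per problem, count the c that pass, and estimate the chance that at least one of k draws passes); this summary does not state which estimator the paper uses. `passes_tests` is a hypothetical helper that runs code with `exec` and no sandboxing, so it is suitable only for trusted toy inputs.

```python
import math


def passes_tests(candidate: str, tests: str) -> bool:
    """Run candidate code, then assert-based tests, in a fresh namespace."""
    namespace: dict = {}
    try:
        exec(candidate, namespace)  # stands in for the replaced solution cell
        exec(tests, namespace)      # stands in for the grading cell
    except Exception:
        # Syntax errors, runtime errors, and failed asserts all count as a miss.
        return False
    return True


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k from n samples of which c are correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


# Toy problem: four sampled candidates, two of which are correct.
tests = "assert square(3) == 9\nassert square(2) == 4"
candidates = [
    "def square(x):\n    return x + x",
    "def square(x):\n    return x * x",
    "def square(x):\n    return x ** 2",
    "def square(x) return x",  # syntax error
]
num_correct = sum(passes_tests(c, tests) for c in candidates)
estimate_at_1 = pass_at_k(len(candidates), num_correct, k=1)  # 0.5
```

A benchmark-level score would average this per-problem estimate over all problems.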

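JuPyT5 solves 77.5% of DSP problems given 100 sampling attempts.
Cell infilling with visibility of the subsequent grading cell yields a large improvement, reaching 67.9% pass@100 at C=1 and 77.5% pass@100 at C=3.
Training on the Markdown-focused subset gives modest gains over training on the full dataset (comparing the MD Focused and Baseline rows, pass@100 rises from 25.3% to 27.8% at C=1 and from 37.9% to 43.9% at C=3).
A larger context window (C=3 versus C=1) gives a consistent improvement at every k, and the improvement is pronounced when the next cell is included (test visibility).
JuPyT5 outperforms a much larger 68B-parameter model on MBPP and approaches Codex's HumanEval performance when docstrings are rewritten as Markdown; however, Codex generally remains stronger on HumanEval, which highlights format sensitivity and data-domain effects.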
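
The size of each effect can be checked directly against the table. The sketch below copies the reported pass@k values and computes absolute gains in percentage points; the `RESULTS` layout and the `gain` helper are illustrative names, not part of the paper.

```python
# pass@k values (%) copied from the table, ordered k = 1, 10, 50, 100.
KS = [1, 10, 50, 100]
RESULTS = {
    (1, "Baseline"): [6.5, 16.5, 22.7, 25.3],
    (1, "MD Focused"): [7.1, 17.3, 26.2, 27.8],
    (1, "Cell Infilling"): [22.3, 53.5, 65.0, 67.9],
    (3, "Baseline"): [11.2, 25.6, 34.4, 37.9],
    (3, "MD Focused"): [11.2, 28.4, 40.6, 43.9],
    (3, "Cell Infilling"): [33.4, 63.5, 73.9, 77.5],
}


def gain(context: int, variant: str, k: int, reference: str = "Baseline") -> float:
    """Absolute gain in percentage points of `variant` over `reference`."""
    i = KS.index(k)
    return round(RESULTS[(context, variant)][i] - RESULTS[(context, reference)][i], 1)


md_gain_c1 = gain(1, "MD Focused", 100)          # 2.5 points
md_gain_c3 = gain(3, "MD Focused", 100)          # 6.0 points
infill_gain_c1 = gain(1, "Cell Infilling", 100)  # 42.6 points
infill_gain_c3 = gain(3, "Cell Infilling", 100)  # 39.6 points
```

At pass@100, cell infilling adds roughly 40 points over the baseline at either context size, while the Markdown-focused subset adds 2.5 to 6.0 points.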


This review was generated by AI and checked by a human editor.