QUICK REVIEW

[Paper Review] Learning and Evaluating Contextual Embedding of Source Code

Aditya Kanade, Petros Maniatis|arXiv (Cornell University)|Dec 21, 2019

Software Engineering Research | Cited by 156

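One-Sentence Summary

CuBERT pre-trains contextual embeddings for Python code on a large, deduplicated GitHub corpus. Fine-tuned on several code-understanding tasks, it outperforms Word2Vec baselines, BiLSTMs, and Transformers trained from scratch, while needing less labeled data.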

ABSTRACT

Recent research has achieved impressive results on understanding and improving source code by building up on machine-learning techniques developed for natural languages. A significant advancement in natural-language understanding has come with the development of pre-trained contextual embeddings, such as BERT, which can be fine-tuned for downstream tasks with less labeled data and training budget, while achieving better accuracies. However, there is no attempt yet to obtain a high-quality contextual embedding of source code, and to evaluate it on multiple program-understanding tasks simultaneously; that is the gap that this paper aims to mitigate. Specifically, first, we curate a massive, deduplicated corpus of 7.4M Python files from GitHub, which we use to pre-train CuBERT, an open-sourced code-understanding BERT model; and, second, we create an open-sourced benchmark that comprises five classification tasks and one program-repair task, akin to code-understanding tasks proposed in the literature before. We fine-tune CuBERT on our benchmark tasks, and compare the resulting models to different variants of Word2Vec token embeddings, BiLSTM and Transformer models, as well as published state-of-the-art models, showing that CuBERT outperforms them all, even with shorter training, and with fewer labeled examples. Future work on source-code embedding can benefit from reusing our benchmark, and from comparing against CuBERT models as a strong baseline.

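Research Motivation and Goals

Motivation: improve representation learning for source code with contextual embeddings, following the example of BERT in NLP.
Build a large, deduplicated Python corpus for pre-training CuBERT.
Design a coherent multi-task benchmark for Python code that covers classification and program-repair tasks.
Evaluate CuBERT against strong baselines (Word2Vec, BiLSTM, Transformer) and published state-of-the-art methods.
Release the models and datasets to support future research and benchmarking.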

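Proposed Method

Pre-train CuBERT on a deduplicated corpus of 7.4M Python files (9.3B tokens), using Python-specific tokenization and a subword vocabulary.
Train with BERT-style masked language modeling (MLM) and Next-Sentence Prediction objectives, treating the input as line-based code in which each "sentence" is one logical line of code.
Fine-tune CuBERT on five classification tasks and one pointer-based variable-misuse localization and repair task.
Compare CuBERT with BiLSTMs that use Word2Vec embeddings, Transformers trained from scratch, and published state-of-the-art models.
Evaluate under different example lengths and fine-tuning data budgets to assess data efficiency and the effect of context size.
Provide open-source code and datasets so the benchmark can be reused.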
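
The tokenization and MLM steps above can be illustrated with a small sketch. It is not CuBERT's actual pipeline: it lexes code with Python's standard `tokenize` module, skips the subword vocabulary entirely, and assumes the conventional BERT masking rate of 15%. The helper names `code_tokens` and `mask_tokens` are hypothetical, and the usual BERT refinement of sometimes keeping or randomizing a selected token is left out for brevity.

```python
import io
import random
import tokenize

MASK = "[MASK]"  # placeholder symbol in the style of BERT (illustrative)

# Token types that carry no text of interest for this sketch.
SKIPPED = {
    tokenize.ENDMARKER, tokenize.NL, tokenize.NEWLINE,
    tokenize.INDENT, tokenize.DEDENT, tokenize.COMMENT,
}


def code_tokens(source):
    """Lex Python source with the standard tokenizer and return token strings."""
    reader = io.StringIO(source).readline
    return [tok.string for tok in tokenize.generate_tokens(reader)
            if tok.type not in SKIPPED]


def mask_tokens(tokens, rate=0.15, seed=0):
    """Hide a random subset of tokens; return (masked_tokens, labels).

    labels maps each masked position to the original token, which is
    what a masked-language-model objective asks the model to predict.
    The 15% rate is an assumption borrowed from standard BERT practice.
    """
    rng = random.Random(seed)
    n_mask = min(len(tokens), max(1, round(len(tokens) * rate)))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    labels = {i: tokens[i] for i in positions}
    return masked, labels


if __name__ == "__main__":
    snippet = "def add(a, b):\n    return a + b\n"
    toks = code_tokens(snippet)
    masked, labels = mask_tokens(toks)
    print(toks)
    print(masked)
    print(labels)
```

Restoring the tokens stored in `labels` at their positions recovers the original sequence, which is a quick sanity check for any masking routine of this kind.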

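Experimental Results

Research Questions

RQ1: Do contextual embeddings pre-trained on unlabeled code improve source-code analysis?
RQ2: Does fine-tuning a pre-trained Transformer-based model bring more benefit than training one from scratch?
RQ3: How does CuBERT's performance scale when task-specific labeled data is limited?
RQ4: How does context size (example length) affect CuBERT's performance on code tasks?
RQ5: How does CuBERT compare with prior state-of-the-art methods on complex tasks such as variable-misuse localization and repair?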

Key Findings

Model	Variable Misuse	Operator	Operand	Docstring	Exception
BiLSTM From scratch	76.29%	83.65%	88.07%	76.01%	52.79%
CBOW ns	80.33%	86.82%	89.80%	89.08%	67.01%
CuBERT 2 epochs	94.04%	89.90%	92.20%	97.21%	61.04%
CuBERT 10 epochs	95.14%	92.15%	93.62%	98.08%	77.97%
CuBERT 20 epochs	95.21%	92.46%	93.36%	98.09%	79.12%
Transformer 100 epochs	78.28%	76.55%	87.83%	91.02%	49.56%
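
To read the table as per-task differences, the rows can be loaded into a small Python structure. This is only a convenience sketch: the accuracies are copied from the table above and rounded to two decimals, and the helper names `gains` and `best_model_per_task` are hypothetical.

```python
TASKS = ["Variable Misuse", "Operator", "Operand", "Docstring", "Exception"]

# Accuracy (%) per model, in the same column order as TASKS.
ACCURACY = {
    "BiLSTM From scratch":    [76.29, 83.65, 88.07, 76.01, 52.79],
    "CBOW ns":                [80.33, 86.82, 89.80, 89.08, 67.01],
    "CuBERT 2 epochs":        [94.04, 89.90, 92.20, 97.21, 61.04],
    "CuBERT 10 epochs":       [95.14, 92.15, 93.62, 98.08, 77.97],
    "CuBERT 20 epochs":       [95.21, 92.46, 93.36, 98.09, 79.12],
    "Transformer 100 epochs": [78.28, 76.55, 87.83, 91.02, 49.56],
}


def gains(model, baseline):
    """Accuracy difference (percentage points) of model over baseline, per task."""
    return {
        task: round(m - b, 2)
        for task, m, b in zip(TASKS, ACCURACY[model], ACCURACY[baseline])
    }


def best_model_per_task():
    """Name of the highest-scoring row for each task column."""
    return {
        task: max(ACCURACY, key=lambda name: ACCURACY[name][i])
        for i, task in enumerate(TASKS)
    }


if __name__ == "__main__":
    print(gains("CuBERT 20 epochs", "Transformer 100 epochs"))
    print(gains("CuBERT 2 epochs", "CBOW ns"))
    print(best_model_per_task())
```

On these rows, the 2-epoch CuBERT model trails CBOW ns on the Exception task by about 6 points, while the 10- and 20-epoch models lead every column.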

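On every classification task, CuBERT consistently outperforms BiLSTMs that use the best Word2Vec embeddings, with test-accuracy gains ranging from 3.2% to 14.7%.
CuBERT reaches strong results with only 2 to 20 fine-tuning epochs, approaching or exceeding baselines trained on the full data.
Fine-tuning CuBERT on 33% to 100% of the task data yields performance that is competitive with, or better than, baselines trained on the full data.
CuBERT significantly outperforms state-of-the-art models on the variable-misuse localization and repair task.
Compared with a Transformer trained from scratch, CuBERT (pre-training plus fine-tuning) achieves markedly higher accuracy, which indicates the value of pre-training for code representations.
The authors release open-source models and benchmarks to support future research.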
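
For readers unfamiliar with the variable-misuse task mentioned above, the following toy sketch shows what a single labeled example might look like. It is an illustration only and not the dataset construction used in the paper: it uses the standard `ast` module to replace one variable read in a function with a different in-scope name and records where the bug is and which name repairs it. The function name `inject_variable_misuse` and the `target` fields are hypothetical, and `ast.unparse` requires Python 3.9 or later.

```python
import ast
import random


def inject_variable_misuse(source, seed=0):
    """Swap one variable read for another in-scope name.

    Returns (buggy_source, target). target records the bug location in
    the parsed source plus the wrong and correct names; it is None when
    no function or fewer than two candidate variables are found.
    """
    tree = ast.parse(source)
    func = next((n for n in ast.walk(tree)
                 if isinstance(n, ast.FunctionDef)), None)
    if func is None:
        return source, None

    params = [a.arg for a in func.args.args]
    names = [n for n in ast.walk(func) if isinstance(n, ast.Name)]
    # Candidate variables: parameters and names assigned inside the function.
    variables = sorted(set(params) |
                       {n.id for n in names if isinstance(n.ctx, ast.Store)})
    reads = [n for n in names
             if isinstance(n.ctx, ast.Load) and n.id in variables]
    if not reads or len(variables) < 2:
        return source, None

    rng = random.Random(seed)
    victim = rng.choice(reads)          # the read that becomes the bug
    correct = victim.id
    wrong = rng.choice([v for v in variables if v != correct])
    victim.id = wrong                   # mutate the tree in place

    target = {"line": victim.lineno, "col": victim.col_offset,
              "wrong": wrong, "correct": correct}
    return ast.unparse(tree), target


if __name__ == "__main__":
    clean = ("def area(width, height):\n"
             "    total = width * height\n"
             "    return total\n")
    buggy, target = inject_variable_misuse(clean)
    print(buggy)
    print(target)
```

A localization-and-repair model would be asked to point at the position stored in `target` and at an occurrence of the correct name elsewhere in the same snippet.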

This review was generated by AI and checked by human editors.