QUICK REVIEW

[Paper Review] Maybe Deep Neural Networks are the Best Choice for Modeling Source Code

Rafael-Michael Karampatsis, Charles Sutton|arXiv (Cornell University)|Mar 13, 2019

Software Engineering Research|References: 63|Cited by: 44

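ONE-SENTENCE SUMMARY

The paper proposes an open-vocabulary neural language model for source code that uses subword units learned with byte pair encoding (BPE), achieves state-of-the-art results on Java, C, and Python, and adapts quickly to new projects.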

ABSTRACT

Statistical language modeling techniques have successfully been applied to source code, yielding a variety of new software development tools, such as tools for code suggestion and improving readability. A major issue with these techniques is that code introduces new vocabulary at a far higher rate than natural language, as new identifier names proliferate. But traditional language models limit the vocabulary to a fixed set of common words. For code, this strong assumption has been shown to have a significant negative effect on predictive performance. But the open vocabulary version of the neural network language models for code have not been introduced in the literature. We present a new open-vocabulary neural language model for code that is not limited to a fixed vocabulary of identifier names. We employ a segmentation into subword units, subsequences of tokens chosen based on a compression criterion, following previous work in machine translation. Our network achieves best in class performance, outperforming even the state-of-the-art methods of Hellendoorn and Devanbu that are designed specifically to model code. Furthermore, we present a simple method for dynamically adapting the model to a new test project, resulting in increased performance. We showcase our methodology on code corpora in three different languages of over a billion tokens each, hundreds of times larger than in previous work. To our knowledge, this is the largest neural language model for code that has been reported.

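MOTIVATION AND OBJECTIVES

Address the out-of-vocabulary (OOV) problem that code identifiers pose for language modeling.
Propose an open-vocabulary neural language model based on subword units to improve code modeling.
Demonstrate state-of-the-art predictive performance across multiple languages and large-scale datasets.
Show the model's practicality through a simple method for adapting it to new projects.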

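PROPOSED METHOD

Use a single-layer GRU neural language model over subword units learned with BPE.
Segment code tokens into subword units to obtain an open vocabulary.
Train on large-scale code corpora and evaluate with cross-entropy and mean reciprocal rank (MRR).
Provide a beam-search-like procedure that predicts the top-k complete tokens from subword sequences.
Develop a simple dynamic adaptation procedure that updates the global model with one gradient step on a new project.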
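The subword segmentation step can be illustrated with a minimal sketch of BPE in the style used in machine translation: repeatedly merge the most frequent adjacent symbol pair, then replay the learned merges on any token, including one never seen in training. The function names, the `</t>` end-of-token marker, the tie-breaking rule, and the toy identifier counts are all assumptions made for illustration. This summary does not state the paper's actual merge count or preprocessing.

```python
from collections import Counter

END = "</t>"  # hypothetical end-of-token marker, not taken from the paper


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(token_counts, num_merges):
    """Learn an ordered list of merge operations from a {token: frequency} mapping."""
    # Each token starts as a sequence of single characters plus the end marker.
    vocab = {tuple(tok) + (END,): freq for tok, freq in token_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every token is already a single symbol
        # Most frequent pair; ties are broken by the pair itself so runs are repeatable.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        vocab = {merge_pair(symbols, best): freq for symbols, freq in vocab.items()}
    return merges


def segment(token, merges):
    """Split a (possibly unseen) token into subword units by replaying the merges in order."""
    symbols = tuple(token) + (END,)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy identifier frequencies (made up for this example).
counts = {"getName": 5, "getValue": 4, "setName": 3}
merges = learn_bpe(counts, num_merges=10)
print(segment("setValue", merges))  # "setValue" never appears in `counts`
```

Because every token can fall back to single characters, no identifier is ever mapped to an unknown-word symbol. A language model trained over these units can therefore assign a probability to an identifier it has never seen.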

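EXPERIMENTAL RESULTS

Research Questions

RQ1: Can open-vocabulary neural language models (NLMs) outperform fixed-vocabulary models on code in multiple programming languages?
RQ2: Do subword units learned with BPE improve the handling of OOV identifiers in code?
RQ3: Can a large, pretrained language model of code be adapted efficiently to a new project without sacrificing generalization?
RQ4: How well does the proposed model scale with corpus size and across languages (Java, C, Python)?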

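Key Findings

The open-vocabulary NLM with subword units outperforms existing neural and n-gram language models on Java, C, and Python code.
On Java, the model reaches 3.15 bits per token with an MRR of 70.84% in cross-project evaluation, and 1.04 bits per token with an MRR of 81.16% in within-project evaluation.
The study uses corpora of over a billion tokens per language and is, to the authors' knowledge, the largest neural language model for code reported to date (1.7 billion tokens).
The subword approach makes it possible to predict OOV identifiers by drawing on statistics of subtokens that do appear in the training data.
A simple adaptation method, a single gradient step on each newly encountered sequence, updates the global model so that it better fits a new project.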
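The two metrics quoted in the findings, bits per token and MRR, can be made concrete with a short sketch of their standard definitions. The function names and the toy inputs are hypothetical, and the paper's exact evaluation protocol (for example, the length of the candidate list) is not given in this summary. For a subword model, the probability of a full token would be the product of the probabilities of its subword units, so the per-token values passed in below are assumed to be already aggregated that way.

```python
import math


def bits_per_token(token_probs):
    """Cross-entropy in bits: mean negative log2-probability assigned to each true token."""
    return -sum(math.log2(p) for p in token_probs) / len(token_probs)


def mean_reciprocal_rank(ranked_predictions, targets):
    """Average of 1/rank of the true token; a target absent from the list contributes 0."""
    total = 0.0
    for candidates, target in zip(ranked_predictions, targets):
        if target in candidates:
            total += 1.0 / (candidates.index(target) + 1)
    return total / len(targets)


# Toy data (made up): model probabilities for three true tokens, and ranked suggestions.
probs = [0.5, 0.25, 0.125]
suggestions = [["i", "j"], ["len", "size", "count"], ["foo"]]
truth = ["i", "size", "bar"]

print(bits_per_token(probs))
print(mean_reciprocal_rank(suggestions, truth))
```

Lower bits per token means the model is less surprised by the actual code. Higher MRR means the correct token tends to appear nearer the top of the suggestion list.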


This review was generated by AI and checked by a human editor.