QUICK REVIEW

[Paper Review] Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code

Rafael-Michael Karampatsis, Hlib Babii|arXiv (Cornell University)|Mar 17, 2020

Software Engineering Research | References: 82 | Cited by: 76

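One-Sentence Summary

The paper analyzes vocabulary design for source code language models, presents an open-vocabulary NLM built on byte-pair encoding (BPE), scales it to a large corpus, and outperforms state-of-the-art models on Java, C, and Python.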

ABSTRACT

Statistical language modeling techniques have successfully been applied to large source code corpora, yielding a variety of new software development tools, such as tools for code suggestion, improving readability, and API migration. A major issue with these techniques is that code introduces new vocabulary at a far higher rate than natural language, as new identifier names proliferate. Both large vocabularies and out-of-vocabulary issues severely affect Neural Language Models (NLMs) of source code, degrading their performance and rendering them unable to scale. In this paper, we address this issue by: 1) studying how various modelling choices impact the resulting vocabulary on a large-scale corpus of 13,362 projects; 2) presenting an open vocabulary source code NLM that can scale to such a corpus, 100 times larger than in previous work; and 3) showing that such models outperform the state of the art on three distinct code corpora (Java, C, Python). To our knowledge, these are the largest NLMs for code that have been reported. All datasets, code, and trained models used in this work are publicly available.

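Research Motivation and Objectives

Study how vocabulary design choices affect vocabulary size and out-of-vocabulary (OOV) rate in language models of code.
Develop and evaluate a large-scale open-vocabulary neural language model (NLM) for source code based on subword units.
Show that open-vocabulary models outperform the existing state of the art across several programming languages.
Assess how useful open-vocabulary models are for downstream tasks such as code completion and flagging buggy code.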

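Proposed Approach

Systematically evaluate how vocabulary design choices for source code (comments, strings, whitespace, filtering, and token splitting) affect vocabulary size.
Propose an open-vocabulary NLM that uses byte-pair encoding (BPE) to produce subword units.
Train GRU-based RNN language models on up to 13,362 projects across Java, C, and Python.
Compare the open-vocabulary NLM against n-gram LMs and closed-vocabulary NLMs on code completion.
Evaluate whether language model improvements transfer to downstream tasks such as flagging buggy code.
Describe and publicly release the datasets, code, and trained models.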
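
The BPE step can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (learn_bpe, merge_pair, segment), the "</t>" end-of-token marker, the tie-breaking rule, and the toy identifier list are all assumptions made for illustration. The idea is to start from single characters, repeatedly merge the most frequent adjacent pair of symbols, and then replay the learned merges on any token, including one never seen in training.

```python
from collections import Counter

END = "</t>"  # assumed end-of-token marker for this sketch


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(tokens, num_merges):
    """Learn up to `num_merges` merge operations from a list of tokens."""
    # Start from characters; each token is terminated by the END marker.
    vocab = Counter(tuple(tok) + (END,) for tok in tokens)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every token is already a single symbol
        # Most frequent pair; ties are broken by the pair itself (an assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(token, merges):
    """Split a token into subword units by replaying the merges in order."""
    symbols = tuple(token) + (END,)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy identifiers, made up for this example.
corpus = ["getName", "getValue", "setName", "setValue"]
merges = learn_bpe(corpus, num_merges=10)
print(segment("setNameValue", merges))  # an identifier not in the corpus
```

Because segmentation falls back to single characters whenever no merge applies, any token built from known characters can be represented, which is what makes the vocabulary open. The number of merges controls the trade-off between vocabulary size and sequence length.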

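Experimental Results

Research Questions

RQ1: How do vocabulary design choices in source code language models affect vocabulary size and OOV rate?
RQ2: Can open-vocabulary models with subword units scale to large code corpora and diverse languages?
RQ3: Do open-vocabulary models improve code completion on Java, C, and Python relative to baselines?
RQ4: Do language model improvements translate into gains on downstream software engineering tasks such as defect detection and localization?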

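Key Findings

Vocabulary design choices have a significant effect on vocabulary size and OOV rate, and the effect holds across multiple experiments.
Simple splitting alone is not enough to keep the vocabulary manageable; a more sophisticated subword method is needed.
Byte-pair encoding (BPE) yields a truly open vocabulary, reducing the OOV problem while preserving predictive power.
Open-vocabulary NLMs outperform n-gram LMs and closed-vocabulary NLMs on code completion across languages.
Open-vocabulary models also perform better on downstream tasks such as flagging buggy code, indicating that the gains transfer to downstream SE tasks.
The approach scales to training on thousands of projects, producing the largest neural models for code reported at the time.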
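
The first two findings can be made concrete with a small Python sketch that measures vocabulary size and OOV rate with and without heuristic identifier splitting. Everything here is an assumption for illustration: the helper names (split_identifier, split_all, vocab_and_oov), the camelCase/underscore regular expression, and the toy train and test identifiers are not taken from the paper.

```python
import re
from itertools import product


def split_identifier(token):
    """Split on camelCase boundaries, underscores, and digit runs (a heuristic)."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", token)
    return parts or [token]  # keep tokens such as "(" or "_" unchanged


def split_all(tokens):
    """Flatten a token list into the list of its subtokens."""
    return [part for tok in tokens for part in split_identifier(tok)]


def vocab_and_oov(train_tokens, test_tokens):
    """Return (vocabulary size, fraction of test tokens unseen in training)."""
    vocab = set(train_tokens)
    unseen = sum(1 for tok in test_tokens if tok not in vocab)
    return len(vocab), unseen / len(test_tokens)


# Toy data, made up for this example: 18 training identifiers, 4 test identifiers.
train = [a + b + c for a, b, c in product(
    ["get", "set", "has"], ["User", "Order", "Item"], ["Name", "Id"])]
test = ["getUserName", "hasOrderId", "resetItemName", "getItemCount"]

print("full tokens:", vocab_and_oov(train, test))
print("split tokens:", vocab_and_oov(split_all(train), split_all(test)))
```

On this toy data, splitting shrinks the vocabulary from 18 to 8 entries and lowers the OOV rate from 0.5 to about 0.17, but it does not reach zero: the subtokens "reset" and "Count" never appear in training. That residual OOV is the gap a subword method such as BPE is meant to close.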

This review was generated by AI and checked by a human editor.