QUICK REVIEW

[Paper Review] Maybe Deep Neural Networks are the Best Choice for Modeling Source Code

Rafael-Michael Karampatsis, Charles Sutton|arXiv (Cornell University)|Mar 13, 2019

Software Engineering Research | References: 63 | Citations: 44

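One-Sentence Summary

This paper proposes an open-vocabulary neural language model for source code that uses subword units learned with Byte Pair Encoding (BPE), achieving state-of-the-art results on Java, C, and Python while enabling rapid adaptation to new projects.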

ABSTRACT

Statistical language modeling techniques have successfully been applied to source code, yielding a variety of new software development tools, such as tools for code suggestion and improving readability. A major issue with these techniques is that code introduces new vocabulary at a far higher rate than natural language, as new identifier names proliferate. But traditional language models limit the vocabulary to a fixed set of common words. For code, this strong assumption has been shown to have a significant negative effect on predictive performance. But the open vocabulary version of the neural network language models for code have not been introduced in the literature. We present a new open-vocabulary neural language model for code that is not limited to a fixed vocabulary of identifier names. We employ a segmentation into subword units, subsequences of tokens chosen based on a compression criterion, following previous work in machine translation. Our network achieves best in class performance, outperforming even the state-of-the-art methods of Hellendoorn and Devanbu that are designed specifically to model code. Furthermore, we present a simple method for dynamically adapting the model to a new test project, resulting in increased performance. We showcase our methodology on code corpora in three different languages of over a billion tokens each, hundreds of times larger than in previous work. To our knowledge, this is the largest neural language model for code that has been reported.

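Motivation and Objectives

Address the out-of-vocabulary (OOV) problem for code identifiers in language modeling.
Propose an open-vocabulary neural language model based on subword units to improve code modeling.
Demonstrate state-of-the-art predictive performance across multiple languages and large-scale datasets.
Demonstrate practical usefulness through a simple method for adapting the model to new projects.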

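Proposed Method

Use a single-layer GRU neural language model over subword units learned with BPE.
Split code tokens into subword units to obtain an open vocabulary.
Train on large code corpora and evaluate with cross-entropy and mean reciprocal rank (MRR).
Provide a beam-search-like procedure that predicts the top-k complete tokens from subword sequences.
Develop a simple dynamic adaptation procedure that updates the global model with a single gradient step on each sequence from the new project.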
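
The subword segmentation step can be illustrated with a minimal sketch of BPE in plain Python. This is not the authors' implementation: the function names (`learn_bpe`, `merge_pair`, `segment`), the end-of-token marker `</t>`, and the four-identifier toy corpus are all assumptions made for illustration. The sketch repeatedly merges the most frequent adjacent symbol pair, then replays the learned merges on an identifier that never appeared in training.

```python
from collections import Counter

END = "</t>"  # hypothetical end-of-token marker, not taken from the paper


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(token_counts, num_merges):
    """Learn up to `num_merges` merge operations from {token: frequency}."""
    # Each token starts as a sequence of characters plus the end marker.
    words = [(tuple(tok) + (END,), freq) for tok, freq in token_counts.items()]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in words:
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:  # every token is already a single symbol
            break
        # Most frequent pair wins; ties are broken lexicographically.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        words = [(merge_pair(symbols, best), freq) for symbols, freq in words]
    return merges


def segment(token, merges):
    """Split a token (seen or unseen) into subwords by replaying the merges."""
    symbols = tuple(token) + (END,)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy identifier frequencies (illustrative only).
corpus = {"getName": 5, "setName": 4, "getValue": 3, "setValue": 2}
merges = learn_bpe(corpus, num_merges=10)

# "getSize" never occurs in the corpus, yet it still gets a segmentation.
print(segment("getSize", merges))
```

Because unseen characters simply remain single-character symbols, no identifier maps to an unknown-word token, which is what makes the vocabulary open. The number of merges controls the trade-off between vocabulary size and sequence length.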

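Experimental Results

Research Questions

RQ1: Can an open-vocabulary neural language model (NLM) outperform fixed-vocabulary models on code across multiple programming languages?
RQ2: Do subword units learned with BPE improve the handling of OOV identifiers in code?
RQ3: Can a large pretrained code LM be adapted efficiently to a new project while retaining its generality?
RQ4: How does the proposed model scale with corpus size and across different languages (Java, C, Python)?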

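Key Findings

The open-vocabulary NLM with subword units outperforms previous neural and n-gram LMs on Java, C, and Python code.
On Java, it achieves 3.15 bits per token and 70.84% MRR in cross-project evaluation, and 1.04 bits per token and 81.16% MRR in within-project evaluation.
The study uses corpora of more than one billion tokens per language; to the authors' knowledge, this makes it the largest neural LM for code reported to date (1.7 billion tokens).
The subword approach makes it possible to predict OOV identifiers by exploiting subtoken statistics that occur in the training data.
A simple adaptation method fits the model to a new project by updating the global model with one gradient step for each sequence encountered.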
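
The two metrics behind these figures, cross-entropy in bits per token and mean reciprocal rank (MRR), can be computed as in the following sketch. The function names and the toy inputs are assumptions for illustration, not the paper's evaluation code. For a subword model, the probability of a full token is assumed here to be the product of the probabilities of its subwords, so the per-token probabilities below would already be those products.

```python
import math


def bits_per_token(probabilities):
    """Average negative log2 probability the model gave to each true token."""
    return -sum(math.log2(p) for p in probabilities) / len(probabilities)


def mean_reciprocal_rank(ranked_suggestions, targets):
    """Mean of 1/rank of the true token in each ranked list (0 if absent)."""
    total = 0.0
    for suggestions, target in zip(ranked_suggestions, targets):
        if target in suggestions:
            total += 1.0 / (suggestions.index(target) + 1)
    return total / len(targets)


# Toy model outputs (illustrative only).
probs = [0.5, 0.25, 0.125]                      # probability of each true token
suggestions = [["a", "b"], ["b", "a"], ["c", "d"]]
targets = ["a", "a", "x"]                        # ranks: 1, 2, not suggested

print(bits_per_token(probs))                     # 2.0
print(mean_reciprocal_rank(suggestions, targets))  # 0.5
```

Lower bits per token means the model is less surprised by the actual code, while a higher MRR means the correct token tends to appear nearer the top of the suggestion list.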

This review was generated by AI and checked by a human editor.