QUICK REVIEW

[Paper Review] Maybe Deep Neural Networks are the Best Choice for Modeling Source Code

Rafael-Michael Karampatsis, Charles Sutton | arXiv (Cornell University) | 2019-03-13

Software Engineering Research | References: 63 | Citations: 44

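One-Line Summary

This paper presents an open-vocabulary neural language model for source code that uses subword units learned with byte-pair encoding (BPE), achieving state-of-the-art results across Java, C, and Python and enabling fast adaptation to new projects.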

ABSTRACT

Statistical language modeling techniques have successfully been applied to source code, yielding a variety of new software development tools, such as tools for code suggestion and improving readability. A major issue with these techniques is that code introduces new vocabulary at a far higher rate than natural language, as new identifier names proliferate. But traditional language models limit the vocabulary to a fixed set of common words. For code, this strong assumption has been shown to have a significant negative effect on predictive performance. But the open vocabulary version of the neural network language models for code have not been introduced in the literature. We present a new open-vocabulary neural language model for code that is not limited to a fixed vocabulary of identifier names. We employ a segmentation into subword units, subsequences of tokens chosen based on a compression criterion, following previous work in machine translation. Our network achieves best in class performance, outperforming even the state-of-the-art methods of Hellendoorn and Devanbu that are designed specifically to model code. Furthermore, we present a simple method for dynamically adapting the model to a new test project, resulting in increased performance. We showcase our methodology on code corpora in three different languages of over a billion tokens each, hundreds of times larger than in previous work. To our knowledge, this is the largest neural language model for code that has been reported.

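Motivation and Objectives

Address the out-of-vocabulary (OOV) problem for code identifiers in language modeling.
Propose an open-vocabulary neural language model based on subword units to improve code modeling.
Demonstrate state-of-the-art predictive performance across multiple programming languages and large datasets.
Show practical usefulness through a simple method for adapting the model to new projects.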

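Proposed Method

Use a single-layer GRU neural language model over subword units learned with BPE.
Split code tokens into subword units, yielding an open vocabulary.
Train on large code corpora and evaluate with cross-entropy and mean reciprocal rank (MRR).
Provide a beam-search-like procedure for predicting the top-k complete tokens from subword sequences.
Develop a simple dynamic adaptation procedure that updates the global model on a new project with a single gradient step per newly encountered sequence.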
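
To make the subword segmentation step concrete, here is a minimal sketch of BPE merge learning and its use on an unseen identifier. It is a toy illustration under stated assumptions, not the authors' implementation: the function names (`learn_bpe`, `merge_pair`, `segment`), the `</t>` end-of-token marker, and the tiny token counts are all hypothetical.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(token_counts, num_merges):
    """Learn an ordered list of merges from a {token: frequency} mapping."""
    # Each token starts as a sequence of characters plus an end-of-token marker
    # (the "</t>" marker is an assumption made for this sketch).
    vocab = {tuple(tok) + ("</t>",): freq for tok, freq in token_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Pick the most frequent adjacent pair; ties are broken deterministically.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        vocab = {merge_pair(symbols, best): freq for symbols, freq in vocab.items()}
    return merges


def segment(token, merges):
    """Split a (possibly unseen) token into subword units by replaying the merges in order."""
    symbols = tuple(token) + ("</t>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical identifier frequencies from a training corpus.
counts = {"getName": 10, "getValue": 8, "setName": 6, "setValue": 4}
merges = learn_bpe(counts, num_merges=20)
print(segment("getNameValue", merges))  # an identifier never seen during training
```

Because segmentation can always fall back to single characters, any identifier can be represented, which is what makes the vocabulary open.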

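Experimental Results

Research Questions

RQ1: Can an open-vocabulary neural language model (NLM) outperform fixed-vocabulary models across multiple programming languages?
RQ2: Do subword units learned with BPE improve the handling of OOV identifiers in code?
RQ3: Can a large pretrained code LM be adapted effectively to a new project while preserving generalization?
RQ4: How well does the proposed model scale with corpus size and across different languages (Java, C, Python)?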

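Key Results

The open-vocabulary NLM with subword units outperforms previous neural and n-gram LMs for code across Java, C, and Python.
On Java, it achieves 3.15 bits per token and 70.84% MRR in cross-project evaluation, and 1.04 bits per token and 81.16% MRR in within-project evaluation.
The study uses corpora of over a billion tokens per language (up to roughly 1.7 billion); to the authors' knowledge, this is the largest neural LM for code reported to date.
The subword approach makes it possible to predict OOV identifiers by exploiting subtoken statistics that occur in the training data.
The simple adaptation method fits the global model to a new project by updating it with a single gradient step on each newly encountered sequence.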
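
The two metrics quoted above, bits per token and MRR, can be sketched as follows. This is a minimal illustration with made-up probabilities and candidate lists, not the paper's evaluation code. The helper names are hypothetical, and it assumes that a full token's probability is the product of its subword probabilities, so entropy stays comparable across token-level and subword-level models.

```python
import math


def token_probability(subword_probs):
    """Probability of a full token as the product of its subword probabilities (chain rule)."""
    return math.prod(subword_probs)


def cross_entropy_bits(token_probs):
    """Average negative log2 probability assigned to the true tokens (bits per token)."""
    return -sum(math.log2(p) for p in token_probs) / len(token_probs)


def mean_reciprocal_rank(ranked_predictions, targets):
    """Mean of 1/rank of the true token in each ranked candidate list (0 if absent)."""
    total = 0.0
    for candidates, target in zip(ranked_predictions, targets):
        if target in candidates:
            total += 1.0 / (candidates.index(target) + 1)
    return total / len(targets)


# Made-up model outputs for three code tokens.
probs = [token_probability([0.5]), token_probability([0.5, 0.5]), 0.25]
print(cross_entropy_bits(probs))

suggestions = [["foo", "bar"], ["x", "y"], ["q"]]
truth = ["foo", "y", "z"]
print(mean_reciprocal_rank(suggestions, truth))
```

Lower bits per token means the model is less surprised by the actual code, while higher MRR means the correct token appears nearer the top of the suggestion list.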

This review was generated by AI and checked by a human editor.