QUICK REVIEW

[Paper Review] Structural Language Models of Code

Uri Alon, Roy Sadaka | arXiv (Cornell University) | 2019-09-30

Software Engineering Research | Citations: 44

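One-line summary

This paper presents Structural Language Modeling (SLM), which performs any-code completion by predicting abstract syntax tree (AST) nodes from multiple AST paths, and reports state-of-the-art results on Java any-code completion along with strong gains on restricted C# completion.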

ABSTRACT

We address the problem of any-code completion - generating a missing piece of source code in a given program without any restriction on the vocabulary or structure. We introduce a new approach to any-code completion that leverages the strict syntax of programming languages to model a code snippet as a tree - structural language modeling (SLM). SLM estimates the probability of the program's abstract syntax tree (AST) by decomposing it into a product of conditional probabilities over its nodes. We present a neural model that computes these conditional probabilities by considering all AST paths leading to a target node. Unlike previous techniques that have severely restricted the kinds of expressions that can be generated in this task, our approach can generate arbitrary code in any programming language. Our model significantly outperforms both seq2seq and a variety of structured approaches in generating Java and C# code. Our code, data, and trained models are available at http://github.com/tech-srl/slm-code-generation/ . An online demo is available at http://AnyCodeGen.org .

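Motivation and goals

Motivates the any-code completion problem, in which the missing code is unrestricted in both vocabulary and structure.
Proposes a structural language modeling approach that treats code as an AST and predicts it node by node.
Shows that jointly modeling source and target over AST paths improves generation quality over sequence-based and other structured baselines.
Demonstrates state-of-the-art results on Java any-code completion and strong gains on restricted C# completion.
Provides an ablation analysis to identify the components that are essential for performance.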

Proposed method

Represents a program as an AST and decomposes Pr(A_P) into a product of conditional node probabilities over a traversal of the AST.
Represents the partial tree as a set of AST paths leading to the node being predicted, and encodes each path with an LSTM over its node embeddings.
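Aggregates the path encodings through Transformer-based contextualization, together with an index-aware encoding of the root path, to predict the next node.
Uses a syntactic copy mechanism that combines copy scores over the path encodings with subtoken embeddings to predict the next AST node or subtoken.
Adds EOS nodes/tokens to generation to control the breadth and depth of the generated tree.
Trains end-to-end with cross-entropy loss using Adam and runs beam search at inference time; compares against NMT and code-structure baselines.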
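
The first two steps above (factorizing Pr(A_P) over a traversal, and describing context as AST paths that end at a target node) can be illustrated with a toy sketch. This is not the authors' implementation: the `Node` class, the helper names, and the `uniform_model` stand-in for the neural network are all hypothetical, and a pre-order traversal is assumed purely for illustration.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, Iterator, List, Optional


@dataclass(eq=False)
class Node:
    """Toy AST node (hypothetical; not the paper's data structure)."""
    label: str
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False)

    def add(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child


def preorder(root: Node) -> Iterator[Node]:
    """Yield nodes depth-first, left to right (assumed traversal order)."""
    stack = [root]
    while stack:
        node = stack.pop()
        yield node
        stack.extend(reversed(node.children))


def tree_log_prob(root: Node, cond_prob: Callable[[str, List[str]], float]) -> float:
    """log Pr(A_P) = sum over t of log Pr(a_t | a_<t) along the traversal."""
    prefix: List[str] = []
    total = 0.0
    for node in preorder(root):
        total += math.log(cond_prob(node.label, prefix))
        prefix.append(node.label)
    return total


def uniform_model(vocab_size: int) -> Callable[[str, List[str]], float]:
    """Stand-in for the neural model: ignores context, assigns 1/|V| to any label."""
    return lambda label, prefix: 1.0 / vocab_size


def ancestors(node: Optional[Node]) -> List[Node]:
    """Return [node, parent, ..., root]."""
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parent
    return chain


def path_between(src: Node, dst: Node) -> List[str]:
    """Labels on the AST path from src up to the lowest common ancestor, then down to dst."""
    up, down = ancestors(src), ancestors(dst)
    down_index = {id(n): i for i, n in enumerate(down)}
    for i, n in enumerate(up):
        if id(n) in down_index:
            j = down_index[id(n)]
            return [m.label for m in up[: i + 1] + list(reversed(down[:j]))]
    raise ValueError("nodes are not in the same tree")


# Toy tree for the condition `x > 1` inside an if statement.
root = Node("If")
cond = root.add(Node("Greater"))
name = cond.add(Node("Name"))
x_leaf = name.add(Node("x"))
literal = cond.add(Node("IntExp"))
literal.add(Node("1"))

print(path_between(x_leaf, literal))
print(tree_log_prob(root, uniform_model(10)))
```

In the model described above, each such path would be encoded by an LSTM and the encodings aggregated to produce the conditional probability; here the conditional is replaced by a context-free placeholder so the factorization itself is easy to see.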

Experimental results

Research questions

RQ1: Can any-code completion be effectively modeled by AST-path conditioned probabilities rather than flat sequences?
RQ2: Does jointly modeling source and target code as the same tree improve generation quality over encoder-decoder or production-rule based approaches?
RQ3: What is the impact of path-based representations, attention aggregation, and code-copy mechanisms on exact-match and tree-structure accuracy?
RQ4: How does SLM perform on Java any-code completion and C# restricted completion relative to strong baselines?
RQ5: What ablations reveal the contribution of components like root attention, copying, and path-based representations?

Key results

SLM achieves state-of-the-art exact-match acc@1 and acc@5 on Java any-code completion: 18.04% and 24.83% respectively, with tree@1 39.10% and tree@5 55.32%.
On Java, SLM outperforms all baselines including code2seq, seq2tree, and Transformer variants, with notable gains in acc@1 and acc@5.
In restricted C# completion, SLM reaches 37.61% acc@1 and 45.51% acc@5, and tree@1 51.10% with tree@5 59.82%, surpassing GNN→NAG and other baselines.
Ablations show joint modeling (Paths→Paths) and copy mechanisms are crucial; removing root attention or copy reduces performance significantly (e.g., No Copy severely drops metrics).
The tree@k metric indicates models often predict correct syntax even when subtokens differ, highlighting the potential for further gains through better token/name prediction.
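
To make the difference between the two metric families concrete, here is a minimal sketch of how acc@k and tree@k could be computed. It assumes that acc@k counts an exact match among the top-k predictions and that tree@k compares only tree structure while ignoring leaf token values, which is how the last result above describes it. The nested-tuple tree encoding and the function names are hypothetical, not taken from the paper's code.

```python
from typing import Callable, List, Sequence, Tuple, Union

# A tree is either a leaf token (str) or a tuple (node_type, child, child, ...).
Tree = Union[str, tuple]


def skeleton(tree: Tree) -> Tree:
    """Replace every leaf token with a placeholder, keeping node types and shape."""
    if isinstance(tree, str):
        return "<tok>"
    head, *children = tree
    return (head, *[skeleton(c) for c in children])


def hit_at_k(ranked: Sequence[Tree], gold: Tree, k: int,
             key: Callable[[Tree], Tree] = lambda t: t) -> bool:
    """True if any of the top-k predictions equals the gold tree under `key`."""
    return any(key(p) == key(gold) for p in ranked[:k])


def score(examples: List[Tuple[Sequence[Tree], Tree]], k: int) -> Tuple[float, float]:
    """Return (acc@k, tree@k) over (ranked_predictions, gold) pairs."""
    n = len(examples)
    acc = sum(hit_at_k(r, g, k) for r, g in examples) / n
    tree = sum(hit_at_k(r, g, k, key=skeleton) for r, g in examples) / n
    return acc, tree


# Gold completion `x > 1`; the top prediction has the right shape but the wrong name.
gold = ("Greater", ("Name", "x"), ("IntExp", "1"))
ranked = [
    ("Greater", ("Name", "y"), ("IntExp", "1")),
    ("Greater", ("Name", "x"), ("IntExp", "1")),
]
print(score([(ranked, gold)], k=1))  # structure matches at k=1, tokens do not
print(score([(ranked, gold)], k=2))
```

A prediction that hits under tree@k but misses under acc@k has the right syntax and the wrong identifiers or literals, which is the gap the last result attributes to token and name prediction.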


This review was generated by AI and reviewed by a human editor.