QUICK REVIEW

[Paper Review] Structural Language Models of Code

Uri Alon, Roy Sadaka | arXiv (Cornell University) | Sep 30, 2019
Software Engineering Research | Cited by 44
One-Sentence Summary

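This paper proposes structural language modeling (SLM), which performs any-code completion by predicting the nodes of a program's abstract syntax tree (AST) while conditioning on multiple AST paths. It achieves state-of-the-art results on Java any-code completion and substantial gains on restricted C# completion.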

ABSTRACT

We address the problem of any-code completion - generating a missing piece of source code in a given program without any restriction on the vocabulary or structure. We introduce a new approach to any-code completion that leverages the strict syntax of programming languages to model a code snippet as a tree - structural language modeling (SLM). SLM estimates the probability of the program's abstract syntax tree (AST) by decomposing it into a product of conditional probabilities over its nodes. We present a neural model that computes these conditional probabilities by considering all AST paths leading to a target node. Unlike previous techniques that have severely restricted the kinds of expressions that can be generated in this task, our approach can generate arbitrary code in any programming language. Our model significantly outperforms both seq2seq and a variety of structured approaches in generating Java and C# code. Our code, data, and trained models are available at http://github.com/tech-srl/slm-code-generation/ . An online demo is available at http://AnyCodeGen.org .
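
The decomposition mentioned in the abstract can be written as a chain-rule product. In this sketch, $\mathcal{A}_P$ is the AST of program $P$, and $a_t$ is the $t$-th node in a fixed traversal order; the symbol names and the indexing are assumptions made here for illustration.

```latex
\Pr\left(\mathcal{A}_P\right) \;=\; \prod_{t} \Pr\left(a_t \mid a_{<t}\right)
```

Each factor is the probability of the next node given all nodes generated so far, which is the quantity the neural model estimates from the AST paths leading to that node.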

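Motivation and Goals

  • Motivate the any-code completion problem, in which neither the vocabulary nor the structure of the generated code is restricted.
  • Propose a structural language modeling approach that treats code as an AST and predicts it node by node.
  • Show that jointly modeling source and target over AST paths yields better generation quality than sequence-based and other structured baselines.
  • Demonstrate state-of-the-art results on Java any-code completion and substantial gains on restricted C# completion.
  • Provide an ablation analysis that identifies the components critical to performance.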

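Proposed Method

  • Represent the program as an AST and factor Pr(A_P) into a product of conditional node probabilities over a traversal of the tree.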
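  • Represent the partial tree as a set of AST paths (from each leaf, and from the root, to the node being expanded), and encode each path with an LSTM over its node embeddings.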
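  • Aggregate the path encodings with Transformer-based contextualization, and predict the next node from the root-path encoding augmented with index information.
  • Predict the next AST node or subtoken with a syntactic copy mechanism that combines copy scores derived from the path encodings with subtoken embeddings.
  • Extend generation with EOS nodes and EOS tokens to control the cardinality (number of children) and depth of the generated tree.
  • Train end to end with cross-entropy loss using Adam, and decode with beam search at inference time; compare against NMT and structured code-generation baselines.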
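
The path-based representation in the list above can be illustrated with a minimal sketch. It only extracts the paths for a toy partial AST; the LSTM encoding, Transformer aggregation, and copy mechanism are not modeled. The `Node` class, the helper names, and the node labels are hypothetical and are not taken from the authors' code.

```python
class Node:
    """A toy AST node with a parent pointer (hypothetical, for illustration)."""

    def __init__(self, label):
        self.label = label
        self.parent = None
        self.children = []

    def add(self, label):
        """Create a child with the given label, attach it, and return it."""
        child = Node(label)
        child.parent = self
        self.children.append(child)
        return child


def ancestors(node):
    """Return the chain [node, parent, ..., root]."""
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parent
    return chain


def leaves(node):
    """Return all leaves under `node`, left to right."""
    if not node.children:
        return [node]
    found = []
    for child in node.children:
        found.extend(leaves(child))
    return found


def path_between(src, dst):
    """Labels on the path from src, up to the lowest common ancestor, down to dst."""
    up, down = ancestors(src), ancestors(dst)
    index_in_down = {id(n): i for i, n in enumerate(down)}
    for i, n in enumerate(up):
        if id(n) in index_in_down:
            j = index_in_down[id(n)]
            nodes = up[: i + 1] + down[:j][::-1]
            return [m.label for m in nodes]
    raise ValueError("nodes are not in the same tree")


def context_paths(root, target):
    """Return (leaf-to-target paths, root-to-target path) for a partial tree."""
    leaf_paths = [
        path_between(leaf, target) for leaf in leaves(root) if leaf is not target
    ]
    return leaf_paths, path_between(root, target)


def build_example():
    """Partial AST for `if (x > ?)` in a method; `Greater` is the node being expanded."""
    root = Node("MethodDecl")
    root.add("Name:f")
    greater = root.add("If").add("Greater")
    greater.add("Name:x")
    return root, greater


root, target = build_example()
leaf_paths, root_path = context_paths(root, target)
print(leaf_paths)
print(root_path)
```

In a full model, each of these label sequences would be embedded and encoded, and the encodings would be combined to score candidates for the next child of the node being expanded.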

Experimental Results

Research Questions

  • RQ1: Can any-code completion be effectively modeled by AST-path conditioned probabilities rather than flat sequences?
  • RQ2: Does jointly modeling source and target code as the same tree improve generation quality over encoder-decoder or production-rule based approaches?
  • RQ3: What is the impact of path-based representations, attention aggregation, and code-copy mechanisms on exact-match and tree-structure accuracy?
  • RQ4: How does SLM perform on Java any-code completion and C# restricted completion relative to strong baselines?
  • RQ5: What do ablations reveal about the contribution of components like root attention, copying, and path-based representations?

Key Findings

  • SLM achieves state-of-the-art exact-match acc@1 and acc@5 on Java any-code completion: 18.04% and 24.83% respectively, with tree@1 39.10% and tree@5 55.32%.
  • On Java, SLM outperforms all baselines including code2seq, seq2tree, and Transformer variants, with notable gains in acc@1 and acc@5.
  • In restricted C# completion, SLM reaches 37.61% acc@1 and 45.51% acc@5, and tree@1 51.10% with tree@5 59.82%, surpassing GNN→NAG and other baselines.
  • Ablations show that joint modeling (Paths→Paths) and the copy mechanism are crucial; removing root attention or copying reduces performance, and the No Copy variant shows a severe drop across metrics.
  • The tree@k metric indicates models often predict correct syntax even when subtokens differ, highlighting the potential for further gains through better token/name prediction.
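
The difference between the exact-match and tree-structure metrics reported above can be sketched as follows. This is a simplified illustration under stated assumptions: trees are nested tuples of the form `(label, child, ...)`, a leaf is `(label,)`, and the tree variant replaces every leaf label with a placeholder before comparing. The paper's exact normalization rules are not given in this review, and all function names are hypothetical.

```python
def strip_leaves(tree):
    """Replace every leaf label with a placeholder, keeping the tree shape."""
    label, *children = tree
    if not children:
        return ("<LEAF>",)
    return (label, *[strip_leaves(c) for c in children])


def hit_at_k(predictions, gold, k, tree_only=False):
    """True if any of the top-k predictions matches gold (optionally ignoring leaves)."""
    norm = strip_leaves if tree_only else (lambda t: t)
    return any(norm(p) == norm(gold) for p in predictions[:k])


def accuracy_at_k(examples, k, tree_only=False):
    """Percentage of (predictions, gold) pairs with a hit in the top k."""
    hits = sum(hit_at_k(preds, gold, k, tree_only) for preds, gold in examples)
    return 100.0 * hits / len(examples)


# Toy data: `x > 1` is the gold completion; the top prediction gets the
# structure right but the variable name wrong.
gold_a = ("Greater", ("Name", ("x",)), ("IntLit", ("1",)))
near_miss = ("Greater", ("Name", ("y",)), ("IntLit", ("1",)))
gold_b = ("Name", ("x",))

examples = [
    ([near_miss, gold_a], gold_a),  # exact match only at rank 2
    ([gold_b], gold_b),             # exact match at rank 1
]

print(accuracy_at_k(examples, k=1))                  # exact match, top 1
print(accuracy_at_k(examples, k=1, tree_only=True))  # structure only, top 1
print(accuracy_at_k(examples, k=2))                  # exact match, top 2
```

In this toy data the near miss counts as a hit for the structure-only metric but not for exact match, which mirrors the gap between tree@k and acc@k that the last finding describes.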

This review was generated by AI and reviewed by human editors.