QUICK REVIEW

[Paper Review] Tree2Tree Neural Translation Model for Learning Source Code Changes.

Saikat Chakraborty, Miltiadis Allamanis|arXiv (Cornell University)|Sep 30, 2018

Software Engineering Research | References: 54 | Cited by: 21

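One-Sentence Summary

This paper proposes CODIT, a tree-based neural machine translation model that learns code change patterns from real-world patches by modeling code as abstract syntax trees (ASTs); trained on more than 24k code changes and evaluated on 5k patches, CODIT suggests patches effectively and fixes 25 of 80 bugs in Defects4J, which indicates that it can learn syntactically correct, reusable code transformations.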

ABSTRACT

The way developers edit day-to-day code tends to be repetitive, often using existing code elements. Many researchers have tried to automate repetitive code changes by learning from specific change templates which are applied to limited scope. The advancement of deep neural networks and the availability of vast open-source evolutionary data open up the possibility of automatically learning those templates from the wild. However, deep neural network based modeling for code changes and code in general introduces some specific problems that need specific attention from the research community. For instance, compared to natural language, source code vocabulary can be significantly larger. Further, good changes in code do not break its syntactic structure. Thus, deploying state-of-the-art neural network models without adapting the methods to the source code domain yields sub-optimal results. To this end, we propose a novel tree-based neural network system to model source code changes and learn code change patterns from the wild. Specifically, we propose a tree-based neural machine translation model to learn the probability distribution of changes in code. We realize our model with a change suggestion engine, CODIT, and train the model with more than 24k real-world changes and evaluate it on 5k patches. Our evaluation shows the effectiveness of CODIT in learning and suggesting patches. CODIT can also learn specific bug fix patterns from bug fixing patches and can fix 25 bugs out of 80 bugs in Defects4J.

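Motivation and Goals

Automate repetitive code changes by learning patterns from real-world code evolution data.
Overcome the limitations of standard neural networks in code translation by modeling code as trees, so that syntactic structure is preserved.
Build a system that learns and suggests code changes that are both syntactically and semantically valid, without relying on predefined templates.
Assess how well the model generalizes to real-world bug-fixing scenarios using a standard benchmark.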

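Proposed Method

The model uses a tree-to-tree neural machine translation framework that maps the AST of the original code to the AST of the modified code, preserving syntactic structure.
It follows an encoder-decoder architecture and uses tree-structured long short-term memory (Tree-LSTM) networks to encode and decode ASTs.
It is trained end to end on more than 24k real-world code changes extracted from open-source repositories.
A change suggestion engine, CODIT, applies the trained model during code editing to generate patch suggestions.
The model learns from evolution data and captures common refactoring and bug-fixing patterns without hard-coded templates.
Evaluation is carried out on 5k real-world patches, together with 80 bug-fixing examples from Defects4J.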
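
To make the tree-to-tree framing above concrete, here is a minimal sketch of how a single code edit could be turned into a (source tree, target tree) pair. It is an illustration only: it uses Python's standard `ast` module as a stand-in parser, and the helper names `to_tree`, `change_pair`, and `diff_labels` are hypothetical. This review does not describe CODIT's actual AST format, grammar, or preprocessing, so none of this should be read as the authors' implementation.

```python
import ast


def to_tree(node):
    """Convert a Python ast node into a nested (label, children) tuple."""
    label = type(node).__name__
    children = [to_tree(child) for child in ast.iter_child_nodes(node)]
    return (label, children)


def change_pair(before_src, after_src):
    """Build one (source tree, target tree) example from a code edit.

    ast.parse raises SyntaxError on invalid input, so both sides of the
    pair are guaranteed to be syntactically valid code.
    """
    return to_tree(ast.parse(before_src)), to_tree(ast.parse(after_src))


def diff_labels(old, new, path=()):
    """Yield (path, old_label, new_label) for each point where the trees differ."""
    old_label, old_children = old
    new_label, new_children = new
    if old_label != new_label or len(old_children) != len(new_children):
        # Treat the whole subtree as changed and stop descending.
        yield (path, old_label, new_label)
        return
    for index, (o, n) in enumerate(zip(old_children, new_children)):
        yield from diff_labels(o, n, path + (index,))


# Hypothetical one-token edit: an addition is changed to a subtraction.
src_tree, tgt_tree = change_pair("x = a + b", "x = a - b")
changes = list(diff_labels(src_tree, tgt_tree))
print(changes)
```

For this toy edit, the only differing node is the binary operator (`Add` becomes `Sub`), while the rest of the tree is shared. A tree-based translation model is trained on many such pairs, which is why its output stays within the space of well-formed trees rather than arbitrary token sequences.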

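Experimental Results

Research Questions

RQ1: Can a neural model learn meaningful, syntactically correct code change patterns directly from real-world code evolution data?
RQ2: How well does a tree-based neural translation model generalize to unseen code changes compared with template-based or sequence-based models?
RQ3: To what extent can the model learn and apply specific bug-fix patterns from the patches in a standard benchmark?
RQ4: Can the model suggest accurate, syntactically correct code changes in real-world development scenarios?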

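Key Findings

By modeling code as abstract syntax trees, CODIT learns and suggests code changes with high syntactic fidelity.
The model performs well on a held-out test set of 5k patches, indicating that it generalizes to unseen changes.
CODIT fixes 25 of 80 bugs in the Defects4J benchmark, showing its effectiveness in real-world bug-fixing scenarios.
The tree-based architecture preserves code structure better than standard sequence-based neural models.
The model learns reusable change patterns from raw code evolution data, with no manually maintained templates.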
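
The findings above are stated without a metric definition, so the following is only a sketch of how such numbers are commonly computed. It assumes exact-match top-k accuracy for patch suggestion, which is a frequent choice for this kind of task but is not confirmed by this review. The function name `top_k_accuracy` and the toy data are hypothetical; only the 25-of-80 Defects4J figure comes from the source.

```python
def top_k_accuracy(suggestions, references, k):
    """Fraction of examples whose reference patch is among the first k suggestions."""
    hits = sum(
        1 for candidates, reference in zip(suggestions, references)
        if reference in candidates[:k]
    )
    return hits / len(references)


# Invented toy data: ranked suggestions for three edits, plus the true patches.
suggestions = [
    ["x = a - b", "x = a * b"],
    ["return None", "return 0"],
    ["i += 1", "i -= 1"],
]
references = ["x = a - b", "return 0", "i = i + 2"]

top1 = top_k_accuracy(suggestions, references, k=1)
top2 = top_k_accuracy(suggestions, references, k=2)

# Fix rate implied by the reported Defects4J result (25 of 80 bugs).
defects4j_fix_rate = 25 / 80

print(f"top-1: {top1:.2f}, top-2: {top2:.2f}")
print(f"Defects4J fix rate: {defects4j_fix_rate:.2%}")
```

On the reported numbers, 25 of 80 corresponds to a fix rate of 31.25%.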


This review was generated by AI and reviewed by a human editor.