QUICK REVIEW

[Paper Review] Language-Agnostic Representation Learning of Source Code from Structure and Context

Daniel Zügner, Tobias Kirschstein|arXiv (Cornell University)|Mar 21, 2021

Software Engineering Research | References: 50 | Cited by: 63

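One-Sentence Summary

This paper proposes the Code Transformer, which learns jointly from source code (Context) and its abstract syntax tree (Structure) using only language-agnostic features, achieving state-of-the-art code summarization and enabling multilingual training.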

ABSTRACT

Source code (Context) and its parsed abstract syntax tree (AST; Structure) are two complementary representations of the same computer program. Traditionally, designers of machine learning models have relied predominantly either on Structure or Context. We propose a new model, which jointly learns on Context and Structure of source code. In contrast to previous approaches, our model uses only language-agnostic features, i.e., source code and features that can be computed directly from the AST. Besides obtaining state-of-the-art on monolingual code summarization on all five programming languages considered in this work, we propose the first multilingual code summarization model. We show that jointly training on non-parallel data from multiple programming languages improves results on all individual languages, where the strongest gains are on low-resource languages. Remarkably, multilingual training only from Context does not lead to the same improvements, highlighting the benefits of combining Structure and Context for representation learning on code.

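Research Motivation and Goals

Learn meaningful representations of programs by combining two complementary views: code as text (Context) and its abstract syntax tree (Structure).
Develop a language-agnostic Transformer model that integrates Context and Structure without language-specific preprocessing.
Demonstrate state-of-the-art performance on monolingual code summarization across five programming languages.
Introduce and evaluate a multilingual code summarization model trained on multiple programming languages, using a shared vocabulary and language embeddings.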

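Proposed Method

Uses a Transformer-based architecture whose attention relies on relative distances rather than absolute positions.
Computes several relational distances from the AST (shortest path, ancestor, sibling, and personalized PageRank) and feeds them into the attention mechanism.
Uses a separate key projection matrix for each relation, so that the Context and Structure contributions are summed in the attention score.
Represents each token by concatenating its token embedding, the embedding of its assigned AST node type, and the token type given by the tokenizer.
Encodes relative distances with non-trainable sinusoidal encodings, enabling structure-aware attention over the graph.
Trains with a pointer network so that the decoder can point to input positions, improving token prediction.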
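The distance computation and relation-aware attention described above can be sketched in a few lines of NumPy. This is a minimal illustration under several assumptions, not the authors' implementation: the toy AST, all function names, the sign conventions for ancestor and sibling distances, the PageRank teleport probability `alpha`, and the random weights are hypothetical. Each AST node is treated as one token, and the additional bias terms used in Transformer-XL-style relative attention are omitted.

```python
from collections import deque
import numpy as np

# Toy AST as a child -> parent map (hypothetical example, not from the paper).
# Children are ordered by node id: 0 -> [1, 2], 1 -> [3, 4].
PARENT = {0: None, 1: 0, 2: 0, 3: 1, 4: 1}

def children_of(parent):
    kids = {n: [] for n in parent}
    for child in sorted(parent):
        if parent[child] is not None:
            kids[parent[child]].append(child)
    return kids

def shortest_paths(parent):
    """All-pairs shortest path lengths on the undirected tree, via BFS."""
    kids = children_of(parent)
    dist = np.zeros((len(parent), len(parent)))
    for src in parent:
        seen, queue = {src: 0}, deque([src])
        while queue:
            u = queue.popleft()
            nbrs = kids[u] + ([parent[u]] if parent[u] is not None else [])
            for v in nbrs:
                if v not in seen:
                    seen[v] = seen[u] + 1
                    queue.append(v)
        for v, d in seen.items():
            dist[src, v] = d
    return dist

def ancestor_distances(parent):
    """dist[i, j] = +k if j is k levels above i, -k if k levels below, else 0
    (sign convention is an assumption)."""
    dist = np.zeros((len(parent), len(parent)))
    for i in parent:
        k, a = 1, parent[i]
        while a is not None:
            dist[i, a], dist[a, i] = k, -k
            k, a = k + 1, parent[a]
    return dist

def sibling_distances(parent):
    """dist[i, j] = signed offset between children of the same parent, else 0."""
    dist = np.zeros((len(parent), len(parent)))
    for sibs in children_of(parent).values():
        for a, i in enumerate(sibs):
            for b, j in enumerate(sibs):
                dist[i, j] = b - a
    return dist

def personalized_pagerank(parent, alpha=0.15):
    """Closed-form PPR: alpha * (I - (1 - alpha) * P)^-1, P = random-walk matrix."""
    n = len(parent)
    adj = np.zeros((n, n))
    for c, p in parent.items():
        if p is not None:
            adj[c, p] = adj[p, c] = 1.0
    trans = adj / adj.sum(axis=1, keepdims=True)
    return alpha * np.linalg.inv(np.eye(n) - (1 - alpha) * trans)

def sinusoidal(values, dim):
    """Fixed (non-trainable) sinusoidal encoding of scalar distances."""
    freqs = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))
    angles = values[..., None] * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

def attention_scores(x, relations, w_q, w_k, w_rel):
    """Content term plus one relation term per distance type, summed."""
    q, k = x @ w_q, x @ w_k
    scores = q @ k.T
    for name, dist in relations.items():
        # Separate key projection per relation; result has shape (n, n, d).
        rel_keys = sinusoidal(dist, w_rel[name].shape[0]) @ w_rel[name]
        scores = scores + np.einsum("id,ijd->ij", q, rel_keys)
    return scores / np.sqrt(q.shape[-1])

rng = np.random.default_rng(0)
n, d = len(PARENT), 8
relations = {
    "shortest_path": shortest_paths(PARENT),
    "ancestor": ancestor_distances(PARENT),
    "sibling": sibling_distances(PARENT),
    "ppr": personalized_pagerank(PARENT),
}
x = rng.normal(size=(n, d))                      # stand-in token embeddings
w_q, w_k = rng.normal(size=(d, d)), rng.normal(size=(d, d))
w_rel = {name: rng.normal(size=(d, d)) for name in relations}
scores = attention_scores(x, relations, w_q, w_k, w_rel)
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)    # row-wise softmax
```

Because every relation contributes an additive term to the same score matrix, dropping or adding a distance type only changes the `relations` dictionary, which is one way to set up the kind of ablation over distance measures that the review mentions.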

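Experimental Results

Research Questions

RQ1: Can a model learn from both Context (code tokens) and Structure (AST) without relying on language-specific features?
RQ2: Does joint training on multiple programming languages improve code summarization, especially for low-resource languages?
RQ3: How does including Structure, compared with Context alone, affect monolingual and multilingual code summarization?
RQ4: How do the different AST distance measures (shortest path, ancestor, sibling, personalized PageRank) contribute to performance?
RQ5: When only language-agnostic features are used, does multilingual training outperform monolingual training across diverse languages?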

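Key Findings

The Code Transformer achieves state-of-the-art code summarization in the monolingual setting on all five languages.
Multilingual training improves performance on every language, with the largest gains on low-resource languages.
Multilingual training on Context alone does not yield the same improvements as combining Structure and Context.
Ablations indicate that both Structure and Context contribute to performance, and that the pointer network improves results further.
Combining multiple AST distance measures works better than any single distance.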
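To make the pointer network's role in the ablation more concrete, the sketch below shows one common way a decoder can mix generating a token from the vocabulary with copying a token from the input. The pointer-generator style mixture, the function name `pointer_mixture`, and the toy numbers are assumptions for illustration; this review does not specify the exact formulation used in the paper.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def pointer_mixture(vocab_logits, copy_logits, input_ids, gen_logit):
    """Mix a vocabulary distribution with a copy distribution over input positions."""
    p_gen = 1.0 / (1.0 + np.exp(-gen_logit))   # probability of generating from the vocabulary
    p_vocab = softmax(vocab_logits)
    p_copy_pos = softmax(copy_logits)           # attention over input positions
    p_copy = np.zeros_like(p_vocab)
    np.add.at(p_copy, input_ids, p_copy_pos)    # scatter-add: repeated tokens accumulate
    return p_gen * p_vocab + (1.0 - p_gen) * p_copy

# Hypothetical example: vocabulary of 6 tokens, input sequence with token ids [2, 4, 2].
input_ids = np.array([2, 4, 2])
probs = pointer_mixture(
    vocab_logits=np.zeros(6),   # uniform vocabulary distribution
    copy_logits=np.zeros(3),    # uniform attention over the three input positions
    input_ids=input_ids,
    gen_logit=0.0,              # equal weight on generating and copying
)
```

In this toy setting, token 2 appears twice in the input, so it receives the most probability mass. This illustrates why pointing at input positions can help when a method name reuses identifiers that occur in the method body.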

This review was generated by AI and reviewed by human editors.