QUICK REVIEW

[Paper Review] Unified Pre-training for Program Understanding and Generation

Wasi Uddin Ahmad, Saikat Chakraborty|arXiv (Cornell University)|Mar 10, 2021

Software Engineering Research|References 46|Cited by 40

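ONE-SENTENCE SUMMARY

PLBART is a unified sequence-to-sequence model pre-trained on Java, Python, and natural-language data; it achieves state-of-the-art or competitive results on code summarization, generation, and translation, and on several discriminative program-understanding tasks.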

ABSTRACT

Code summarization and generation empower conversion between programming language (PL) and natural language (NL), while code translation avails the migration of legacy code from one PL to another. This paper introduces PLBART, a sequence-to-sequence model capable of performing a broad spectrum of program and language understanding and generation tasks. PLBART is pre-trained on an extensive collection of Java and Python functions and associated NL text via denoising autoencoding. Experiments on code summarization in the English language, code generation, and code translation in seven programming languages show that PLBART outperforms or rivals state-of-the-art models. Moreover, experiments on discriminative tasks, e.g., program repair, clone detection, and vulnerable code detection, demonstrate PLBART's effectiveness in program understanding. Furthermore, analysis reveals that PLBART learns program syntax, style (e.g., identifier naming convention), logical flow (e.g., if block inside an else block is equivalent to else if block) that are crucial to program semantics and thus excels even with limited annotations.

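MOTIVATION AND OBJECTIVES

Motivate and build a general-purpose model for program and language understanding and generation (PLUG) tasks in software engineering.
Learn transferable representations from unlabeled programming-language (PL) and natural-language (NL) data through denoising sequence-to-sequence pre-training.
Pre-train a multilingual encoder-decoder model on Java, Python, and NL data so that it can support a wide range of downstream tasks.
Show that the pre-trained model outperforms or matches task-specific baselines on generation, translation, and discriminative tasks.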

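PROPOSED METHOD

Uses a BART-style encoder-decoder Transformer with 6 encoder layers and 6 decoder layers (hidden size 768, 12 attention heads).
Pre-trains with denoising autoencoding on Java functions, Python functions, and StackOverflow natural-language text, using three noising strategies: token masking, token deletion, and token infilling.
Tokenizes with SentencePiece (50K subword units) and adds language-identifier tokens to support multilingual input.
Up-samples or down-samples data across modalities during pre-training to balance PL and NL data.
Addresses data imbalance with a mixed-language sampling scheme drawn from a multinomial distribution, and pre-trains for 100K steps in total on a multi-GPU setup.
Fine-tunes on sequence generation (summarization, generation, translation) and sequence classification tasks, feeding task-specific data with a language ID appended, and evaluates with BLEU, CodeBLEU, exact match (EM), and accuracy.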
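
The noising and sampling steps listed above can be illustrated with a small, self-contained sketch. This is not the authors' implementation: the `<mask>` symbol, the function names, the noise ratio, the Poisson rate `lam`, the smoothing exponent `alpha`, and the corpus counts are all illustrative assumptions, and tokens are plain whitespace-split strings rather than SentencePiece subwords.

```python
import math
import random

MASK = "<mask>"  # assumed placeholder symbol, not taken from the review


def sampling_probs(counts, alpha=0.3):
    """Smoothed multinomial sampling: q_i is proportional to p_i ** alpha.

    With alpha < 1, smaller corpora are sampled more often than their raw share.
    """
    total = sum(counts.values())
    weights = {name: (n / total) ** alpha for name, n in counts.items()}
    norm = sum(weights.values())
    return {name: w / norm for name, w in weights.items()}


def token_mask(tokens, ratio, rng):
    """Token masking: replace random positions with MASK; length is unchanged."""
    k = round(len(tokens) * ratio)
    picked = set(rng.sample(range(len(tokens)), k))
    return [MASK if i in picked else t for i, t in enumerate(tokens)]


def token_delete(tokens, ratio, rng):
    """Token deletion: drop random positions; the model must infer where."""
    k = round(len(tokens) * ratio)
    picked = set(rng.sample(range(len(tokens)), k))
    return [t for i, t in enumerate(tokens) if i not in picked]


def poisson(lam, rng):
    """Draw from a Poisson distribution using Knuth's multiplication method."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def token_infill(tokens, ratio, rng, lam=3.5):
    """Token infilling: replace each sampled span with a single MASK.

    A span of length 0 inserts a MASK. Simplification: a later span may
    swallow an earlier MASK, so fewer than `ratio` of tokens may be hidden.
    """
    out = list(tokens)
    budget = round(len(tokens) * ratio)
    while budget > 0:
        span = min(poisson(lam, rng), budget, len(out))
        start = rng.randrange(len(out) - span + 1)
        out[start:start + span] = [MASK]
        budget -= max(span, 1)  # always make progress, even for empty spans
    return out


# Hypothetical corpus sizes, only to show the effect of smoothing.
print(sampling_probs({"java": 900, "python": 400, "nl": 100}))

demo_rng = random.Random(0)
demo_tokens = "def add ( a , b ) : return a + b".split()
print(token_mask(demo_tokens, 0.35, demo_rng))
print(token_delete(demo_tokens, 0.35, demo_rng))
print(token_infill(demo_tokens, 0.35, demo_rng))
```

In each case the training target would be the original, uncorrupted sequence. Infilling is the hardest of the three because a single mask hides a span of unknown length, so the decoder has to predict how many tokens are missing as well as what they are.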

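EXPERIMENTAL RESULTS

Research Questions

RQ1: Can a unified PL-NL pre-trained model learn robust representations of both programming languages and natural language?
RQ2: Does denoising pre-training lead the model to capture program syntax, naming conventions, and data-flow semantics, all of which are essential to code understanding?
RQ3: How does the unified model perform on generation, translation, and discriminative programming tasks, particularly for languages with limited labeled data?
RQ4: Does pre-training on large amounts of unlabeled PL and NL data improve over encoder-only or decoder-only baselines on PLUG tasks?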

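Key Findings

PLBART outperforms or comes close to state-of-the-art baselines on code summarization, code generation, and code translation across multiple languages.
PLBART performs well on discriminative tasks such as program repair, clone detection, and vulnerability detection, which indicates solid program understanding.
The paper's analysis indicates that PLBART learns syntax and data-flow semantics during pre-training, which enables effective fine-tuning even with limited labeled data.
Qualitative analysis shows that PLBART captures programming constructs, naming conventions, and data-flow patterns that are essential to program semantics.
On Ruby, the language with the fewest training examples in the collection, PLBART shows the largest relative gain, which suggests that unified pre-training generalizes well.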

This review was generated by AI and reviewed by human editors.