QUICK REVIEW

[Paper Review] CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation

Yue Wang, Weishi Wang | arXiv (Cornell University) | Sep 2, 2021

Software Engineering Research | References: 27 | Cited by: 23

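One-Sentence Summary

CodeT5 is a unified encoder-decoder Transformer pre-trained on paired code and natural-language data; by introducing identifier-aware masking and a bimodal dual generation objective, it improves both code understanding and code generation. The model achieves state-of-the-art results on 14 CodeXGLUE tasks, significantly outperforming prior methods, with notable gains on code defect detection and code summarization.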

ABSTRACT

Pre-trained models for Natural Languages (NL) like BERT and GPT have been recently shown to transfer well to Programming Languages (PL) and largely benefit a broad set of code-related tasks. Despite their success, most current methods either rely on an encoder-only (or decoder-only) pre-training that is suboptimal for generation (resp. understanding) tasks or process the code snippet in the same way as NL, neglecting the special characteristics of PL such as token types. We present CodeT5, a unified pre-trained encoder-decoder Transformer model that better leverages the code semantics conveyed from the developer-assigned identifiers. Our model employs a unified framework to seamlessly support both code understanding and generation tasks and allows for multi-task learning. Besides, we propose a novel identifier-aware pre-training task that enables the model to distinguish which code tokens are identifiers and to recover them when they are masked. Furthermore, we propose to exploit the user-written code comments with a bimodal dual generation task for better NL-PL alignment. Comprehensive experiments show that CodeT5 significantly outperforms prior methods on understanding tasks such as code defect detection and clone detection, and generation tasks across various directions including PL-NL, NL-PL, and PL-PL. Further analysis reveals that our model can better capture semantic information from code. Our code and pre-trained models are released at https://github.com/salesforce/CodeT5.

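Research Motivation and Goals

Propose a unified encoder-decoder framework to overcome the limitations of encoder-only and decoder-only models on code-related tasks.
Improve code representations by explicitly modeling developer-assigned identifiers as a key semantic signal.
Strengthen cross-modal alignment between natural-language comments and code through a dual generation pre-training objective.
Support both code understanding and code generation within a single unified model architecture.
Improve generalization across diverse code intelligence tasks through multi-task fine-tuning with task-specific prompts.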

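Proposed Method

Uses the T5 architecture as a unified encoder-decoder framework that jointly models code understanding and generation.
Introduces a novel identifier-aware pre-training objective that masks and recovers identifiers separately from other token types.
Adopts a bimodal dual generation task that uses code-comment pairs to jointly pre-train NL→PL and PL→NL generation.
Pre-trains on CodeSearchNet plus additional C/C# code from GitHub, using both unimodal and bimodal data.
Fine-tunes on multiple CodeXGLUE tasks with task control codes as input prompts to support multi-task learning.
Uses a unified sequence-to-sequence denoising objective during pre-training to improve generalization to downstream tasks.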
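The identifier-aware objective listed above can be pictured with a small sketch. It is an illustration under stated assumptions, not the authors' implementation: the regex tokenizer, the use of Python's keyword list to decide what counts as an identifier, the `<extra_id_N>` sentinel naming (borrowed from T5 convention), and all function names are hypothetical. It shows two pieces: a per-token identifier tag, and a masking step in which every occurrence of the same identifier shares one sentinel and the target spells out the hidden names.

```python
import keyword
import re

# Hypothetical, simplified tokenizer: a real pipeline would take token types
# from a parser; a regex plus Python's keyword list is used here for brevity.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(code):
    return TOKEN_RE.findall(code)


def is_identifier(token):
    # Assumption: any name-like token that is not a reserved keyword
    # counts as a developer-assigned identifier.
    return token.isidentifier() and not keyword.iskeyword(token)


def identifier_tags(tokens):
    # Identifier tagging: one binary label per token (1 = identifier).
    return [1 if is_identifier(t) else 0 for t in tokens]


def mask_identifiers(tokens):
    # Masked identifier prediction: all occurrences of one identifier are
    # replaced by the same sentinel; the target lists each sentinel
    # followed by the name it hides, in order of first appearance.
    sentinel_of = {}
    masked = []
    for tok in tokens:
        if is_identifier(tok):
            if tok not in sentinel_of:
                sentinel_of[tok] = f"<extra_id_{len(sentinel_of)}>"
            masked.append(sentinel_of[tok])
        else:
            masked.append(tok)
    target = []
    for name, sentinel in sentinel_of.items():
        target += [sentinel, name]
    return masked, target
```

For the input `def add(a, b): return a + b`, the names `add`, `a`, and `b` are replaced by three sentinels (both uses of `a` share one), and the target sequence is `<extra_id_0> add <extra_id_1> a <extra_id_2> b`. Sharing a sentinel across occurrences forces the model to infer a name from every place it is used, which is the intuition behind treating identifiers differently from other tokens.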

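Experimental Results

Research Questions

RQ1: Can a unified encoder-decoder model outperform specialized encoder-only or decoder-only models on both code understanding and code generation tasks?
RQ2: Does masking only identifiers during pre-training significantly improve code representation learning compared with standard masking?
RQ3: Does joint pre-training on NL→PL and PL→NL generation improve cross-modal alignment and performance on code summarization and translation?
RQ4: How does multi-task fine-tuning with task-specific prompts affect the model's generalization across diverse code intelligence tasks?
RQ5: To what extent does identifier-aware pre-training strengthen the model's understanding of code semantics across multiple programming languages?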
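To make the joint NL→PL and PL→NL setup behind RQ3 concrete, the sketch below turns each comment-code pair into two sequence-to-sequence examples, one per direction. It is a minimal illustration: the function name, the tuple layout, and the `<en>` / `<python>` language tags are assumptions made for this example and may not match the authors' exact format.

```python
def dual_generation_examples(pairs, nl_tag="<en>", pl_tag="<python>"):
    """Build (source, target) training examples in both directions.

    `pairs` is an iterable of (comment, code) tuples. The language tags
    are hypothetical markers telling the model which modality to read
    and which to produce.
    """
    examples = []
    for comment, code in pairs:
        # PL -> NL: read the code, write the comment (summarization direction).
        examples.append((f"{pl_tag} {code}", f"{nl_tag} {comment}"))
        # NL -> PL: read the comment, write the code (generation direction).
        examples.append((f"{nl_tag} {comment}", f"{pl_tag} {code}"))
    return examples
```

Each bimodal pair therefore contributes twice to training, so the same model is pushed to map code to its description and the description back to code.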

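Key Findings

CodeT5 achieves state-of-the-art results on all 14 sub-tasks of the CodeXGLUE benchmark, significantly outperforming prior methods on both code understanding and code generation.
The identifier-aware pre-training objective yields significant gains on code defect detection and code clone detection, indicating stronger semantic understanding.
The bimodal dual generation pre-training task brings significant improvements on NL-PL and PL-NL tasks such as code summarization and code translation.
Multi-task fine-tuning with task control codes improves zero-shot generalization and performance across diverse code intelligence tasks.
CodeT5-base (220M parameters) performs comparably to much larger models (such as Codex, with 12 billion parameters), demonstrating high efficiency and effectiveness.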
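Multi-task fine-tuning with task control codes can be sketched as follows. This is a hedged illustration only: the prefix strings, the dictionary layout, and the function names are hypothetical, and casting defect detection as generating a short text label is an assumption of this example. The idea shown is that every task is reduced to text-to-text form, with a prefix telling the shared model which task an input belongs to.

```python
import random

# Hypothetical control codes; the exact strings used by the authors are
# not given in this review.
TASK_PREFIX = {
    "summarize": "Summarize Python:",
    "translate": "Translate Java to C#:",
    "defect": "Defect:",
}


def to_seq2seq(task, source, target):
    # Prepend the control code so a single model can tell tasks apart.
    return {"input": f"{TASK_PREFIX[task]} {source}", "target": target}


def build_mixture(datasets, seed=0):
    # Pool every task's (source, target) rows into one shuffled stream,
    # so each training batch can contain examples from several tasks.
    mixture = [
        to_seq2seq(task, source, target)
        for task, rows in datasets.items()
        for source, target in rows
    ]
    random.Random(seed).shuffle(mixture)
    return mixture
```

A classification task such as defect detection fits this interface by using a label string (for example `"true"` or `"false"`) as the target, which is what allows one encoder-decoder model to serve understanding and generation tasks alike.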

This review was generated by AI and reviewed by human editors.