QUICK REVIEW

[Paper Review] CodeBERT: A Pre-Trained Model for Programming and Natural Languages

Zhangyin Feng, Daya Guo | arXiv (Cornell University) | Feb 19, 2020

Topic Modeling | References: 27 | Cited by: 247

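One-Sentence Summary

CodeBERT is a bimodal pre-trained Transformer model trained on natural language and code data. It achieves state-of-the-art results on NL-code search and code documentation generation, and it supports zero-shot NL-PL probing.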

ABSTRACT

We present CodeBERT, a bimodal pre-trained model for programming language (PL) and natural language (NL). CodeBERT learns general-purpose representations that support downstream NL-PL applications such as natural language code search, code documentation generation, etc. We develop CodeBERT with Transformer-based neural architecture, and train it with a hybrid objective function that incorporates the pre-training task of replaced token detection, which is to detect plausible alternatives sampled from generators. This enables us to utilize both bimodal data of NL-PL pairs and unimodal data, where the former provides input tokens for model training while the latter helps to learn better generators. We evaluate CodeBERT on two NL-PL applications by fine-tuning model parameters. Results show that CodeBERT achieves state-of-the-art performance on both natural language code search and code documentation generation tasks. Furthermore, to investigate what type of knowledge is learned in CodeBERT, we construct a dataset for NL-PL probing, and evaluate in a zero-shot setting where parameters of pre-trained models are fixed. Results show that CodeBERT performs better than previous pre-trained models on NL-PL probing.

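Research Motivation and Goals

Develop a unified pre-trained model covering natural language and programming language across multiple programming languages.
Use bimodal NL-PL pairs and unimodal code data to learn robust representations.
Demonstrate effectiveness on NL-PL tasks such as natural language code search and code documentation generation.
Investigate, through NL-PL probing in a zero-shot setting, what knowledge CodeBERT captures.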

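Proposed Method

Uses a Transformer-based architecture (RoBERTa-base scale, 125M parameters).
Pre-trains with a hybrid objective that combines Masked Language Modeling (MLM) on bimodal NL-PL data with Replaced Token Detection (RTD), which makes use of unimodal data.
Represents the input as two segments (NL and code) separated by a [SEP] token, with [CLS] used for the aggregate representation.
Trains on six languages (Python, Java, JavaScript, PHP, Ruby, Go), using both bimodal NL-PL pairs and unimodal code data.
Uses generators to produce plausible alternative tokens for RTD, and trains a discriminator to distinguish original tokens from replaced ones.
Fine-tunes CodeBERT for downstream NL-PL tasks such as NL-based code search and code-to-text generation.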
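
The input layout and the RTD labeling step described above can be illustrated with a small, self-contained sketch. This is not the authors' implementation: the function names, the `rate` parameter, and the toy generator are hypothetical, and a real setup would use a trained generator and a subword tokenizer. It only shows how an NL segment and a code segment are joined with [CLS] and [SEP], and how per-token "original vs. replaced" labels for the discriminator could be derived.

```python
import random

CLS, SEP = "[CLS]", "[SEP]"

def build_input(nl_tokens, code_tokens):
    """Join the NL and code segments: [CLS] nl... [SEP] code..."""
    return [CLS] + list(nl_tokens) + [SEP] + list(code_tokens)

def pick_positions(tokens, rate, rng):
    """Choose positions to corrupt, skipping special tokens.
    `rate` is an assumed hyperparameter, not a value taken from this summary."""
    candidates = [i for i, t in enumerate(tokens) if t not in (CLS, SEP)]
    k = max(1, round(rate * len(candidates)))
    return sorted(rng.sample(candidates, k))

def rtd_example(tokens, positions, generator):
    """Replace the chosen positions with generator samples and label every token.
    Label 1 = original token, 0 = replaced token. If the generator happens
    to produce the original token, that position is labeled as original."""
    corrupted = list(tokens)
    for i in positions:
        corrupted[i] = generator(tokens, i)
    labels = [int(c == t) for c, t in zip(corrupted, tokens)]
    return corrupted, labels

def toy_generator(tokens, i):
    """Hypothetical stand-in for a learned generator: returns a fixed alternative."""
    return "min" if tokens[i] != "min" else "max"

if __name__ == "__main__":
    rng = random.Random(0)
    tokens = build_input(["return", "the", "maximum"],
                         ["def", "f", "(", "a", ",", "b", ")", ":"])
    positions = pick_positions(tokens, rate=0.25, rng=rng)
    corrupted, labels = rtd_example(tokens, positions, toy_generator)
    print(corrupted)
    print(labels)
```

The discriminator would then be trained as a binary classifier over `labels`, one prediction per input position, while MLM predicts the original token only at masked positions.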

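Experimental Results

Research Questions

RQ1: Does a bimodal pre-trained model trained on NL-PL pairs and unimodal code data improve NL-PL understanding tasks over NL-only or code-only models?
RQ2: How do MLM alone, RTD alone, and their combination affect NL-PL tasks?
RQ3: Does CodeBERT generalize across multiple programming languages for code search and code documentation generation?
RQ4: How does CodeBERT perform on NL-PL probing compared with RoBERTa and code-only pre-trained models?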

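Key Findings

After fine-tuning, CodeBERT achieves state-of-the-art results on natural language code search (CodeSearchNet), outperforming RoBERTa and code-only pre-trained models.
Pre-training with MLM+RTD (initialized from RoBERTa) gives the best retrieval performance across languages (for example, a higher overall Ma-Avg than the baselines).
In code documentation generation, the CodeBERT-based encoder obtains higher BLEU-4 scores than RoBERTa and code-only baselines, and RTD+MLM brings further gains.
NL-PL probing in the zero-shot setting shows CodeBERT outperforming RoBERTa and code-only pre-trained models on both PL and NL prediction tasks.
On a programming language not seen during pre-training (code-to-NL generation for C#), CodeBERT generalizes better than RoBERTa and some baselines, although it does not surpass the state-of-the-art code2seq.
Case studies on NL and PL probing show CodeBERT correctly predicting masked NL and PL tokens in cases where RoBERTa fails.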
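
To make the "Ma-Avg" figure in the retrieval finding concrete, here is a minimal sketch of how a macro-average over per-language scores could be computed. It assumes the per-language retrieval metric is Mean Reciprocal Rank (MRR) and that Ma-Avg is the unweighted mean across languages; the function names and the toy rankings are illustrative only and are not results from the paper.

```python
def reciprocal_rank(ranked_ids, correct_id):
    """Return 1 / rank of the correct candidate, or 0.0 if it is not in the list."""
    for rank, candidate in enumerate(ranked_ids, start=1):
        if candidate == correct_id:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(queries):
    """Average the reciprocal rank over (ranked_ids, correct_id) pairs."""
    return sum(reciprocal_rank(r, c) for r, c in queries) / len(queries)

def macro_average(per_language_scores):
    """Unweighted mean over languages, so every language counts equally
    regardless of how many test queries it has."""
    return sum(per_language_scores.values()) / len(per_language_scores)

if __name__ == "__main__":
    # Toy rankings, invented purely for illustration.
    toy = {
        "python": [(["a", "b", "c"], "a"), (["a", "b", "c"], "b")],
        "go": [(["x", "y"], "y")],
    }
    scores = {lang: mean_reciprocal_rank(q) for lang, q in toy.items()}
    print(scores)                 # {'python': 0.75, 'go': 0.5}
    print(macro_average(scores))  # 0.625
```

A macro-average differs from pooling all queries together: here the pooled MRR would be (1 + 0.5 + 0.5) / 3, whereas the macro-average weights the two languages equally.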


This review was generated by AI and reviewed by human editors.