QUICK REVIEW

[Paper Review] CodeT5+: Open Code Large Language Models for Code Understanding and Generation

Yue Wang, Hung Lê|arXiv (Cornell University)|May 13, 2023

Natural Language Processing Techniques | Cited by 27

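One-Sentence Summary

CodeT5+ is a family of open-source encoder-decoder code LLMs that flexibly combines multiple pretraining objectives and can be initialized from off-the-shelf LLMs. It reaches SoTA on a range of code understanding and generation tasks, including with its instruction-tuned variant.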

ABSTRACT

Large language models (LLMs) pretrained on vast source code have achieved prominent progress in code intelligence. However, existing code LLMs have two main limitations in terms of architecture and pretraining tasks. First, they often adopt a specific architecture (encoder-only or decoder-only) or rely on a unified encoder-decoder network for different downstream tasks. The former paradigm is limited by inflexibility in applications while in the latter, the model is treated as a single system for all tasks, leading to suboptimal performance on a subset of tasks. Second, they often employ a limited set of pretraining objectives which might not be relevant to some downstream tasks and hence result in substantial performance degradation. To address these limitations, we propose "CodeT5+", a family of encoder-decoder LLMs for code in which component modules can be flexibly combined to suit a wide range of downstream code tasks. Such flexibility is enabled by our proposed mixture of pretraining objectives to mitigate the pretrain-finetune discrepancy. These objectives cover span denoising, contrastive learning, text-code matching, and causal LM pretraining tasks, on both unimodal and bimodal multilingual code corpora. Furthermore, we propose to initialize CodeT5+ with frozen off-the-shelf LLMs without training from scratch to efficiently scale up our models, and explore instruction-tuning to align with natural language instructions. We extensively evaluate CodeT5+ on over 20 code-related benchmarks in different settings, including zero-shot, finetuning, and instruction-tuning. We observe state-of-the-art (SoTA) model performance on various code-related tasks, such as code generation and completion, math programming, and text-to-code retrieval tasks. Particularly, our instruction-tuned CodeT5+ 16B achieves new SoTA results on the HumanEval code generation task against other open code LLMs.

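Research Motivation and Objectives

Address two limitations of existing code LLMs: fixed architectures (encoder-only, decoder-only, or a unified encoder-decoder) and a limited set of pretraining tasks.
Develop a flexible, modular encoder-decoder framework that supports zero-shot, finetuning, and instruction-tuning settings for code tasks.
Propose a mixture of pretraining objectives (span denoising, causal language modeling (CLM), text-code contrastive learning, and text-code matching) to close the gap between pretraining and finetuning.
Scale up compute-efficiently by initializing from off-the-shelf LLMs rather than training from scratch.
Demonstrate strong performance on 20+ benchmarks across multiple languages and tasks, including an instruction-tuned model that surpasses open models and some closed-source ones.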
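
To make the span denoising objective concrete, here is a minimal sketch of T5-style span corruption on a tokenized code snippet. It assumes whitespace tokenization, hand-picked non-overlapping spans, and T5-style `<extra_id_N>` sentinel names; the function `span_corrupt` is hypothetical and is not taken from the paper, which may sample spans and format targets differently.

```python
def span_corrupt(tokens, spans):
    """Mask each (start, length) span with a sentinel token.

    Returns (source, target): the source keeps the unmasked tokens with one
    sentinel per masked span; the target lists each sentinel followed by the
    tokens it replaced. Spans are assumed to be non-overlapping.
    """
    source, target = [], []
    cursor = 0
    for i, (start, length) in enumerate(sorted(spans)):
        sentinel = f"<extra_id_{i}>"  # assumed T5-style sentinel naming
        source.extend(tokens[cursor:start])
        source.append(sentinel)
        target.append(sentinel)
        target.extend(tokens[start:start + length])
        cursor = start + length
    source.extend(tokens[cursor:])
    return source, target


tokens = "def add ( a , b ) : return a + b".split()
# Mask the function name and the returned expression (hand-picked spans).
src, tgt = span_corrupt(tokens, [(1, 1), (9, 3)])
print(" ".join(src))
print(" ".join(tgt))
```

The encoder would read the corrupted source and the decoder would be trained to emit the target, so both halves of the model are exercised by this objective.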

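Proposed Method

Propose CodeT5+, a code LLM that can operate in encoder-only, decoder-only, and encoder-decoder modes.
Pretrain in two stages: stage 1 is unimodal pretraining on code with span denoising and causal language modeling objectives; stage 2 is bimodal text-code pretraining that combines contrastive learning, text-code matching, and text-code causal language modeling.
To scale efficiently, initialize from frozen off-the-shelf LLMs, pairing a shallow encoder with a deep decoder connected through cross-attention.
Train with multiple objective losses in each stage, then instruction-tune on synthetic natural-language instruction data to align with downstream tasks.
Evaluate on 20+ benchmarks covering 9 programming languages in zero-shot, finetuning, and instruction-tuning settings.
Compare against encoder-only, decoder-only, and encoder-decoder baselines, including open-source and closed-source models.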
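
The text-code contrastive objective in stage 2 can be illustrated with a generic symmetric InfoNCE loss, where the i-th text and the i-th code snippet in a batch form the positive pair and all other pairings act as negatives. This is a sketch under assumptions, not necessarily the exact loss used in the paper: the function `info_nce`, the temperature value 0.07, and the toy 2-dimensional embeddings are all illustrative.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(text_embs, code_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired text/code embeddings."""
    text = [normalize(v) for v in text_embs]
    code = [normalize(v) for v in code_embs]
    n = len(text)
    # sim[i][j]: temperature-scaled cosine similarity of text i and code j.
    sim = [[dot(t, c) / temperature for c in code] for t in text]

    def cross_entropy_diag(rows):
        # Mean cross-entropy with the diagonal entry as the correct class,
        # using a max-shifted log-sum-exp for numerical stability.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(s - m) for s in row))
            total += log_z - row[i]
        return total / n

    text_to_code = cross_entropy_diag(sim)
    code_to_text = cross_entropy_diag([list(col) for col in zip(*sim)])
    return 0.5 * (text_to_code + code_to_text)


# Toy embeddings: aligned pairs should score a much lower loss than swapped pairs.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

Minimizing this loss pulls matching text and code representations together in the encoder's embedding space, which is what makes the encoder usable on its own for text-to-code retrieval.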

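Experimental Results

Research Questions

RQ1: Can a flexible, modular architecture outperform fixed-architecture LLMs on both code understanding and code generation tasks?
RQ2: Does a mixture of pretraining objectives improve cross-task transfer and reduce the mismatch between pretraining and finetuning?
RQ3: Does initializing from off-the-shelf LLMs with a shallow encoder and a deep decoder enable compute-efficient scaling without full pretraining?
RQ4: How much does the instruction-tuned CodeT5+ model gain on the HumanEval benchmark relative to open-source and closed-source models?
RQ5: How do the CodeT5+ variants perform in multilingual zero-shot, finetuning, and retrieval-augmented generation settings?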

Key Findings

HumanEval results (pass@k, %):
Model	Model size	pass@1	pass@10	pass@100
CodeT5+	220M	12.0	20.7	31.6
CodeT5+	770M	15.5	27.2	42.7
CodeT5+	2B	24.2	38.2	57.8
CodeT5+	6B	28.0	47.2	69.8
CodeT5+	16B	30.9	51.6	76.7
InstructCodeT5+	16B	35.0	54.5	77.9
CodeT5+ w/ CodeT	16B	38.5	63.6	77.1
InstructCodeT5+ w/ CodeT	16B	42.9	67.8	78.7
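
For readers unfamiliar with the pass@k columns, the metric is commonly computed with the unbiased estimator introduced alongside HumanEval: generate n samples per problem, count the c samples that pass the unit tests, and estimate pass@k as 1 - C(n-c, k) / C(n, k), averaged over problems. The sketch below assumes the table follows that standard protocol; the sample counts in the usage lines are made up for illustration and are not results from the paper.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that pass the tests, k: evaluation budget.
    """
    if n - c < k:
        # Fewer than k failing samples exist, so any k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative counts only: 20 samples for one problem, 5 of them correct.
print(pass_at_k(20, 5, 1))
print(pass_at_k(20, 5, 10))
```

For k = 1 the estimator reduces to the fraction of correct samples, c / n, which is why pass@1 is the strictest of the three columns.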

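CodeT5+ achieves SoTA-level results on many code tasks, including zero-shot HumanEval generation with instruction tuning.
The instruction-tuned CodeT5+ 16B reaches 35.0% pass@1 and 54.5% pass@10 on HumanEval, surpassing other open code LLMs and even some closed-source models.
Smaller CodeT5+ variants (e.g., 220M to 770M) match or outperform larger open-source LLMs on several tasks.
Compute-efficient pretraining with a frozen deep decoder and a shallow encoder keeps the number of trainable parameters small even when scaling to 16B.
Retrieval-augmented generation with CodeT5+ substantially improves code generation and outperforms comparable approaches.
CodeT5+ performs strongly in zero-shot, finetuning, and instruction-tuning settings, with especially notable gains on text-to-code retrieval, code completion, and math programming tasks.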
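
The compute-efficiency finding above comes down to simple arithmetic: if the deep decoder is frozen, only the shallow encoder and the connecting layers contribute trainable parameters. The sketch below shows that calculation. The component sizes are hypothetical round numbers chosen to sum to 16B, and treating the cross-attention layers as trainable is an assumption; neither is taken from the paper.

```python
def trainable_fraction(components):
    """components maps a name to (parameter_count, is_trainable)."""
    total = sum(count for count, _ in components.values())
    trainable = sum(count for count, is_trainable in components.values() if is_trainable)
    return trainable / total


# Hypothetical sizes for illustration only (not figures from the paper).
components = {
    "shallow_encoder": (350_000_000, True),
    "cross_attention": (50_000_000, True),   # assumed trainable
    "deep_decoder": (15_600_000_000, False),  # frozen off-the-shelf LLM
}
fraction = trainable_fraction(components)
print(f"{fraction:.1%} of parameters are trainable")
```

Under these assumed sizes only a few percent of the weights receive gradient updates, which is the sense in which scaling to 16B stays cheap relative to full pretraining.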

This review was generated by AI and reviewed by a human editor.