QUICK REVIEW

[Paper Review] CodeGen2: Lessons for Training LLMs on Programming and Natural Languages

Erik Nijkamp, Hiroaki Hayashi|arXiv (Cornell University)|May 3, 2023

Software Engineering Research | Cited by 35

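One-Sentence Summary

CodeGen2 experiments with unifying model architecture, learning objectives, infill sampling, and data distributions for training LLMs on code and natural language. It finds that the benefits of Prefix-LM are unclear, that infilling is not free, that mixed data is promising, and that multi-epoch training is effective, and it provides an open-source recipe.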

ABSTRACT

Large language models (LLMs) have demonstrated remarkable abilities in representation learning for program synthesis and understanding tasks. The quality of the learned representations appears to be dictated by the neural scaling laws as a function of the number of model parameters and observations, while imposing upper bounds on the model performance by the amount of available data and compute, which is costly. In this study, we attempt to render the training of LLMs for program synthesis more efficient by unifying four key components: (1) model architectures, (2) learning methods, (3) infill sampling, and, (4) data distributions. Specifically, for the model architecture, we attempt to unify encoder and decoder-based models into a single prefix-LM. For learning methods, (i) causal language modeling, (ii) span corruption, (iii) infilling are unified into a simple learning algorithm. For infill sampling, we explore the claim of a "free lunch" hypothesis. For data distributions, the effect of a mixture distribution and multi-epoch training of programming and natural languages on model performance is explored. We conduct a comprehensive series of empirical experiments on 1B LLMs, for which failures and successes of this exploration are distilled into five lessons. We will provide a final recipe for training and release CodeGen2 models in size 1B, 3.7B, 7B, and, 16B parameters, along with the training framework as open-source: https://github.com/salesforce/CodeGen.

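Research Motivation and Goals

Reduce the cost of training LLMs for programming and natural language tasks by unifying design choices.
Assess whether a Prefix-LM architecture can unify encoder-style and decoder-style capabilities without sacrificing performance.
Assess whether infill sampling comes at zero or negligible additional compute cost (the "free lunch" hypothesis).
Study how mixing natural language and programming language data, and training for multiple epochs, affect model performance.
Provide an open-source, practical training recipe and release CodeGen2 models at multiple scales.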

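Proposed Method

Unify architecture, learning objectives, left-to-right and infill sampling, and data distributions into a single recipe.
Study Prefix-LM as a potential way to unify encoder and decoder models, and evaluate it on programming and language benchmarks.
Use a mixture of causal language modeling and span corruption as the learning objective.
Experiment with infill sampling and test the "free lunch" hypothesis.
Run experiments with mixed-domain data (natural language plus programming languages) and multi-epoch training to assess performance gains.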
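To make the mixed objective more concrete, here is a minimal sketch of how one training example might be built: with some probability the sequence is left untouched for causal language modeling, otherwise one span is cut out and moved to the end behind a sentinel. The token names (`<mask_1>`, `<eom>`), the 0.5 mixing probability, and the span fraction are illustrative assumptions, not values confirmed by this review.

```python
import random


def span_corrupt(tokens, span_start, span_len, sentinel="<mask_1>", eom="<eom>"):
    """Replace one span with a sentinel and append the span after the sequence.

    The sentinel and end-of-mask token names are hypothetical placeholders.
    """
    if span_len < 1 or span_start < 0 or span_start + span_len > len(tokens):
        raise ValueError("span must lie inside the token list")
    span = tokens[span_start:span_start + span_len]
    corrupted = tokens[:span_start] + [sentinel] + tokens[span_start + span_len:]
    # The model learns to emit the missing span after the repeated sentinel.
    return corrupted + [sentinel] + span + [eom]


def make_example(tokens, rng, p_causal=0.5, span_frac=0.25):
    """Pick the objective for one sequence (p_causal and span_frac are assumed)."""
    if rng.random() < p_causal or len(tokens) < 2:
        # Causal language modeling: the sequence is used unchanged.
        return "causal", list(tokens)
    span_len = max(1, int(len(tokens) * span_frac))
    span_start = rng.randrange(0, len(tokens) - span_len + 1)
    return "span_corruption", span_corrupt(tokens, span_start, span_len)


if __name__ == "__main__":
    rng = random.Random(0)
    print(make_example(["def", "f", "(", ")", ":", "pass"], rng))
```

Because the corrupted sequence is still read left to right, the same decoder can be trained on both kinds of example, which is what makes infill sampling possible at inference time.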

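Experimental Results

Research Questions

RQ1: Across evaluation settings, does Prefix-LM provide a measurable benefit over a causal decoder on code and language tasks?
RQ2: Is infill sampling truly free in terms of compute and performance (the "free lunch" hypothesis)?
RQ3: Does a simple mixture of causal language modeling and span corruption perform well on both generation and understanding tasks?
RQ4: How do mixing natural and programming languages and multi-epoch training affect cross-domain performance?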
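The contrast behind RQ1 comes down to the attention mask. The sketch below, written with plain Python lists, builds a causal mask and a Prefix-LM mask in which the first `prefix_len` positions attend to each other bidirectionally. The function names and the boolean-matrix representation are illustrative assumptions, not the paper's implementation.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j."""
    return [[j <= i for j in range(n)] for i in range(n)]


def prefix_lm_mask(n, prefix_len):
    """Causal mask, except that the prefix is fully visible to itself."""
    if not 0 <= prefix_len <= n:
        raise ValueError("prefix_len must be between 0 and n")
    return [[j <= i or j < prefix_len for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    # Print a 4-token mask with a 2-token bidirectional prefix.
    for row in prefix_lm_mask(4, 2):
        print("".join("x" if allowed else "." for allowed in row))
```

With `prefix_len=0` the Prefix-LM mask reduces to the causal mask, so the causal decoder is a special case of the architecture under comparison.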

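Key Findings

The benefits of Prefix-LM did not show up clearly across tasks; performance is task-dependent and does not consistently exceed that of a causal decoder.
Infill sampling brought no clear free-lunch advantage; in some settings, adding infill slightly lowered HumanEval pass@1.
A simple mixed objective (causal language modeling plus span corruption) is competitive for both left-to-right and infill sampling, although UL2-style objectives did not surpass the causal baseline in this study.
Mixing natural language and programming language data shows promising results, and multi-epoch training yields significant gains, seen most clearly in CodeGen2.5.
The authors provide open-source training code and plan to release CodeGen2 models (1B, 3.7B, 7B, 16B) once training converges.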
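For readers unfamiliar with the pass@1 metric mentioned above, the following sketch implements the unbiased pass@k estimator commonly used with HumanEval: given `n` sampled completions per problem of which `c` pass the unit tests, it estimates the chance that at least one of `k` samples passes. Whether the paper computed its scores exactly this way is an assumption; the function name is illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k as 1 - C(n - c, k) / C(n, k) for one problem."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # With k = 1 the estimator is simply the fraction of passing samples.
    print(pass_at_k(n=10, c=3, k=1))
```

The benchmark score is the mean of this value over all problems, so a small drop in pass@1 means slightly fewer first-try correct completions on average.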


This review was generated by AI and reviewed by a human editor.