QUICK REVIEW

[Paper Review] CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis

Erik Nijkamp, Bo Pang|arXiv (Cornell University)|Mar 25, 2022

Software Engineering Research | Cited by 234

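One-Sentence Summary

CodeGen is a family of open-source language models with up to 16.1B parameters, trained on natural language and programming language data. The paper demonstrates multi-turn program synthesis and introduces the Multi-Turn Programming Benchmark (MTPB).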

ABSTRACT

Program synthesis strives to generate a computer program as a solution to a given problem specification, expressed with input-output examples or natural language descriptions. The prevalence of large language models advances the state-of-the-art for program synthesis, though limited training resources and data impede open access to such models. To democratize this, we train and release a family of large language models up to 16.1B parameters, called CODEGEN, on natural language and programming language data, and open source the training library JAXFORMER. We show the utility of the trained model by demonstrating that it is competitive with the previous state-of-the-art on zero-shot Python code generation on HumanEval. We further investigate the multi-step paradigm for program synthesis, where a single program is factorized into multiple prompts specifying subproblems. To this end, we construct an open benchmark, Multi-Turn Programming Benchmark (MTPB), consisting of 115 diverse problem sets that are factorized into multi-turn prompts. Our analysis on MTPB shows that the same intent provided to CODEGEN in multi-turn fashion significantly improves program synthesis over that provided as a single turn. We make the training library JAXFORMER and model checkpoints available as open source contribution: https://github.com/salesforce/CodeGen.

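Research Motivation and Objectives

Democratize access to large-scale code models by releasing an open-source training library and model checkpoints.
Investigate whether expressing intent over multiple natural-language turns improves program synthesis quality compared with a single-turn prompt.
Quantify how model size and data size affect multi-turn program synthesis capability.
Introduce and validate the Multi-Turn Programming Benchmark (MTPB) for evaluating multi-turn synthesis performance.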

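Proposed Method

Train autoregressive transformers on a mix of natural language and programming language corpora (ThePile, BigQuery, BigPython).
Use a sequential training scheme: pre-train CodeGen-NL on ThePile, train CodeGen-Multi on BigQuery, then train CodeGen-Mono on BigPython.
Evaluate single-turn program synthesis on HumanEval and compare against open-source baselines and Codex-style models.
Propose a multi-turn prompting framework and construct MTPB, a set of 115 problems in which prompts and subprograms are interleaved.
Use prompt perplexity as a proxy for how well the model understands user intent.
Open-source the training library JAXformer and release model checkpoints to support reproducibility.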
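
The multi-turn prompting idea can be sketched in a few lines of Python. This is a minimal illustration, not the paper's evaluation harness: the helper names (`format_prompt`, `multi_turn_synthesize`, `single_turn_prompt`), the choice to render prompts as `#` comments, and the canned `toy_generate` stand-in for a model are all assumptions made for this example.

```python
from typing import Callable, List


def format_prompt(text: str) -> str:
    """Render a natural-language prompt as Python comments (assumed format)."""
    return "\n".join(f"# {line}" for line in text.splitlines())


def multi_turn_synthesize(prompts: List[str], generate: Callable[[str], str]) -> str:
    """Issue prompts one turn at a time.

    Each call to `generate` sees every earlier prompt and subprogram,
    so the returned string interleaves prompts with generated code.
    """
    context = ""
    for prompt in prompts:
        context += format_prompt(prompt) + "\n"
        subprogram = generate(context)  # hypothetical model call
        context += subprogram.rstrip("\n") + "\n"
    return context


def single_turn_prompt(prompts: List[str]) -> str:
    """Baseline: concatenate all prompts into one specification."""
    return "\n".join(format_prompt(p) for p in prompts) + "\n"


def toy_generate(context: str) -> str:
    """Canned stand-in for a model: answers the most recent comment line."""
    canned = {
        "# Assign the list [3, 1, 2] to xs.": "xs = [3, 1, 2]",
        "# Sort xs in place.": "xs.sort()",
        "# Store the first element in result.": "result = xs[0]",
    }
    last_comment = [ln for ln in context.splitlines() if ln.startswith("#")][-1]
    return canned.get(last_comment, "pass")


PROMPTS = [
    "Assign the list [3, 1, 2] to xs.",
    "Sort xs in place.",
    "Store the first element in result.",
]

program = multi_turn_synthesize(PROMPTS, toy_generate)
namespace: dict = {}
exec(program, namespace)  # run the assembled program
print(program)
print(namespace["result"])
```

With a real model, `generate` would be replaced by a sampling call, and the single-turn baseline would pass the output of `single_turn_prompt` to the model once. The toy generator says nothing about which mode works better; it only shows how the context is assembled.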

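Experimental Results

Research Questions

RQ1: As model and data size grow, do large language models trained on natural language and code exhibit emergent multi-turn program synthesis capability?
RQ2: Does factorizing user intent into multi-turn natural-language prompts improve program synthesis quality over a single-turn specification?
RQ3: How does the multi-turn paradigm perform across different model sizes and amounts of code data?
RQ4: How does prompt perplexity affect the success rate of generated programs?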

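Key Findings

CodeGen models are competitive with, and in some cases better than, open-source baselines on Python code generation; the larger monolingual Python models approach or exceed some Codex variants.
Multilingual training (CodeGen-Multi) yields substantial gains over the NL-only models, and Python-specific fine-tuning (CodeGen-Mono) improves synthesis performance further.
Multi-turn prompting significantly raises pass rates over concatenated single-turn prompts across model sizes, especially on harder problems.
Prompt perplexity correlates with success rate: prompts with lower perplexity tend to yield higher functional correctness.
Program synthesis capability emerges and scales with model size and data size, suggesting a scaling trend for multi-turn code generation.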
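
To make the perplexity finding concrete, here is a small sketch of the standard definition: the exponential of the negative mean log-probability per token. The function name and the per-token probabilities are made up for illustration. In practice the log-probabilities would come from the model scoring the prompt tokens, and this summary does not specify how the paper aggregates perplexity across turns.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Return exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical per-token probabilities for two prompts (made-up numbers).
familiar_prompt = [math.log(p) for p in (0.5, 0.5, 0.25, 0.25)]
unfamiliar_prompt = [math.log(p) for p in (0.1, 0.05, 0.2, 0.1)]

ppl_familiar = perplexity(familiar_prompt)
ppl_unfamiliar = perplexity(unfamiliar_prompt)
print(ppl_familiar, ppl_unfamiliar)
```

A lower value means the model finds the prompt more predictable, which is the sense in which perplexity serves as a proxy for how well the model understands the user's intent.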


This review was generated by AI and reviewed by human editors.