QUICK REVIEW

[Paper Review] CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis

Erik Nijkamp, Bo Pang | arXiv (Cornell University) | Mar 25, 2022

Software Engineering Research | 234 citations

TL;DR

CodeGen releases open-source LLMs up to 16.1B parameters trained on natural language and programming data; demonstrates multi-turn program synthesis and introduces the Multi-Turn Programming Benchmark (MTPB).

ABSTRACT

Program synthesis strives to generate a computer program as a solution to a given problem specification, expressed with input-output examples or natural language descriptions. The prevalence of large language models advances the state-of-the-art for program synthesis, though limited training resources and data impede open access to such models. To democratize this, we train and release a family of large language models up to 16.1B parameters, called CODEGEN, on natural language and programming language data, and open source the training library JAXFORMER. We show the utility of the trained model by demonstrating that it is competitive with the previous state-of-the-art on zero-shot Python code generation on HumanEval. We further investigate the multi-step paradigm for program synthesis, where a single program is factorized into multiple prompts specifying subproblems. To this end, we construct an open benchmark, Multi-Turn Programming Benchmark (MTPB), consisting of 115 diverse problem sets that are factorized into multi-turn prompts. Our analysis on MTPB shows that the same intent provided to CODEGEN in multi-turn fashion significantly improves program synthesis over that provided as a single turn. We make the training library JAXFORMER and model checkpoints available as open source contribution: https://github.com/salesforce/CodeGen.

Motivation & Objective

Democratize access to large-scale code models by releasing an open-source training library and model checkpoints.
Investigate whether multi-turn natural language specifications improve program synthesis over single-turn prompts.
Quantitatively analyze how model and data scale influence multi-turn program synthesis capacity.
Introduce and validate the Multi-Turn Programming Benchmark (MTPB) to evaluate multi-turn synthesis performance.

Proposed method

Train autoregressive transformers on a mixed natural language and programming language corpus (ThePile, BigQuery, BigPython).
Use a sequential training regime: CodeGen-NL on ThePile, then CodeGen-Multi on BigQuery, followed by CodeGen-Mono on BigPython.
Evaluate single-turn program synthesis on HumanEval and compare against open baselines and Codex-style models.
Propose a multi-turn prompt framework and construct the 115-task MTPB with interleaved prompts and subprograms.
Assess prompt understanding via prompt perplexity as a proxy for user intent comprehension.
Open-source the training library JAXformer and provide model checkpoints for reproducibility.
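The multi-turn framework described above can be sketched as a loop that interleaves each natural-language prompt with the subprogram generated for it, so that every later turn is conditioned on all earlier prompts and code. This is a minimal illustration, not the CodeGen or JAXformer implementation: the `Generator` interface, the helper names, the choice to render prompts as Python comments, and the toy model are all assumptions made for the example.

```python
from typing import Callable, List

# Hypothetical model interface (assumption): takes the full context so far
# and returns the code for the next step.
Generator = Callable[[str], str]


def as_comment(prompt: str) -> str:
    """Render a natural-language turn as a Python comment line (assumed format)."""
    return "# " + prompt.strip()


def multi_turn_synthesis(prompts: List[str], generate: Generator) -> str:
    """Interleave each prompt with the subprogram generated for it."""
    context = ""
    for prompt in prompts:
        context += as_comment(prompt) + "\n"
        subprogram = generate(context)  # model sees all earlier turns and code
        context += subprogram.rstrip() + "\n\n"
    return context.rstrip() + "\n"


def single_turn_synthesis(prompts: List[str], generate: Generator) -> str:
    """Baseline: concatenate all prompts into one specification, generate once."""
    context = as_comment(" ".join(p.strip() for p in prompts)) + "\n"
    return context + generate(context).rstrip() + "\n"


def toy_generate(context: str) -> str:
    """Toy stand-in for a language model: emits one line per prompt seen so far."""
    turn = sum(1 for line in context.splitlines() if line.startswith("# "))
    return f"step_{turn} = {turn}"


if __name__ == "__main__":
    print(multi_turn_synthesis(["Import libraries.", "Load the data."], toy_generate))
```

Swapping `toy_generate` for a call to a real model would leave the loop unchanged; the only difference between the two modes is whether the model is queried once per turn or once for the whole concatenated specification.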

Experimental results

Research questions

RQ1: Can large language models trained on natural language and code exhibit emergent multi-turn program synthesis capabilities as model and data scale increase?
RQ2: Does factorizing user intent into multiple natural language turns improve program synthesis quality compared with single-turn specifications?
RQ3: How does the multi-turn paradigm perform across model sizes and code-data volumes?
RQ4: What is the impact of prompt perplexity on the success rate of generated programs?
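Prompt perplexity, used here as a proxy for how well the model understands the user's intent, is conventionally the exponential of the mean negative log-likelihood of the prompt tokens. The sketch below assumes the per-token log-probabilities (natural log) have already been obtained from a model, conditioned on whatever context precedes the prompt; the function name and input format are illustrative assumptions, not part of the released code.

```python
import math
from typing import Sequence


def prompt_perplexity(token_logprobs: Sequence[float]) -> float:
    """Return exp(mean negative log-likelihood) over the prompt tokens.

    token_logprobs: natural-log probabilities the model assigned to each
    prompt token (assumed to be computed elsewhere).
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


if __name__ == "__main__":
    confident = [math.log(0.9)] * 5  # model finds the prompt predictable
    uncertain = [math.log(0.2)] * 5  # model finds the prompt surprising
    print(prompt_perplexity(confident), prompt_perplexity(uncertain))
```

A prompt whose tokens each receive probability 0.5 has perplexity 2; lower values mean the model found the prompt more predictable.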

Key findings

CodeGen models match or exceed open-source baselines on Python code generation tasks, and the larger Python-specialized models (CodeGen-Mono) approach or surpass some Codex variants.
Training on code in multiple programming languages (CodeGen-Multi) yields substantial improvements over NL-only models, and Python-focused fine-tuning (CodeGen-Mono) further boosts synthesis performance.
Multi-turn prompts significantly improve pass rates over concatenated single-turn prompts across model sizes, particularly for harder problems.
Prompt perplexity correlates with success: lower-perplexity prompts tend to yield higher functional accuracy.
Program synthesis capacity emerges and improves as model size and data size grow, suggesting that multi-turn code generation scales with both.
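This review does not state which metric underlies the HumanEval comparison. Results on that benchmark are usually reported as pass@k, estimated from n sampled programs per problem, of which c pass the unit tests, as 1 - C(n-c, k) / C(n, k). The sketch below implements that standard estimator under the assumption that it applies here; the function name and argument checks are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of sampled programs, c: number that pass the tests,
    k: sample budget being evaluated (1 <= k <= n).
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Probability that a random size-k subset of the n samples contains no
    # passing program; comb returns 0 when k > n - c.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    print(pass_at_k(n=10, c=3, k=1))
```

A benchmark-level score would be the mean of this value over all problems.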

This review was created by AI and reviewed by human editors.