QUICK REVIEW

[Paper Review] CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning

H Le, Yue Wang|arXiv (Cornell University)|Jul 5, 2022

Software Engineering Research | Cited by 87

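One-Sentence Summary

CodeRL combines the pretrained code language model CodeT5 with an actor-critic reinforcement learning framework that uses unit test signals to refine code generation. It achieves state-of-the-art results on APPS and strong zero-shot transfer on MBPP.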

ABSTRACT

Program synthesis or code generation aims to generate a program that satisfies a problem specification. Recent approaches using large-scale pretrained language models (LMs) have shown promising results, yet they have some critical limitations. In particular, they often follow a standard supervised fine-tuning procedure to train a code generation model only from the pairs of natural-language problem descriptions and ground-truth programs. Such paradigm largely ignores some important but potentially useful signals in the problem specification such as unit tests, which thus often results in poor performance when solving complex unseen coding tasks. To address the limitations, we propose "CodeRL", a new framework for program synthesis tasks through pretrained LMs and deep reinforcement learning (RL). Specifically, during training, we treat the code-generating LM as an actor network, and introduce a critic network that is trained to predict the functional correctness of generated programs and provide dense feedback signals to the actor. During inference, we introduce a new generation procedure with a critical sampling strategy that allows a model to automatically regenerate programs based on feedback from example unit tests and critic scores. For the model backbones, we extended the encoder-decoder architecture of CodeT5 with enhanced learning objectives, larger model sizes, and better pretraining data. Our method not only achieves new SOTA results on the challenging APPS benchmark, but also shows strong zero-shot transfer capability with new SOTA results on the simpler MBPP benchmark.

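Motivation and Objectives

Improve program synthesis beyond standard supervised fine-tuning by exploiting functional signals from unit tests.
Propose an actor-critic RL framework in which the actor is a pretrained code language model and the critic predicts functional correctness.
Improve CodeT5 pretraining with larger-scale data and a next-token prediction objective, making it better suited to generation tasks.
Introduce a novel inference-time generation procedure that uses unit test feedback and critic guidance to regenerate or repair programs.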
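
As a rough illustration of the last objective, the sketch below splits sampled programs by whether they pass the example unit tests, then ranks each group by critic score to pick candidates for regeneration or repair. The names select_seeds, passes_example_tests, critic_score, and top_k are hypothetical and chosen for this sketch. The summary does not give the actual selection rule, so this is one plausible reading, not the paper's procedure.

```python
from typing import Callable, List, Sequence, Tuple


def select_seeds(
    candidates: Sequence[str],
    passes_example_tests: Callable[[str], bool],  # hypothetical test runner
    critic_score: Callable[[str], float],         # hypothetical critic, higher is better
    top_k: int = 2,
) -> Tuple[List[str], List[str]]:
    """Return (programs to refine, programs to repair), each ranked by critic score."""
    passed: List[str] = []
    failed: List[str] = []
    for program in candidates:
        # Run the example unit tests once per candidate and bucket the result.
        (passed if passes_example_tests(program) else failed).append(program)

    def best(programs: List[str]) -> List[str]:
        # Keep the top_k programs the critic considers most promising.
        return sorted(programs, key=critic_score, reverse=True)[:top_k]

    return best(passed), best(failed)
```

Passing programs would then serve as starting points for regeneration and failing ones would go to a repair step. Both steps are outside this sketch.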

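Proposed Method

Treat the code-generating language model as the actor in a reinforcement learning setup and sample synthetic code sequences from it.
Train a critic to predict the unit test outcome (CompileError, RuntimeError, FailedTest, PassedTest) and use its hidden states to estimate token-level values.
Define the RL return from unit test feedback and apply a baseline to stabilize training.
Incorporate the critic's intermediate returns during generation to provide token-level guidance.
At inference time, apply a critic-guided generation procedure (critical sampling) that uses example unit tests and critic scores to refine or repair outputs.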
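
The training-time steps above (a return derived from the unit test outcome, a baseline, and token-level values from the critic) can be sketched in plain Python. The reward numbers in REWARDS, the function names, and the choice of a second program's reward as the baseline are assumptions made for this sketch, not values or APIs taken from this summary.

```python
from typing import Optional, Sequence

# Illustrative reward per unit-test outcome (assumed values for this sketch).
REWARDS = {
    "CompileError": -1.0,
    "RuntimeError": -0.6,
    "FailedTest": -0.3,
    "PassedTest": 1.0,
}


def reward_from_outcome(outcome: str) -> float:
    """Map one of the four outcome labels to a scalar return."""
    return REWARDS[outcome]


def actor_loss(
    token_logprobs: Sequence[float],
    sample_outcome: str,
    baseline_outcome: str,
    token_values: Optional[Sequence[float]] = None,
) -> float:
    """Policy-gradient style loss for one sampled program.

    baseline_outcome is assumed to come from a reference program
    (for example a greedy decode); token_values stands in for the
    critic's token-level estimates and defaults to uniform weights.
    """
    if token_values is None:
        token_values = [1.0] * len(token_logprobs)
    if len(token_values) != len(token_logprobs):
        raise ValueError("token_values and token_logprobs must have equal length")
    # Subtracting the baseline reward reduces the variance of the gradient estimate.
    advantage = reward_from_outcome(sample_outcome) - reward_from_outcome(baseline_outcome)
    weighted_logprob = sum(v * lp for v, lp in zip(token_values, token_logprobs))
    # Minimizing this raises the log-probability of samples that beat the baseline.
    return -advantage * weighted_logprob
```

When the sample and the baseline reach the same outcome the advantage is zero, so that sample contributes no gradient.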

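Experimental Results

Research Questions

RQ1: How can unit tests be integrated into reinforcement learning to improve the functional correctness of program synthesis?
RQ2: Does an actor-critic framework in which the critic predicts unit test outcomes improve generation over standard fine-tuning?
RQ3: Does extending CodeT5 pretraining with next-token prediction and larger-scale Python data improve performance on code generation benchmarks?
RQ4: How do critic-guided generation and inference-time program repair and refinement affect final correctness?
RQ5: Does the approach transfer across code generation models and benchmarks such as APPS and MBPP?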

Key Findings

Achieves state-of-the-art results on APPS, with improvements of more than 2%, 6%, and 20% on pass@1, pass@5, and pass@1000, respectively.
Shows strong zero-shot transfer on MBPP, reaching 63.0% pass@80 and surpassing the 61.4% of the fine-tuned GPT-137B baseline.
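Scaling CodeT5 with larger model sizes and improved pretraining data and objectives yields performance competitive with much larger language models.
RL-based fine-tuning with unit test signals significantly improves performance across different backbone models.
The proposed critic sampling procedure enables program generation, refinement, and repair driven by functional correctness signals.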
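
The pass@k figures quoted above measure the chance that at least one of k generated programs passes all tests. A minimal sketch of the commonly used unbiased estimator follows, assuming n samples were drawn per problem and c of them were correct. The summary does not say how the paper computed its scores, so treat this as background on the metric and not as the authors' evaluation code. The function name pass_at_k is chosen for this sketch.

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c are correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that
    a random size-k subset of the n samples contains no correct program.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```

A benchmark score is then the mean of this estimate over all problems.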


This review was generated by AI and reviewed by human editors.