QUICK REVIEW

[Paper Review] AceCoder: Utilizing Existing Code to Enhance Code Generation

Jia Li, Yunfei Zhao|arXiv (Cornell University)|Mar 31, 2023

Software Engineering Research | Cited by 16

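One-sentence summary

AceCoder adds guided code generation and example retrieval to prompts, significantly improving code generation accuracy across multiple large language models (LLMs) and programming languages on the MBPP, MBJP, and MBJSP benchmarks.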

ABSTRACT

Large Language Models (LLMs) have shown great success in code generation. LLMs take as the input a prompt and output the code. A key question is how to make prompts (i.e., Prompting Techniques). Existing prompting techniques are designed for natural language generation and have low accuracy in code generation. In this paper, we propose a new prompting technique named AceCoder. Our motivation is that code generation meets two unique challenges (i.e., requirement understanding and code implementation). AceCoder contains two novel mechanisms (i.e., guided code generation and example retrieval) to solve these challenges. (1) Guided code generation asks LLMs first to analyze requirements and output an intermediate preliminary (e.g., test cases). The preliminary is used to clarify requirements and tell LLMs "what to write". (2) Example retrieval selects similar programs as examples in prompts, which provide lots of relevant content (e.g., algorithms, APIs) and teach LLMs "how to write". We apply AceCoder to three LLMs (e.g., Codex) and evaluate it on three public benchmarks using the Pass@k. Results show that AceCoder can significantly improve the performance of LLMs on code generation. (1) In terms of Pass@1, AceCoder outperforms the state-of-the-art baseline by up to 56.4% in MBPP, 70.7% in MBJP, and 88.4% in MBJSP. (2) AceCoder is effective in LLMs with different sizes (i.e., 6B to 13B) and different languages (i.e., Python, Java, and JavaScript). (3) Human evaluation shows human developers prefer programs from AceCoder.

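Research motivation and goals

Motivate the need for prompting techniques specialized for code generation, which faces two unique challenges: requirement understanding and code implementation.
Propose AceCoder, which addresses these challenges with two mechanisms: guided code generation and example retrieval.
Demonstrate its effectiveness across multiple LLMs and programming languages using public benchmarks and human evaluation.
Provide ablations and design insights that show the contribution of each AceCoder module.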

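Proposed method

A retriever fetches similar <requirement, code> pairs from a corpus using BM25 as implemented in Lucene.
A selector filters the retrieved examples, reducing redundancy through n-gram coverage and a decay-based scoring loop (Algorithm 1).
An analyzer converts the retrieved programs into <requirement, preliminary, code> triples by extracting a preliminary (e.g., test cases) from each example.
Prompt construction inserts the example triples into the prompt so that the LLM first generates a preliminary (e.g., test cases) and then the final code.
Code generation is the final step: given the constructed prompt, the LLM outputs a preliminary and then the code.
Three base LLMs (CodeGeeX-13B, CodeGen-6B, InCoder-6B) are evaluated on MBPP (Python), MBJP (Java), and MBJSP (JavaScript) using Pass@k (k = 1, 3, 5).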
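
The selector and prompt-construction steps above can be sketched as follows. This is a minimal illustration, not the paper's Algorithm 1: the function names (`ngrams`, `select_examples`, `build_prompt`), the use of word unigrams and bigrams, the decay factor of 0.1, and the `# Requirement` / `# Test cases` / `# Code` prompt layout are all assumptions made for this example. The candidates are assumed to have been retrieved already (e.g., by BM25) and converted into triples.

```python
def ngrams(text, n_max=2):
    """Return the set of word n-grams (n = 1..n_max) of a lowercased text."""
    tokens = text.lower().split()
    return {
        tuple(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    }


def select_examples(requirement, candidates, k=2, decay=0.1):
    """Greedily pick up to k triples whose requirements cover the query's n-grams.

    candidates: list of (requirement, preliminary, code) triples.
    Every n-gram of the query starts with weight 1.0. After each pick, the
    weights of the n-grams that pick covered are multiplied by `decay`
    (an assumed value), so later picks are rewarded for covering new n-grams.
    """
    weights = {gram: 1.0 for gram in ngrams(requirement)}
    pool = list(candidates)
    chosen = []

    def score(example):
        # Sum the current weights of the query n-grams this example shares.
        return sum(weights[g] for g in ngrams(example[0]) if g in weights)

    while pool and len(chosen) < k:
        best = max(pool, key=score)
        chosen.append(best)
        pool.remove(best)
        for gram in ngrams(best[0]):
            if gram in weights:
                weights[gram] *= decay
    return chosen


def build_prompt(requirement, examples):
    """Assemble a prompt that ends where the LLM should write its preliminary.

    The section labels are hypothetical; they only illustrate the
    requirement -> preliminary -> code ordering of each example.
    """
    parts = [
        f"# Requirement:\n{req}\n# Test cases:\n{prelim}\n# Code:\n{code}\n"
        for req, prelim, code in examples
    ]
    parts.append(f"# Requirement:\n{requirement}\n# Test cases:\n")
    return "\n".join(parts)


candidates = [
    ("sort a list of integers", "assert f([2, 1]) == [1, 2]",
     "def f(xs): return sorted(xs)"),
    ("sort a list of integers quickly", "assert f([3, 1]) == [1, 3]",
     "def f(xs): return sorted(xs)"),
    ("remove duplicates from a string", "assert f('aab') == 'ab'",
     "def f(s): return ''.join(dict.fromkeys(s))"),
]
query = "sort a list of integers and remove duplicates"
picked = select_examples(query, candidates, k=2)
print(build_prompt(query, picked))
```

In this toy run the second sorting example is skipped in favor of the deduplication example, because the n-grams it shares with the query were down-weighted after the first pick. That is the redundancy-reduction effect the selector is described as providing.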

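Experimental results

Research questions

RQ1: Is AceCoder more accurate at code generation than existing prompting techniques?
RQ2: How does AceCoder compare with retrieval-based baselines?
RQ3: Do human developers prefer the code generated by AceCoder?
RQ4: What does each of AceCoder's three modules (Retriever, Selector, Analyzer) contribute?
RQ5: Which design choices for the three modules maximize performance?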

Key findings

Base model	Prompting	MBPP Pass@1	MBPP Pass@3	MBPP Pass@5	MBJP Pass@1	MBJP Pass@3	MBJP Pass@5	MBJSP Pass@1	MBJSP Pass@3	MBJSP Pass@5
CodeGeeX-13B	Zero-shot prompting	5.20	13.80	19.40	4.46	11.97	18.26	0.20	0.20	0.41
CodeGeeX-13B	CoT prompting	12.60	23.40	30.20	14.40	28.19	33.67	11.35	21.10	25.96
CodeGeeX-13B	Few-shot prompting	20.40	30.60	36.00	16.63	26.17	34.48	11.16	19.88	25.56
CodeGeeX-13B	AceCoder	26.74	36.43	41.13	28.38	36.79	41.54	21.03	31.44	36.04
CodeGen-6B	Zero-shot prompting	10.40	19.40	24.40	14.81	25.76	31.44	8.72	19.67	22.92
CodeGen-6B	CoT prompting	13.00	21.00	26.00	13.59	25.35	31.24	11.56	20.08	24.54
CodeGen-6B	Few-shot prompting	14.60	24.00	30.20	18.25	30.02	34.68	9.94	19.88	23.12
CodeGen-6B	AceCoder	22.83	34.58	40.16	22.45	34.27	40.96	16.45	27.31	32.16
InCoder-6B	Zero-shot prompting	4.20	11.40	16.20	2.23	5.88	9.13	3.65	5.88	8.11
InCoder-6B	CoT prompting	3.99	10.65	15.31	1.83	4.46	7.10	1.22	2.03	4.67
InCoder-6B	Few-shot prompting	12.80	22.80	28.20	10.95	23.53	26.17	12.78	22.52	27.79
InCoder-6B	AceCoder	20.16	31.44	34.10	16.37	29.89	34.74	15.97	27.13	30.65
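
For reference, Pass@k figures such as those in the table above are commonly computed per problem with the unbiased estimator 1 - C(n - c, k) / C(n, k), where n programs are sampled and c of them pass all tests, and then averaged over the benchmark. This summary does not say which estimator or sample count the paper used, so the function below is only a sketch of that standard definition; the name `pass_at_k` and the sample counts in the usage lines are assumptions.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate Pass@k for one problem from n samples, c of which pass.

    Returns the probability that at least one of k samples drawn without
    replacement from the n generated programs is correct.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical per-problem results: (samples generated, samples passing).
results = [(5, 1), (5, 0), (5, 3)]
for k in (1, 3, 5):
    mean = sum(pass_at_k(n, c, k) for n, c in results) / len(results)
    print(f"Pass@{k} = {100 * mean:.2f}")
```

When n equals k, the estimator reduces to checking whether any of the k generated programs passes.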

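In terms of Pass@1, AceCoder outperforms the state-of-the-art baseline by up to 56.4% on MBPP, 70.7% on MBJP, and 88.4% on MBJSP.
Compared with retrieval-based baselines, AceCoder improves Pass@1 by up to 13.1% (MBPP), 23.44% (MBJP), and 15.8% (MBJSP).
AceCoder is effective across LLM sizes from 6B to 13B and across Python, Java, and JavaScript.
Human evaluation shows that developers prefer AceCoder-generated programs in terms of correctness, code smell, and maintainability.
Ablation studies show that all three modules (Retriever, Selector, Analyzer) contribute to the performance gains; design variants of each module are also compared.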
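
The headline percentages are relative gains, not percentage-point differences. As a check, the snippet below recomputes them from Pass@1 values in the results table, assuming the baseline in each case is few-shot prompting on the same base model. That pairing is an inference from the numbers, not something this summary states, and the helper name `relative_improvement` is made up for the example.

```python
def relative_improvement(baseline, new):
    """Relative gain of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# (few-shot Pass@1, AceCoder Pass@1) copied from the results table.
rows = {
    "MBPP, CodeGen-6B": (14.60, 22.83),
    "MBJP, CodeGeeX-13B": (16.63, 28.38),
    "MBJSP, CodeGeeX-13B": (11.16, 21.03),
}

for name, (few_shot, acecoder) in rows.items():
    gain = relative_improvement(few_shot, acecoder)
    print(f"{name}: {gain:.1f}% relative improvement")
```

These three rows reproduce the 56.4%, 70.7%, and 88.4% figures quoted above.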

This review was generated by AI and checked by a human editor.