QUICK REVIEW

[Paper Review] Structured Chain-of-Thought Prompting for Code Generation

Jia Li, Ge Li | arXiv (Cornell University) | May 11, 2023

Software Engineering Research | Cited by 17

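One-Sentence Summary

SCoT prompting asks large language models to build their intermediate reasoning from program structures (sequence, branch, loop, and input/output), which improves code-generation accuracy over standard CoT prompting across several benchmarks and programming languages.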

ABSTRACT

Large Language Models (LLMs) (e.g., ChatGPT) have shown impressive performance in code generation. LLMs take prompts as inputs, and Chain-of-Thought (CoT) prompting is the state-of-the-art prompting technique. CoT prompting asks LLMs first to generate CoTs (i.e., intermediate natural language reasoning steps) and then output the code. However, CoT prompting is designed for natural language generation and has low accuracy in code generation. In this paper, we propose Structured CoTs (SCoTs) and present a novel prompting technique for code generation, named SCoT prompting. Our motivation is source code contains rich structural information and any code can be composed of three program structures (i.e., sequence, branch, and loop structures). Intuitively, structured intermediate reasoning steps make for structured source code. Thus, we ask LLMs to use program structures to build CoTs, obtaining SCoTs. Then, LLMs generate the final code based on SCoTs. Compared to CoT prompting, SCoT prompting explicitly constrains LLMs to think about how to solve requirements from the view of source code and further the performance of LLMs in code generation. We apply SCoT prompting to two LLMs (i.e., ChatGPT and Codex) and evaluate it on three benchmarks (i.e., HumanEval, MBPP, and MBCPP). (1) SCoT prompting outperforms the state-of-the-art baseline - CoT prompting by up to 13.79% in Pass@1. (2) Human evaluation shows human developers prefer programs from SCoT prompting. (3) SCoT prompting is robust to examples and achieves substantial improvements.

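Motivation and Objectives

Improve code generation by aligning the model's reasoning with the structure of source code.
Introduce structured CoTs (SCoTs), whose intermediate steps are built from sequence, branch, loop, and input/output (IO) structures.
Propose SCoT prompting: generate a SCoT first, then implement the code from it, to reduce error propagation.
Evaluate SCoT prompting on benchmark datasets across multiple LLMs and programming languages.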

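Proposed Method

Define a structured chain of thought (SCoT) composed of sequence, branch, loop, and input/output structures.
Design two prompts: one that generates the SCoT, and one that generates code from the SCoT.
Use a two-step generation pipeline with a debugging step to reduce error accumulation.
Apply nucleus sampling with fixed prompts to generate multiple candidates per requirement (for Pass@k evaluation).
Compare SCoT prompting against zero-shot, few-shot, and CoT baselines on HumanEval (Python), MBPP (Python), and MBCPP (C++).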
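The two-step flow described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the prompt wording, the `scot_generate` helper, and the `fake_llm` stand-in are all hypothetical, and the real prompts in the paper may differ (for example, by including worked examples). The second prompt tells the model that the SCoT may contain mistakes, which is one plausible way to realize the debugging step.

```python
from typing import Callable, Tuple

# Hypothetical prompt templates; the wording is an assumption, not the paper's prompts.
SCOT_PROMPT = (
    "Write a structured chain of thought for the requirement below.\n"
    "Use only sequence, branch (if/else), and loop (for/while) structures,\n"
    "and begin with Input and Output lines.\n\n"
    "Requirement:\n{requirement}\n"
)
CODE_PROMPT = (
    "Requirement:\n{requirement}\n\n"
    "Structured chain of thought (it may contain mistakes; check and fix them):\n"
    "{scot}\n\n"
    "Write the final {language} code.\n"
)


def scot_generate(
    requirement: str, llm: Callable[[str], str], language: str = "Python"
) -> Tuple[str, str]:
    """Step 1: ask for a SCoT. Step 2: ask for code conditioned on that SCoT."""
    scot = llm(SCOT_PROMPT.format(requirement=requirement))
    code = llm(
        CODE_PROMPT.format(requirement=requirement, scot=scot, language=language)
    )
    return scot, code


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model so the sketch runs offline; returns canned text."""
    if prompt.startswith("Write a structured chain of thought"):
        return (
            "Input: nums: list of int\n"
            "Output: total: int\n"
            "1: initialize total to 0\n"
            "2: for each n in nums:\n"
            "3:     if n is even:\n"
            "4:         add n to total\n"
            "5: return total"
        )
    return (
        "def sum_even(nums):\n"
        "    total = 0\n"
        "    for n in nums:\n"
        "        if n % 2 == 0:\n"
        "            total += n\n"
        "    return total\n"
    )


scot, code = scot_generate("Return the sum of the even numbers in a list.", fake_llm)
```

The canned SCoT shows the intended shape: an input/output signature followed by numbered steps that use only a loop and a branch. Swapping `fake_llm` for a real model client leaves the rest of the sketch unchanged.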

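Experimental Results

Research Questions

RQ1: Does SCoT prompting improve code-generation accuracy (Pass@k) over the baselines across benchmarks and LLMs?
RQ2: Do developers prefer programs generated with SCoT prompting over those from the baselines?
RQ3: Is SCoT prompting robust to the choice of example seeds and writing styles?
RQ4: How much do the basic program structures (sequence, branch, loop) and the IO structure each contribute to the performance of SCoT prompting?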

Key Findings

Benchmark	Base Model	Prompting Technique	Pass@1 (%)	Pass@3 (%)	Pass@5 (%)
HumanEval	ChatGPT	CoT prompting	53.29	69.76	75.52
HumanEval	ChatGPT	SCoT prompting	60.64	73.53	77.32
MBPP	ChatGPT	CoT prompting	41.83	51.04	54.57
MBPP	ChatGPT	SCoT prompting	46.98	55.31	58.36
MBCPP	ChatGPT	CoT prompting	53.51	63.84	67.03
MBCPP	ChatGPT	SCoT prompting	57.06	65.70	68.70
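This summary does not say how the Pass@k columns were computed. A common choice for HumanEval-style benchmarks is the unbiased estimator introduced with Codex: draw n samples per problem, count the c samples that pass all tests, and estimate pass@k = 1 - C(n-c, k) / C(n, k). The sketch below assumes that estimator; the function name and the sample counts in the usage lines are illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them pass, budget k."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples fail, so the result is 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative numbers only: 20 samples for one problem, 5 of which pass.
p1 = pass_at_k(20, 5, 1)  # chance that a single drawn sample passes
p5 = pass_at_k(20, 5, 5)  # chance that at least one of 5 drawn samples passes
```

A benchmark-level score is the mean of this per-problem value over all problems, usually reported as a percentage.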

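SCoT prompting raises Pass@1 over CoT prompting by up to 13.79% on HumanEval, 12.31% on MBPP, and 6.63% on MBCPP (relative improvement).
Human evaluators preferred programs generated with SCoT prompting, rating them above the baselines on correctness, code smell, and maintainability.
SCoT prompting yields gains with both LLMs (ChatGPT and Codex) and in both languages (Python and C++), and is robust to the choice of example seeds and writing styles.
The ablation study shows that the three basic structures and the IO structure each contribute to the gains, and that the basic structures clearly help produce feasible, well-structured intermediate reasoning.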
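The percentages in the first finding are relative gains, not percentage-point differences. As a quick check, they can be recomputed from the ChatGPT Pass@1 values in the table; the helper name `relative_gain` is just for illustration.

```python
# Pass@1 values copied from the ChatGPT rows of the table: (CoT, SCoT).
pass_at_1 = {
    "HumanEval": (53.29, 60.64),
    "MBPP": (41.83, 46.98),
    "MBCPP": (53.51, 57.06),
}


def relative_gain(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100


gains = {
    name: round(relative_gain(cot, scot), 2)
    for name, (cot, scot) in pass_at_1.items()
}
# gains == {"HumanEval": 13.79, "MBPP": 12.31, "MBCPP": 6.63}
```

In absolute terms the same rows differ by 7.35, 5.15, and 3.55 percentage points respectively.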


This review was generated by AI and reviewed by human editors.