QUICK REVIEW

[Paper Review] Repairing Bugs in Python Assignments Using Large Language Models

Jialu Zhang, José Cambronero|arXiv (Cornell University)|Sep 29, 2022

Software Engineering Research | Cited by 30

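One-Sentence Summary

MMAPR uses a large language model trained on code to repair both syntactic and semantic mistakes in student Python assignments. It outperforms a state-of-the-art baseline that combines syntax and semantic repair tools, especially when few-shot learning is used.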

ABSTRACT

Students often make mistakes on their introductory programming assignments as part of their learning process. Unfortunately, providing custom repairs for these mistakes can require a substantial amount of time and effort from class instructors. Automated program repair (APR) techniques can be used to synthesize such fixes. Prior work has explored the use of symbolic and neural techniques for APR in the education domain. Both types of approaches require either substantial engineering efforts or large amounts of data and training. We propose to use a large language model trained on code, such as Codex, to build an APR system -- MMAPR -- for introductory Python programming assignments. Our system can fix both syntactic and semantic mistakes by combining multi-modal prompts, iterative querying, test-case-based selection of few-shots, and program chunking. We evaluate MMAPR on 286 real student programs and compare to a baseline built by combining a state-of-the-art Python syntax repair engine, BIFI, and state-of-the-art Python semantic repair engine for student assignments, Refactory. We find that MMAPR can fix more programs and produce smaller patches on average.

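Research Motivation and Goals

Advance scalable automated feedback in programming education by reducing the workload of instructors and teaching assistants.
Develop a unified system that can repair both syntactic and semantic mistakes in introductory Python assignments.
Leverage a large language model trained on code, combined with multi-modal prompts and weak supervision, to improve repair quality.
Evaluate MMAPR against a strong baseline and analyze design choices such as program chunking and few-shot learning.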

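Proposed Method

Use a large language model trained on code (Codex), queried with multi-modal prompts, as the core engine for both syntactic and semantic repair.
Run a syntax phase with a program chunker that isolates the actionable code fragment for prompting.
Generate multiple syntax prompts and validate the results; accept patches that pass the syntax oracle.
Run a semantic phase that guides repair with the natural-language task description and test cases; optionally add peers' solutions, selected by test-suite similarity, as few-shot examples.
Incorporate few-shot learning by retrieving similar past incorrect/correct program pairs based on test-suite vectors.
When multiple valid candidate repairs exist, prefer the one with the smallest token edit distance to the original buggy program.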
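
The validation and selection steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the MMAPR implementation: candidate repairs are passed in as plain strings rather than obtained from Codex, the syntax oracle is approximated with Python's own parser, the tokenizer is a crude regular expression, and few-shot retrieval uses Hamming distance between pass/fail test vectors. All function names (`passes_syntax_oracle`, `token_edit_distance`, `select_repair`, `pick_few_shot`) are hypothetical.

```python
import ast
import re


def passes_syntax_oracle(source: str) -> bool:
    """Stand-in syntax oracle: accept a program if Python's parser accepts it."""
    try:
        ast.parse(source)
    except (SyntaxError, ValueError):
        return False
    return True


def tokenize_simple(source: str) -> list[str]:
    """Crude lexer (an assumption): word-like runs or single symbols."""
    return re.findall(r"\w+|[^\w\s]", source)


def token_edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed over token sequences."""
    x, y = tokenize_simple(a), tokenize_simple(b)
    prev = list(range(len(y) + 1))
    for i, tx in enumerate(x, start=1):
        curr = [i]
        for j, ty in enumerate(y, start=1):
            cost = 0 if tx == ty else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def passes_tests(source: str, func_name: str, tests) -> bool:
    """Run a candidate against (args, expected) pairs.

    exec() is used only for illustration; untrusted student code
    would need to run in a sandbox.
    """
    namespace = {}
    try:
        exec(source, namespace)
        return all(namespace[func_name](*args) == expected
                   for args, expected in tests)
    except Exception:
        return False


def select_repair(buggy: str, candidates, func_name: str, tests):
    """Keep candidates that parse and pass all tests; prefer the smallest patch."""
    valid = [c for c in candidates
             if passes_syntax_oracle(c) and passes_tests(c, func_name, tests)]
    if not valid:
        return None
    return min(valid, key=lambda c: token_edit_distance(buggy, c))


def pick_few_shot(buggy_vector, past_pairs):
    """Pick the past (test_vector, incorrect, corrected) triple whose
    pass/fail vector is closest to the buggy program's vector."""
    def hamming(u, v):
        return sum(p != q for p, q in zip(u, v))
    return min(past_pairs, key=lambda pair: hamming(buggy_vector, pair[0]))


if __name__ == "__main__":
    buggy = "def add(a, b):\n    return a - b\n"
    candidates = [
        "def add(a, b):\n    total = a + b\n    return total\n",
        "def add(a, b):\n    return a + b\n",
    ]
    tests = [((1, 2), 3), ((0, 0), 0)]
    print(select_repair(buggy, candidates, "add", tests))
```

Both candidates in the demo pass the tests, so the choice falls to token edit distance, which favors the one-token change over the rewrite that introduces a temporary variable.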

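Experimental Results

Research Questions

RQ1: Can MMAPR repair a higher proportion of buggy Python submissions than a state-of-the-art baseline that combines syntax and semantic repair tools?
RQ2: How do MMAPR's design decisions (program chunking, iterative querying, few-shot learning, multi-modal prompts) affect repair rate and patch size?
RQ3: Can a large language model trained on code (LLMC) handle syntactic and semantic mistakes in student assignments in a unified way?
RQ4: How close are MMAPR's repairs to the original student submissions, compared with the instructor's reference solution?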

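Key Findings

Problem ID	Submissions	MMAPR TED (SD), no few-shot	MMAPR repair rate (%), no few-shot	MMAPR TED (SD), with few-shot	MMAPR repair rate (%), with few-shot	BIFI + Refactory repair rate (%)	BIFI + Refactory TED (SD)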
2865	11	6.45 (4.74)	100.00	6.45 (4.74)	100.00	100.00	5.28 (4.27)
2868	28	2.75 (2.17)	82.14	2.75 (2.17)	100.00	100.00	1.83 (1.11)
2869	23	2.91 (2.41)	73.91	2.91 (2.41)	100.00	100.00	8.35 (7.00)
2870	27	2.33 (2.18)	85.19	2.33 (2.18)	100.00	100.00	15.74 (23.92)
2872	18	2.39 (1.20)	72.22	2.39 (1.20)	100.00	100.00	7.39 (13.01)
2873	32	2.84 (2.58)	84.38	2.84 (2.58)	90.63	90.63	12.93 (15.47)
2874	16	2.06 (1.84)	87.50	2.06 (1.84)	100.00	100.00	8.50 (11.76)
2875	23	2.78 (2.71)	78.26	2.78 (2.71)	100.00	78.26	11.52 (12.52)
2877	21	2.19 (1.29)	80.95	2.19 (1.29)	100.00	80.95	9.14 (16.79)
2878	25	4.84 (8.58)	0.00	4.84 (8.58)	100.00	40.2	36.32 (59.53)
2879	21	18.86 (21.24)	66.67	18.86 (21.24)	85.71	85.71	132.78 (52.61)
2882	23	17.39 (23.23)	86.96	17.39 (23.23)	91.30	0.00	106.57 (77.57)
2883	5	5.60 (9.74)	80.00	5.60 (9.74)	100.00	40.00	53 (0.00)
2920	10	10.30 (18.68)	50.00	10.30 (18.68)	80.00	0.00	N/A
2921	3	1.67 (0.58)	100.00	1.67 (0.58)	100.00	0.00	N/A
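
As a quick consistency check, the per-problem rows can be aggregated into an overall figure. The sketch below copies the submission counts and the repair-rate column for MMAPR with few-shot examples from the table, converts each percentage back to a whole number of repaired programs (an assumption, since the table gives only rounded percentages), and recomputes the overall rate. The variable names are illustrative.

```python
# problem ID -> (number of submissions, repair rate in % with few-shot examples)
rows = {
    2865: (11, 100.00), 2868: (28, 100.00), 2869: (23, 100.00),
    2870: (27, 100.00), 2872: (18, 100.00), 2873: (32, 90.63),
    2874: (16, 100.00), 2875: (23, 100.00), 2877: (21, 100.00),
    2878: (25, 100.00), 2879: (21, 85.71), 2882: (23, 91.30),
    2883: (5, 100.00), 2920: (10, 80.00), 2921: (3, 100.00),
}

# Total number of student programs across all problems.
total = sum(n for n, _ in rows.values())

# Recover whole-program counts from the rounded percentages.
fixed = sum(round(n * rate / 100) for n, rate in rows.values())

# Overall repair rate, weighted by the number of submissions per problem.
overall = 100 * fixed / total

print(total, fixed, f"{overall:.2f}")
```

Under these assumptions the counts sum to the 286 programs mentioned in the abstract, and the weighted rate comes to 96.50%, which agrees with the overall few-shot repair rate reported for MMAPR.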

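Without few-shot learning, MMAPR repairs 86.71% of the programs, exceeding the baseline's 67.13%.
With few-shot learning, MMAPR's repair rate rises to 96.50%.
MMAPR's patches have a smaller mean token edit distance from the buggy program (31.29 to 31.40) than the baseline's 42.50.
Iterative syntax querying raises the repair rate from 82.87% to 86.71%.
Removing the program chunker increases the mean token edit distance from 5.46 to 9.38, indicating that chunking helps keep changes minimal.
Combining the prompt modalities yields the best performance; few-shot examples from peers further improve the repair rate.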

This review was generated by AI and reviewed by a human editor.