QUICK REVIEW

[Paper Review] Learning Performance-Improving Code Edits

Alexander Shypula, Aman Madaan|arXiv (Cornell University)|Feb 15, 2023

Software Engineering Research|Cited by 24

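One-Sentence Summary

TL;DR: This paper introduces PIE, a large dataset of performance-improving edits for C++ code, and shows that retrieval-based prompting, performance-conditioned generation, and self-play finetuning enable large language models to reliably optimize code performance as measured in the gem5 simulator, with a mean speedup that exceeds the best human performance.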

ABSTRACT

With the decline of Moore's law, optimizing program performance has become a major focus of software research. However, high-level optimizations such as API and algorithm changes remain elusive due to the difficulty of understanding the semantics of code. Simultaneously, pretrained large language models (LLMs) have demonstrated strong capabilities at solving a wide range of programming tasks. To that end, we introduce a framework for adapting LLMs to high-level program optimization. First, we curate a dataset of performance-improving edits made by human programmers of over 77,000 competitive C++ programming submission pairs, accompanied by extensive unit tests. A major challenge is the significant variability of measuring performance on commodity hardware, which can lead to spurious "improvements." To isolate and reliably evaluate the impact of program optimizations, we design an environment based on the gem5 full system simulator, the de facto simulator used in academia and industry. Next, we propose a broad range of adaptation strategies for code optimization; for prompting, these include retrieval-based few-shot prompting and chain-of-thought, and for finetuning, these include performance-conditioned generation and synthetic data augmentation based on self-play. A combination of these techniques achieves a mean speedup of 6.86 with eight generations, higher than average optimizations from individual programmers (3.66). Using our model's fastest generations, we set a new upper limit on the fastest speedup possible for our dataset at 9.64 compared to using the fastest human submissions available (9.56).

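Motivation and Goals

Provide a dataset and framework for studying high-level program optimization with LLMs.
Use the gem5 simulator to obtain reliable, reproducible performance measurements.
Evaluate prompting and finetuning strategies for adapting pretrained code LLMs to performance optimization.
Identify adaptation techniques that exceed human performance in mean speedup.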

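Proposed Method

Curate PIE, a dataset of Performance-Improving Edits drawn from CodeNet, with execution times annotated using gem5.
Use the gem5 full system simulator to obtain deterministic performance measurements.
Evaluate prompting strategies, including instruction prompting, chain-of-thought, and dynamic retrieval-based few-shot prompting.
Explore finetuning approaches: a high-quality subset, performance-conditioned generation, and synthetic data generated through self-play.
Introduce performance tags that steer generation toward higher-performing solutions.
Augment the data with LLM-generated synthetic examples, filtered for novelty and speedup.
Evaluate effectiveness using the percentage of programs optimized, the speedup, and correctness on the test set.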
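The performance tags mentioned above can be illustrated with a small sketch. It assumes that each training pair is labeled with a bucket derived from quantiles of the observed speedups, and that at inference time the model is always prompted with the top bucket. The ten-bucket choice, the prompt wording, and the helper names (`bin_edges`, `performance_tag`, `training_prompt`, `inference_prompt`) are illustrative assumptions, not the paper's exact format.

```python
from bisect import bisect_right

NUM_BINS = 10  # assumption: ten performance buckets


def bin_edges(speedups, num_bins=NUM_BINS):
    """Return num_bins - 1 quantile cut points from a non-empty list of speedups."""
    ordered = sorted(speedups)
    n = len(ordered)
    return [ordered[(i * n) // num_bins] for i in range(1, num_bins)]


def performance_tag(speedup, edges):
    """Map a speedup to a bucket in 1..len(edges) + 1 (higher means faster)."""
    return bisect_right(edges, speedup) + 1


def training_prompt(slow_src, speedup, edges):
    """Prefix a training example with the bucket its target actually reached."""
    tag = performance_tag(speedup, edges)
    total = len(edges) + 1
    return (
        f"// target performance bucket: {tag}/{total}\n"
        f"// slower version:\n{slow_src}\n"
        f"// optimized version:\n"
    )


def inference_prompt(slow_src, num_bins=NUM_BINS):
    """At test time, always request the best bucket."""
    return (
        f"// target performance bucket: {num_bins}/{num_bins}\n"
        f"// slower version:\n{slow_src}\n"
        f"// optimized version:\n"
    )
```

Because the tag is part of the input during finetuning, the model can learn what separates a modest edit from a large one; asking for the top bucket at inference is then a way to request the latter.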

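Experimental Results

Research Questions

RQ1: Can PIE be used to effectively adapt large language models to high-level code optimization?
RQ2: Which prompting or finetuning strategies best improve performance and correctness when optimizing code?
RQ3: How do retrieval-based prompting, performance-conditioned generation, and synthetic self-play data compare in driving speedup?
RQ4: What gap exists between open-source models and closed models such as GPT-3.5 in this setting, and can open-source models close it with appropriate adaptation?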

Key Findings

Scenario	Model	%Opt	Speedup	Correct
Human reference	Best human	100.00%	4.06	100.00%
Human reference	Same human	100.00%	3.64	100.00%
All models, prompting	gpt-3.5, FS-CoT	43.78%	1.61	93.15%
Open-source, retrieval	codellama 34B	42.16%	2.57	77.92%
Black-box, retrieval	gpt4	69.03%	3.56	95.90%
Open-source, finetuned	codellama 13B-PC	66.60%	5.65	71.08%
Black-box, finetuned	gpt-3.5, SP	87.68%	6.86	95.11%
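As a rough illustration of how the three table columns could be computed, the sketch below aggregates per-program results. It assumes speedup is defined as old time divided by new time, that a generation which fails the tests or is not faster counts as a speedup of 1.0, and that a program must be correct and at least 10% faster to count toward %Opt. These conventions, the `Result` type, and the function names are assumptions made for this example and should be checked against the paper before reuse.

```python
from dataclasses import dataclass


@dataclass
class Result:
    old_time: float      # simulated running time of the original program
    new_time: float      # simulated running time of the generated program (> 0)
    passed_tests: bool   # True if the generated program passes all unit tests


def speedup(r):
    """old/new for correct, faster programs; 1.0 otherwise (assumed fallback)."""
    if not r.passed_tests or r.new_time >= r.old_time:
        return 1.0
    return r.old_time / r.new_time


def summarize(results, min_gain=1.10):
    """Return %Opt, mean speedup, and fraction correct over a non-empty list."""
    n = len(results)
    correct = sum(r.passed_tests for r in results) / n
    optimized = sum(
        r.passed_tests and r.old_time / r.new_time >= min_gain for r in results
    ) / n
    mean_speedup = sum(speedup(r) for r in results) / n
    return {"pct_opt": optimized, "speedup": mean_speedup, "correct": correct}
```

Under these assumptions an incorrect but fast generation lowers the Correct column without raising the mean speedup, which is why a model can show a high Speedup alongside a lower Correct value.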

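The training set of 77,967 pairs drawn from 1,474 problems supports reliable training and evaluation for performance optimization.
gem5-based evaluation provides deterministic performance measurements, mitigating the spurious improvements seen on real hardware.
Dynamic retrieval-based prompting substantially outperforms the baselines; for example, GPT-3.5 with retrieval attains high correctness and speedup.
Finetuning on PIE yields substantial improvements; performance-conditioned generation markedly improves optimization performance.
GPT-3.5 finetuned with synthetic self-play data achieves the highest reported mean speedup on the test set (6.86×), exceeding the best human solutions (4.06×).
Open-source code models (codellama) can approach or match the performance of closed models when suitable adaptation strategies are applied.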
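The dynamic retrieval-based prompting referred to in these findings can be sketched as follows: for each program to optimize, pick the most similar slow programs from the training pairs and show their slow/fast versions as few-shot examples. This sketch uses token-set Jaccard similarity purely as a self-contained stand-in; a real system would more likely use learned embeddings with a nearest-neighbor index. The function names and prompt wording are hypothetical.

```python
import re


def tokens(src):
    """Crude tokenizer: identifiers, plus every other non-space character."""
    return set(re.findall(r"[A-Za-z_]\w*|\S", src))


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(query_src, train_pairs, k=2):
    """Return the k (slow, fast) pairs whose slow side is most similar to the query."""
    q = tokens(query_src)
    ranked = sorted(
        train_pairs, key=lambda pair: jaccard(q, tokens(pair[0])), reverse=True
    )
    return ranked[:k]


def build_prompt(query_src, train_pairs, k=2):
    """Assemble a few-shot prompt from retrieved pairs, ending with the query."""
    parts = [
        f"// slower version:\n{slow}\n// optimized version:\n{fast}\n"
        for slow, fast in retrieve(query_src, train_pairs, k)
    ]
    parts.append(f"// slower version:\n{query_src}\n// optimized version:\n")
    return "\n".join(parts)
```

The point of retrieving per query, rather than using a fixed set of examples, is that the demonstrations then tend to show optimizations relevant to the program at hand.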


This review was generated by AI and reviewed by a human editor.