QUICK REVIEW

[Paper Review] Learning Performance-Improving Code Edits

Alexander Shypula, Aman Madaan | arXiv (Cornell University) | 2023. 02. 15.

Software Engineering Research | Citations: 24

One-Line Summary

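This paper builds PIE, a large-scale dataset of performance-improving edits to C++ code, and shows how retrieval-based prompting, performance-conditioned generation, and self-play fine-tuning reliably adapt LLMs to code performance optimization. Measured on the gem5 simulator, the best configuration's mean speedup exceeds that of the best human submissions.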

ABSTRACT

With the decline of Moore's law, optimizing program performance has become a major focus of software research. However, high-level optimizations such as API and algorithm changes remain elusive due to the difficulty of understanding the semantics of code. Simultaneously, pretrained large language models (LLMs) have demonstrated strong capabilities at solving a wide range of programming tasks. To that end, we introduce a framework for adapting LLMs to high-level program optimization. First, we curate a dataset of performance-improving edits made by human programmers of over 77,000 competitive C++ programming submission pairs, accompanied by extensive unit tests. A major challenge is the significant variability of measuring performance on commodity hardware, which can lead to spurious "improvements." To isolate and reliably evaluate the impact of program optimizations, we design an environment based on the gem5 full system simulator, the de facto simulator used in academia and industry. Next, we propose a broad range of adaptation strategies for code optimization; for prompting, these include retrieval-based few-shot prompting and chain-of-thought, and for finetuning, these include performance-conditioned generation and synthetic data augmentation based on self-play. A combination of these techniques achieves a mean speedup of 6.86 with eight generations, higher than average optimizations from individual programmers (3.66). Using our model's fastest generations, we set a new upper limit on the fastest speedup possible for our dataset at 9.64 compared to using the fastest human submissions available (9.56).

Motivation and Objectives

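Provide a dataset and framework for studying high-level program optimization with LLMs.
Enable reliable, reproducible performance measurement using the gem5 simulator.
Evaluate prompting and fine-tuning strategies for adapting pretrained code LLMs to performance optimization.
Identify effective adaptation techniques that surpass human performance in mean speedup.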

Proposed Method

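Curate the Performance-Improving Edits (PIE) dataset, with execution times annotated using gem5.
Use the gem5 full system simulator to obtain deterministic performance measurements.
Evaluate prompting strategies, including instruction prompting, chain-of-thought, and dynamic retrieval-based few-shot prompting.
Explore fine-tuning approaches, including a high-quality subset, performance-conditioned generation, and synthetic data from self-play.
Introduce performance tags that steer generation toward higher-performing solutions.
Augment the data with LLM-generated synthetic examples, targeting novelty and speedup.
Measure effectiveness on the test set by the percentage of programs optimized, speedup, and correctness.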
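The performance-tag idea above can be illustrated with a minimal sketch. It assumes a tag is a rank bucket of the optimized program's runtime among all known solutions to the same problem, and that the model is asked for the top bucket at inference time. The prompt template, the function names, and the ten-bucket default are hypothetical choices made for illustration, not the format confirmed by this review.

```python
# Hypothetical prompt template; the exact wording used for training is an assumption.
PROMPT = (
    "// slower version:\n{src}\n"
    "// This is a {tag}/{total} optimization.\n"
    "// optimized version:\n"
)


def performance_tag(runtime, peer_runtimes, num_buckets=10):
    """Bucket a solution by its rank among known solutions to the same problem.

    Returns an integer in 1..num_buckets; num_buckets marks the fastest bucket.
    """
    # Peers (including this solution, if present) that are no faster than it.
    no_faster = sum(1 for t in peer_runtimes if t >= runtime)
    # Integer ceiling of no_faster * num_buckets / len(peer_runtimes).
    bucket = -((-no_faster * num_buckets) // len(peer_runtimes))
    return max(1, bucket)


def training_example(slow_src, fast_src, fast_runtime, peer_runtimes, num_buckets=10):
    """Build a (prompt, target) pair whose prompt is tagged with the target's bucket."""
    tag = performance_tag(fast_runtime, peer_runtimes, num_buckets)
    return PROMPT.format(src=slow_src, tag=tag, total=num_buckets), fast_src


def inference_prompt(slow_src, num_buckets=10):
    """At inference time, always request the highest-performing bucket."""
    return PROMPT.format(src=slow_src, tag=num_buckets, total=num_buckets)
```

Conditioning on the tag lets the training set keep modest edits as well as large ones, while generation is steered toward the fastest bucket.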

Experimental Results

Research Questions

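RQ1: Can large language models be effectively adapted to high-level code optimization using PIE?
RQ2: Which prompting or fine-tuning strategies best improve performance and correctness in code optimization?
RQ3: How do retrieval-based prompting, performance-conditioned generation, and synthetic self-play data drive speedups?
RQ4: How large is the gap between open models and closed models such as GPT-3.5 in this setting, and can open models narrow it with appropriate adaptation?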

Key Results

Scenario	Model	% Optimized	Speedup (×)	% Correct
Human reference	Best Human	100.00%	4.06	100.00%
Human reference	Same Human	100.00%	3.64	100.00%
All Models, Prompt	gpt-3.5, FS-CoT	43.78%	1.61	93.15%
Open-Source, Retrieval	codellama 34B	42.16%	2.57	77.92%
Black-Box, Retrieval	gpt4	69.03%	3.56	95.90%
Open-Source, FineTune	codellama 13B-PC	66.60%	5.65	71.08%
Black-Box, FineTune	gpt-3.5, SP	87.68%	6.86	95.11%
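As a rough illustration of how the three table columns could be computed when several generations are sampled per program, the sketch below makes explicit assumptions: speedup is the old runtime divided by the new runtime, an incorrect or slower generation counts as a 1.0× speedup, a program counts as optimized only at a 1.1× speedup or better, and a program counts as correct if any sampled generation passes its tests. The `Generation` type, the function names, and the threshold are hypothetical and not taken from this review.

```python
from dataclasses import dataclass


@dataclass
class Generation:
    correct: bool   # passed every unit test
    runtime: float  # simulated runtime, in the same unit as the original program


def speedup(old_runtime, gen):
    """old/new runtime; incorrect or slower generations fall back to 1.0 (assumption)."""
    if not gen.correct or gen.runtime >= old_runtime:
        return 1.0
    return old_runtime / gen.runtime


def best_of_k(old_runtime, gens):
    """Best speedup among the k generations sampled for one program."""
    return max((speedup(old_runtime, g) for g in gens), default=1.0)


def summarize(problems, min_speedup=1.1):
    """problems: list of (old_runtime, [Generation, ...]) tuples, one per test program."""
    n = len(problems)
    best = [best_of_k(old, gens) for old, gens in problems]
    return {
        "pct_optimized": 100.0 * sum(s >= min_speedup for s in best) / n,
        "mean_speedup": sum(best) / n,
        "pct_correct": 100.0 * sum(any(g.correct for g in gens) for _, gens in problems) / n,
    }


demo = [
    (10.0, [Generation(True, 5.0), Generation(False, 1.0)]),  # best correct: 2.0x
    (10.0, [Generation(False, 2.0)]),                         # incorrect: counts as 1.0x
]
print(summarize(demo))  # mean_speedup is (2.0 + 1.0) / 2 = 1.5
```

Under the 1.0× fallback, a model with low correctness can still show a high mean speedup, which is one way to read rows where speedup is high but correctness is comparatively low.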

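The dataset of 77,967 training pairs across 1,474 problems enables reliable training and evaluation for performance optimization.
gem5-based evaluation provides deterministic performance measurements, mitigating the phantom improvements seen on real hardware.
Dynamic retrieval-based prompting substantially outperforms the baselines; for example, GPT-3.5 with retrieval achieves high correctness and speedup.
Fine-tuning on PIE yields strong gains, and performance-conditioned generation markedly improves optimization performance (5.65× for codellama 13B-PC in the table above).
GPT-3.5 fine-tuned with synthetic self-play data achieves a reported mean speedup of 6.86× on the test set, exceeding the 4.06× of the best human solutions.
Open-source models (codellama) can approach or match closed models when given an appropriate adaptation strategy.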
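Dynamic retrieval-based few-shot prompting, as described above, can be sketched as follows: for each program to optimize, pick the most similar slow programs from the training pairs and show their slow/fast versions as in-context examples. This sketch swaps in a simple token-overlap (Jaccard) similarity so that it stays self-contained; the retriever actually used in the paper is not specified in this review, and the prompt layout and function names are hypothetical.

```python
import re


def tokens(code):
    """Crude lexer: identifiers plus single non-space characters, as a set."""
    return set(re.findall(r"[A-Za-z_]\w*|\S", code))


def jaccard(a, b):
    """Token-set overlap in [0, 1]; a stand-in for a learned code-similarity model."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def retrieve(query_src, pairs, k=2):
    """Return the k (slow, fast) pairs whose slow side is most similar to the query."""
    return sorted(pairs, key=lambda p: jaccard(query_src, p[0]), reverse=True)[:k]


def build_prompt(query_src, pairs, k=2):
    """Assemble a few-shot prompt from the retrieved pairs, then append the query."""
    parts = []
    for slow, fast in retrieve(query_src, pairs, k):
        parts.append(f"// slower version:\n{slow}\n// optimized version:\n{fast}\n")
    parts.append(f"// slower version:\n{query_src}\n// optimized version:\n")
    return "\n".join(parts)


# Illustrative training pairs only; not taken from the PIE dataset.
train_pairs = [
    ("for (int i = 0; i < n; i++) sum += i;", "sum = n * (n - 1) / 2;"),
    ("std::cin >> x;", "scanf(\"%d\", &x);"),
]
print(build_prompt("for (int j = 0; j < m; j++) total += j;", train_pairs, k=1))
```

Choosing examples per query, rather than fixing one set of few-shot examples, is what makes the retrieval dynamic: the model sees edits that were applied to code resembling its input.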

This review was generated by AI and reviewed by a human editor.