QUICK REVIEW

[Paper Review] RepairLLaMA: Efficient Representations and Fine-Tuned Adapters for Program Repair

André Silva, Sen Fang|arXiv (Cornell University)|Dec 25, 2023

Software Testing and Debugging Techniques | Cited by 16

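One-Sentence Summary

RepairLLaMA presents a repair-adapter approach that combines code-specific representations with LoRA-based parameter-efficient fine-tuning, achieving state-of-the-art repair performance on Defects4J v2 and HumanEval-Java, including on multi-location bugs.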

ABSTRACT

Automated Program Repair (APR) has evolved significantly with the advent of Large Language Models (LLMs). Fine-tuning LLMs for program repair is a recent avenue of research, with many dimensions which have not been explored. Existing work mostly fine-tune LLMs with naive code representations and does not scale to frontier models. To address this problem, we propose RepairLLaMA, a novel program repair approach that 1) identifies optimal code representations for APR with fine-tuned models, and 2) pioneers state-of-the-art parameter-efficient fine-tuning technique (PEFT) for program repair. This results in RepairLLaMA producing a highly effective 'program repair adapter' for fixing bugs with AI. Our experiments demonstrate the validity of both concepts. First, fine-tuning adapters with program repair specific code representations enables the model to use meaningful repair signals and produce better patches. Second, parameter-efficient fine-tuning helps fine-tuning to converge and clearly contributes to the effectiveness of RepairLLaMA in fixing bugs outside the fine-tuning data distribution. Overall, RepairLLaMA correctly fixes 144 Defects4J v2, 109 HumanEval-Java, and 20 GitBug-Java bugs, outperforming all baselines.

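Motivation and Objectives

Advance automated program repair (APR) by exploiting domain-specific code representations.
Investigate how input/output representations that carry fault localization signals affect repair performance.
Evaluate parameter-efficient fine-tuning (LoRA) against full fine-tuning in the APR setting.
Show that a repair adapter attached to a pretrained LLM is effective at fixing Java bugs.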

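Proposed Method

Select an open-source, code-pretrained LLM (CodeLLaMA-7B) as the base model.
Design APR-specific input and output code representations that incorporate fault localization signals and the original buggy code.
Train a repair adapter with LoRA, keeping fine-tuning lightweight (about 4M parameters) while adapting the LLM to program repair.
Curate the fine-tuning dataset (Megadiff) and process it into multiple representation pairs that satisfy a length limit (≤1024 tokens).
Evaluate multiple representation pairs on Defects4J v2 and HumanEval-Java using plausible, AST-match, and exact-match metrics, with baselines (infilling prompts, full fine-tuning) for comparison.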
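To make the representation step concrete, here is a minimal sketch of how a buggy function might be turned into an input/output pair that carries a fault localization signal and the original buggy code. The marker strings (`// buggy code`, `<FILL_ME>`), the helper names, and the whitespace-based token count are assumptions for illustration and are not taken from the review; the sketch also assumes the 1024-token limit applies to input and output combined.

```python
from typing import List, Optional, Tuple

MASK_TOKEN = "<FILL_ME>"        # hypothetical infilling marker
BUGGY_HEADER = "// buggy code"  # hypothetical comment marker
MAX_TOKENS = 1024               # length limit mentioned in the review


def build_input(function_lines: List[str], start: int, end: int) -> str:
    """Comment out the suspicious lines [start, end) and place a mask after them."""
    commented = [BUGGY_HEADER] + ["// " + line.strip() for line in function_lines[start:end]]
    return "\n".join(function_lines[:start] + commented + [MASK_TOKEN] + function_lines[end:])


def rough_token_count(text: str) -> int:
    """Whitespace split, used here as a crude stand-in for a real tokenizer."""
    return len(text.split())


def make_pair(function_lines: List[str], start: int, end: int,
              fixed_chunk: List[str]) -> Optional[Tuple[str, str]]:
    """Return (input, output) or None when the pair exceeds the length limit."""
    source = build_input(function_lines, start, end)
    target = "\n".join(fixed_chunk)  # output is only the fixed chunk, not the whole function
    if rough_token_count(source) + rough_token_count(target) > MAX_TOKENS:
        return None
    return source, target


buggy = [
    "int max(int a, int b) {",
    "    return a < b ? a : b;",
    "}",
]
pair = make_pair(buggy, 1, 2, ["    return a > b ? a : b;"])
print(pair[0])
```

The fault localization signal here is simply the position of the mask, and the commented lines keep the original buggy code visible to the model. A real pipeline would count tokens with the base model's tokenizer.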

Figure 1. Overview of RepairLLaMA. The core novelties of RepairLLaMA are the APR-specific code representations and the engineering of an effective program repair adapter that is plugged into the underlying LLM.
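As a rough illustration of what "plugged into the underlying LLM" means for a LoRA adapter, the sketch below applies the standard LoRA update to a single linear layer: the frozen weight W is left untouched and a low-rank product B·A, scaled by alpha/rank, is added on top. The function names and the tiny matrices are illustrative assumptions; this is plain Python for readability, not the authors' training code.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply matrix m (rows x cols) by vector v (cols)."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(w: Matrix, a: Matrix, b: Matrix, x: Vector, alpha: float) -> Vector:
    """Compute W x + (alpha / rank) * B (A x); only A and B would be trained."""
    rank = len(a)                     # A has shape rank x d_in
    base = matvec(w, x)               # frozen pretrained path
    update = matvec(b, matvec(a, x))  # low-rank adapter path, B has shape d_out x rank
    scale = alpha / rank
    return [p + scale * q for p, q in zip(base, update)]


w = [[1.0, 0.0], [0.0, 1.0]]  # frozen weight (identity, for readability)
a = [[1.0, 1.0]]              # rank-1 down-projection
x = [3.0, 4.0]

# With B initialised to zero the adapter is a no-op, as in standard LoRA.
untrained = lora_forward(w, a, [[0.0], [0.0]], x, alpha=2.0)
# After training, B is non-zero and shifts the layer output.
trained = lora_forward(w, a, [[1.0], [2.0]], x, alpha=2.0)
print(untrained, trained)  # [3.0, 4.0] [17.0, 32.0]
```

Because only A and B are stored, the adapter can be shipped separately from the base model and attached at load time.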

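Experimental Results

Research Questions

RQ1: What is the best code representation when fine-tuning an LLM for program repair?
RQ2: How does parameter-efficient fine-tuning compare with full fine-tuning for program repair?
RQ3: How does RepairLLaMA compare with state-of-the-art ChatGPT-based APR?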

Key Findings

Code representation	Defects4J v2 plausible	Defects4J v2 AST match	Defects4J v2 exact match	HumanEval-Java plausible	HumanEval-Java AST match	HumanEval-Java exact match
IR3 x OR2 (baseline, no fine-tuning)	133	71	52	107	81	71
IR1 x OR1	79	31	29	78	54	52
IR1 x OR3	41	17	15	39	21	21
IR1 x OR4	12	2	2	5	2	2
IR2 x OR2	198	122	121	118	77	69
IR3 x OR2	154	87	84	103	68	63
IR4 x OR2 (RepairLLaMA)	195	125	124	118	82	75
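The table can be read mechanically to see which representation pair leads each column. The short sketch below copies the numbers from the table above and reports the top row or rows per metric; the helper name `best_per_column` is illustrative only.

```python
COLUMNS = (
    "Defects4J v2 plausible",
    "Defects4J v2 AST match",
    "Defects4J v2 exact match",
    "HumanEval-Java plausible",
    "HumanEval-Java AST match",
    "HumanEval-Java exact match",
)

# Values copied from the table above.
ROWS = {
    "IR3 x OR2 (baseline, no fine-tuning)": (133, 71, 52, 107, 81, 71),
    "IR1 x OR1": (79, 31, 29, 78, 54, 52),
    "IR1 x OR3": (41, 17, 15, 39, 21, 21),
    "IR1 x OR4": (12, 2, 2, 5, 2, 2),
    "IR2 x OR2": (198, 122, 121, 118, 77, 69),
    "IR3 x OR2": (154, 87, 84, 103, 68, 63),
    "IR4 x OR2 (RepairLLaMA)": (195, 125, 124, 118, 82, 75),
}


def best_per_column(rows):
    """Map each metric to the sorted list of rows that reach its maximum."""
    best = {}
    for idx, column in enumerate(COLUMNS):
        top = max(values[idx] for values in rows.values())
        best[column] = sorted(name for name, values in rows.items() if values[idx] == top)
    return best


best = best_per_column(ROWS)
for column, names in best.items():
    print(f"{column}: {', '.join(names)}")
```

Read this way, IR4 x OR2 leads every AST-match and exact-match column, ties IR2 x OR2 on HumanEval-Java plausible patches, and trails IR2 x OR2 by three on Defects4J v2 plausible patches.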

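Code representations that include fault localization signals substantially outperform naive representations.
Fine-tuning with repair-specific representations clearly outperforms the baseline without fine-tuning, with notable improvements on both Defects4J v2 and HumanEval-Java.
RepairLLaMA (IR4 x OR2) achieves the best AST-match and exact-match results: it produces plausible patches for 195 Defects4J v2 bugs and 118 HumanEval-Java bugs, with 125 AST matches and 124 exact matches on Defects4J v2. Only IR2 x OR2 yields more plausible Defects4J v2 patches (198).
In this APR setting, parameter-efficient fine-tuning with LoRA outperforms full fine-tuning: RepairLLaMA beats the fully fine-tuned IR4 x OR2 model on several metrics.
The repair adapter, at about 4M parameters, is roughly 1600 times smaller than the base CodeLLaMA-7B, yet it delivers state-of-the-art repair performance and outperforms GPT-4 in the reported results.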
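The "about 4M parameters" and "roughly 1600 times smaller" figures can be sanity-checked with a little arithmetic. The sketch below counts LoRA parameters for one configuration that lands near 4M: rank 8 applied to two square attention projections per layer. The rank and the choice of adapted matrices are assumptions for illustration and are not stated in this review; the hidden size of 4096 and the 32 layers are the dimensions of LLaMA-7B-class models.

```python
def lora_param_count(hidden_size: int, num_layers: int, rank: int,
                     matrices_per_layer: int) -> int:
    """Count trainable LoRA parameters for square hidden_size x hidden_size projections."""
    # Each adapted matrix gets A (rank x hidden_size) and B (hidden_size x rank).
    per_matrix = rank * (hidden_size + hidden_size)
    return num_layers * matrices_per_layer * per_matrix


adapter_params = lora_param_count(
    hidden_size=4096,      # LLaMA-7B-class hidden size
    num_layers=32,         # LLaMA-7B-class layer count
    rank=8,                # assumed LoRA rank
    matrices_per_layer=2,  # assumed: two attention projections per layer
)
base_params = 7_000_000_000  # nominal "7B"; the exact count is somewhat lower
ratio = base_params / adapter_params

print(adapter_params)  # 4194304
print(round(ratio))
```

With the nominal 7 billion figure the ratio comes out a little under 1700; using the model's exact parameter count, which is somewhat below 7 billion, moves it closer to the reported 1600.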

Figure 2. Buggy code of the multi-location bug Chart-5 represented in our four different input representations.


This review was generated by AI and checked by a human editor.