QUICK REVIEW

[Paper Review] Practical Program Repair in the Era of Large Pre-trained Language Models

Chunqiu Steven Xia, Yuxiang Wei|arXiv (Cornell University)|Oct 25, 2022

Software Engineering Research | Cited by 29

One-line summary

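This paper presents the first extensive evaluation of automated program repair (APR) with state-of-the-art large pre-trained language models (PLMs) across multiple datasets and languages. It shows that PLMs can outperform existing APR tools, that larger models generally perform better, and that infilling with suffix context improves patch quality.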

ABSTRACT

Automated Program Repair (APR) aims to help developers automatically patch software bugs. However, current state-of-the-art traditional and learning-based APR techniques face the problem of limited patch variety, failing to fix complicated bugs. This is mainly due to the reliance on bug-fixing datasets to craft fix templates or directly predict potential patches. Large Pre-Trained Language Models (PLMs), trained using billions of text/code tokens, can potentially help avoid this issue. Very recently, researchers have directly leveraged PLMs for APR without relying on any bug-fixing datasets. Meanwhile, such existing work either failed to include state-of-the-art PLMs or was not evaluated on realistic datasets. In this work, we perform the first extensive study on directly applying PLMs for APR. We select 9 recent state-of-the-art PLMs, including both generative and infilling models, ranging from 125M to 20B in size. We designed 3 different repair settings to evaluate the different ways we can use PLMs to generate patches. We apply the PLMs under these repair settings on 5 datasets across 3 different languages and compare different PLMs in the number of bugs fixed, generation speed and compilation rate. Our study demonstrates that directly applying state-of-the-art PLMs can already substantially outperform all existing APR techniques on all our datasets. Among the studied PLMs, the scaling effect exists for APR where larger models tend to achieve better performance. Also, we show for the first time that suffix code after the buggy line (adopted in infilling-style APR) is important in not only generating more fixes but more patches with higher compilation rate. Besides patch generation, the PLMs consider correct patches to be more natural than other ones, and can even be leveraged for effective patch ranking or patch correctness checking.

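Motivation and objectives

Evaluate how a range of large PLMs perform at automated program repair across multiple datasets and languages.
Compare PLM-based APR against state-of-the-art traditional and learning-based APR tools.
Investigate how the repair setting (complete function generation, infilling, single line generation) affects patch quality and speed.
Explore patch ranking and patch correctness checking using PLM-derived metrics such as entropy.
Identify practical guidance for improving the performance of PLM-based APR (sample size, fix templates).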

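Proposed method

Evaluate 9 large PLMs (125M to 20B parameters), including both generative and infilling models, on 5 real-world repair datasets covering Java, Python, and C.
Three repair settings: complete function generation, correct code infilling, and single line generation.
Use prompts and few-shot examples so that the PLMs can generate patches without any bug-fixing data.
Generate multiple patches per bug with nucleus sampling (top-p, temperature) and rank them by patch entropy.
Validate patches by running the test suite to identify plausible patches, then distinguish plausible patches from correct patches.
Compare PLM-based APR against 20 baseline APR tools (learning-based and traditional).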
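
The sampling and ranking step above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. It assumes that "entropy" means the negative log-probability of the generated tokens (summed or averaged per patch), and that lower entropy means a more natural patch. All function names, and the toy probabilities in the usage note, are hypothetical.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn token logits into probabilities; temperature must be > 0.

    Lower temperatures sharpen the distribution, higher ones flatten it.
    """
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def top_p_filter(probs, top_p):
    """Nucleus (top-p) filtering.

    Keep the smallest set of most likely tokens whose cumulative
    probability reaches top_p, then renormalize over that set.
    """
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


def sample_token(probs, top_p, rng):
    """Draw one token from the nucleus of the distribution."""
    filtered = top_p_filter(probs, top_p)
    tokens = list(filtered)
    weights = [filtered[tok] for tok in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


def sum_entropy(token_logprobs):
    """Assumed definition: total negative log-probability of a patch."""
    return -sum(token_logprobs)


def mean_entropy(token_logprobs):
    """Assumed definition: negative log-probability averaged per token.

    Expects a non-empty list of log-probabilities.
    """
    return -sum(token_logprobs) / len(token_logprobs)


def rank_patches(patches, score=mean_entropy):
    """Sort (patch_text, token_logprobs) pairs, lowest entropy first."""
    return sorted(patches, key=lambda patch: score(patch[1]))
```

For example, `rank_patches([("x = a - b", [-2.0, -1.0]), ("x = a + b", [-0.1, -0.2])])` places `"x = a + b"` first, because its tokens were assigned higher probability. In practice the log-probabilities would come from the model that generated the patch.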

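Experimental results

Research questions

RQ1: How do PLMs of different types and sizes perform in each APR setting across datasets and languages?
RQ2: Do PLMs outperform state-of-the-art APR tools on real-world bugs?
RQ3: Can PLMs be used effectively, via entropy, for patch ranking and patch correctness checking?
RQ4: Which strategies (more samples, incorporating fix templates) further improve the performance of PLM-based APR?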

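Key findings

Larger PLMs tend to produce more correct and plausible patches across datasets (scaling effect).
Codex often outperforms the other models in several settings, which this review attributes to its code-focused pre-training and fine-tuning.
Infilling with suffix context (prefix + suffix) improves both the number of fixes and the compilation rate of patches.
When suffix context is available, infilling models outperform generative models on the single line and infilling tasks.
Correct code infilling and single line generation yield a higher ratio of correct to plausible patches than complete function generation.
Patch generation slows down as models grow, but Codex shows strong repair ability even where its inference is slower on some datasets.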
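
To make the role of suffix context concrete, the three repair settings can be pictured as three ways of slicing a buggy function into model input. The sketch below is only an illustration under stated assumptions: the dictionary layout, the function names, and the choice to give single line generation the prefix alone are hypothetical and are not taken from the paper's code.

```python
def build_inputs(function_lines, buggy_index):
    """Build one hypothetical model input per repair setting.

    function_lines: source lines of the buggy function.
    buggy_index: 0-based index of the line assumed to be buggy.
    """
    prefix = "\n".join(function_lines[:buggy_index])
    suffix = "\n".join(function_lines[buggy_index + 1:])
    return {
        # The model sees the whole buggy function and rewrites all of it.
        "complete_function_generation": {
            "buggy_function": "\n".join(function_lines),
        },
        # The model fills the gap between the prefix and the suffix.
        "correct_code_infilling": {"prefix": prefix, "suffix": suffix},
        # Assumed here: the model continues from the prefix only and
        # produces a single replacement line.
        "single_line_generation": {"prefix": prefix},
    }


def apply_line_patch(function_lines, buggy_index, new_line):
    """Return a copy of the function with the buggy line replaced."""
    patched = list(function_lines)
    patched[buggy_index] = new_line
    return patched


# Toy example: the second line subtracts where it should add.
EXAMPLE = [
    "def add(a, b):",
    "    result = a - b",
    "    return result",
]
```

Calling `build_inputs(EXAMPLE, 1)` gives the infilling setting both `"def add(a, b):"` and `"    return result"` as context, while the single line setting sees only the former. The lines after the bug are the extra information the suffix findings above refer to.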


This review was generated by AI and checked by a human editor.