QUICK REVIEW

[Paper Review] Explainable Automated Debugging via Large Language Model-driven Scientific Debugging

Sungmin Kang, Bei Chen|arXiv (Cornell University)|Apr 5, 2023

Software Engineering Research|Cited by 14

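One-line summary

AutoSD uses an LLM to emulate Scientific Debugging through a debugger interface, generating hypotheses and explanations during automated debugging while keeping repair performance competitive. A human study suggests that the explanations support developers' decision-making.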

ABSTRACT

Automated debugging techniques have the potential to reduce developer effort in debugging, and have matured enough to be adopted by industry. However, one critical issue with existing techniques is that, while developers want rationales for the provided automatic debugging results, existing techniques are ill-suited to provide them, as their deduction process differs significantly from that of human developers. Inspired by the way developers interact with code when debugging, we propose Automated Scientific Debugging (AutoSD), a technique that given buggy code and a bug-revealing test, prompts large language models to automatically generate hypotheses, uses debuggers to actively interact with buggy code, and thus automatically reach conclusions prior to patch generation. By aligning the reasoning of automated debugging more closely with that of human developers, we aim to produce intelligible explanations of how a specific patch has been generated, with the hope that the explanation will lead to more efficient and accurate developer decisions. Our empirical analysis on three program repair benchmarks shows that AutoSD performs competitively with other program repair baselines, and that it can indicate when it is confident in its results. Furthermore, we perform a human study with 20 participants, including six professional developers, to evaluate the utility of explanations from AutoSD. Participants with access to explanations could judge patch correctness in roughly the same time as those without, but their accuracy improved for five out of six real-world bugs studied: 70% of participants answered that they wanted explanations when using repair tools, while 55% answered that they were satisfied with the Scientific Debugging presentation.

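Motivation and objectives

Motivate explainable automated debugging as a way to align automated reasoning with the human debugging process.
Propose AutoSD, which uses an LLM to generate hypotheses and debugging experiments via a debugger interface.
Show that AutoSD achieves competitive program repair performance while producing traceable explanations.
Evaluate the usefulness of the explanations through a developer-oriented human study.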

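Proposed method

Prompt the LLM with a detailed description of the Scientific Debugging framework so that it generates hypotheses and corresponding debugging experiments.
Execute the proposed debugger commands, or use edit-and-run scripts, to test each hypothesis.
Use the debugger results to refine hypotheses through Hypothesize-Observe-Conclude iterations, continuing until a <DONE> signal is emitted or the iteration limit is reached.
Generate a patch together with an explanation that traces the intermediate reasoning, for developer review.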
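
The Hypothesize-Observe-Conclude iteration described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the callables `propose`, `run_experiment`, and `conclude` are hypothetical stand-ins for the LLM and debugger calls, the default iteration cap of 3 is an arbitrary assumption, and checking for `<DONE>` inside the conclusion text is one possible reading of the stopping rule.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

DONE = "<DONE>"  # completion signal named in the review


@dataclass
class Step:
    hypothesis: str
    experiment: str
    observation: str
    conclusion: str


@dataclass
class DebugTrace:
    steps: List[Step] = field(default_factory=list)
    done: bool = False  # True only if <DONE> appeared before the cap was hit


def scientific_debugging_loop(
    propose: Callable[[List[Step]], Tuple[str, str]],  # hypothetical LLM call
    run_experiment: Callable[[str], str],              # hypothetical debugger call
    conclude: Callable[[str, str], str],               # hypothetical LLM call
    max_iterations: int = 3,                           # assumed cap, not from the paper
) -> DebugTrace:
    trace = DebugTrace()
    for _ in range(max_iterations):
        # Hypothesize: ask for a hypothesis and an experiment that tests it.
        hypothesis, experiment = propose(trace.steps)
        # Observe: run the experiment against the buggy code.
        observation = run_experiment(experiment)
        # Conclude: judge the hypothesis given the observation.
        conclusion = conclude(hypothesis, observation)
        trace.steps.append(Step(hypothesis, experiment, observation, conclusion))
        if DONE in conclusion:
            trace.done = True
            break
    return trace


# Stub callables with made-up text, used only to exercise the loop.
def fake_propose(history: List[Step]) -> Tuple[str, str]:
    if not history:
        return ("The loop bound is off by one.", "inspect the loop variable")
    return ("Relaxing the comparison fixes the test.", "edit the line and rerun")


def fake_run(experiment: str) -> str:
    return "test passed" if "rerun" in experiment else "i = 4"


def fake_conclude(hypothesis: str, observation: str) -> str:
    if observation == "test passed":
        return "The hypothesis is supported. " + DONE
    return "The hypothesis is undecided; refine it."


trace = scientific_debugging_loop(fake_propose, fake_run, fake_conclude)
for number, step in enumerate(trace.steps, start=1):
    print(number, step.hypothesis, "->", step.conclusion)
```

The recorded `steps` list is what would later be shown to a developer as the explanation, and the `done` flag keeps the "finished by itself" case separate from the "hit the cap" case.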

Figure 1. The pipeline and a real example run of AutoSD, with annotations in black boxes and lightly edited for clarity. Given a detailed description of the scientific debugging concept and a description of the bug (A), AutoSD will generate a hypothesis about what the bug is and construct an exper

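Experimental results

Research questions

RQ1 Feasibility: Is AutoSD competitive with prior APR techniques while also providing explanations?
RQ2 Debugger ablation: Does the <DONE> token correlate with higher precision, and how does using predicted debugger output affect performance?
RQ3 LLM differences: How does AutoSD's performance change with the underlying LLM?
RQ4 Developer benefit: Do explanations help developers judge patch correctness in a realistic setting?
RQ5 Developer acceptance: Do developers find the explanations acceptable and desirable?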

Key findings

Benchmark	Plausible (template-based)	Plausible (LLM-based)	Plausible (AutoSD)	Correct (LLM-based)	Correct (AutoSD)
ARHE	85.77 ± 4.20	179	189	177	187
Defects4J v1.2	24	41	87	76
Defects4J v2.0	11	28	110	113

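AutoSD achieves repair performance competitive with the baselines on three benchmarks (ARHE, Defects4J v1.2, and Defects4J v2.0) while also providing explanations.
The <DONE> token is associated with high precision, signaling when AutoSD is likely to have produced a correct patch. Grounding the reasoning in actual debugger results improves reliability.
AutoSD's performance tends to improve with larger or more capable LLMs, with ChatGPT serving as a strong default in the experiments.
In the human study, explanations improved participants' accuracy in judging patch correctness for five of the six real-world bugs, with judging time roughly unchanged.
70% of participants wanted explanations when using APR tools, and 55% were satisfied with the Scientific Debugging presentation.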
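
To make the <DONE> finding concrete, the sketch below shows one way such a confidence signal could be checked: group repair attempts by whether <DONE> was emitted and compare the fraction of correct patches in each group. The function name `precision_by_done` and the record format are assumptions made for illustration, and the toy records are invented, not data from the paper.

```python
from typing import Dict, Iterable, Optional, Tuple


def precision_by_done(
    records: Iterable[Tuple[bool, bool]],
) -> Dict[bool, Optional[float]]:
    """Map each <DONE> flag to the fraction of correct patches in that group.

    Each record is (emitted_done, patch_correct). A group with no records
    maps to None instead of raising a division error.
    """
    counts = {True: [0, 0], False: [0, 0]}  # flag -> [correct, total]
    for emitted_done, patch_correct in records:
        counts[emitted_done][1] += 1
        if patch_correct:
            counts[emitted_done][0] += 1
    return {
        flag: (correct / total if total else None)
        for flag, (correct, total) in counts.items()
    }


# Invented records purely for illustration; not results from the paper.
toy_records = [
    (True, True), (True, True), (True, False),
    (False, False), (False, True), (False, False),
]
result = precision_by_done(toy_records)
print(result)
```

If the fraction for the `True` group is clearly higher than for the `False` group on real data, the signal is useful for deciding which patches to surface to developers first.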


This review was generated by AI and reviewed by a human editor.