QUICK REVIEW

[Paper Review] Agentless: Demystifying LLM-based Software Engineering Agents

Chunqiu Steven Xia, Yinlin Deng|arXiv (Cornell University)|Jul 1, 2024

Multi-Agent Systems and Negotiation | Cited by 13

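One-Line Summary

Agentless proposes an agentless two-phase approach (localization and repair) that uses LLMs to solve SWE-bench Lite problems, achieves competitive performance at low cost, and exposes problems in the benchmark itself.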

ABSTRACT

Recent advancements in large language models (LLMs) have significantly advanced the automation of software development tasks, including code synthesis, program repair, and test generation. More recently, researchers and industry practitioners have developed various autonomous LLM agents to perform end-to-end software development tasks. These agents are equipped with the ability to use tools, run commands, observe feedback from the environment, and plan for future actions. However, the complexity of these agent-based approaches, together with the limited abilities of current LLMs, raises the following question: Do we really have to employ complex autonomous software agents? To attempt to answer this question, we build Agentless -- an agentless approach to automatically solve software development problems. Compared to the verbose and complex setup of agent-based approaches, Agentless employs a simplistic three-phase process of localization, repair, and patch validation, without letting the LLM decide future actions or operate with complex tools. Our results on the popular SWE-bench Lite benchmark show that surprisingly the simplistic Agentless is able to achieve both the highest performance (32.00%, 96 correct fixes) and low cost ($0.70) compared with all existing open-source software agents! Furthermore, we manually classified the problems in SWE-bench Lite and found problems with exact ground truth patch or insufficient/misleading issue descriptions. As such, we construct SWE-bench Lite-S by excluding such problematic issues to perform more rigorous evaluation and comparison. Our work highlights the current overlooked potential of a simple, interpretable technique in autonomous software development. We hope Agentless will help reset the baseline, starting point, and horizon for autonomous software agents, and inspire future work along this crucial direction.

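Motivation and Objectives

Asks whether complex autonomous agents are really necessary for LLM-based software engineering tasks.
Proposes a simple, agentless two-phase framework (localization and repair) for end-to-end bug fixing and feature addition.
Evaluates the approach on SWE-bench Lite and compares its performance and cost against existing open-source and commercial agents.
Analyzes the limitations of SWE-bench Lite and proposes SWE-bench Lite-S as a more rigorous benchmark.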

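Proposed Method

Two-phase workflow: localization followed by repair.
Localization: a hierarchical process that (a) builds a representation of the repository structure, (b) identifies the top-N suspicious files, (c) derives a skeleton of each file containing its class and function declarations, and (d) narrows down to the exact edit locations.
Repair: for each edit location, builds a context window around the code, uses the LLM to generate multiple candidate patches, and filters them with syntax checks and regression tests.
Patches are written in a simple Search/Replace diff format to minimize the scope of each edit and reduce the risk of hallucination.
Patch selection: regression tests exclude unsuitable patches, and majority voting over the normalized patches picks the final patch to submit.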
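The non-LLM steps of this workflow can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (file_skeleton, apply_search_replace, normalize, select_patch), the rule that a search string must match exactly once, the use of ast.unparse as the normalization step, and the toy regression check are all hypothetical choices made for this example. The LLM calls that rank files and generate candidate edits are omitted; the candidate edits are hard-coded.

```python
import ast
from collections import Counter
from typing import Callable, Dict, List, Optional, Tuple


def file_skeleton(source: str) -> List[str]:
    """List class/function headers (no bodies), indented by nesting depth."""
    headers: List[str] = []

    def visit(body, depth: int) -> None:
        for node in body:
            if isinstance(node, ast.ClassDef):
                headers.append("    " * depth + f"class {node.name}")
                visit(node.body, depth + 1)
            elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                params = ", ".join(a.arg for a in node.args.args)
                headers.append("    " * depth + f"def {node.name}({params})")

    visit(ast.parse(source).body, 0)
    return headers


def apply_search_replace(source: str, search: str, replace: str) -> Optional[str]:
    """Apply one Search/Replace edit; reject it unless `search` occurs exactly once."""
    if source.count(search) != 1:
        return None
    return source.replace(search, replace, 1)


def normalize(code: str) -> Optional[str]:
    """Canonical form (comments and spacing dropped), or None on a syntax error."""
    try:
        return ast.unparse(ast.parse(code))
    except SyntaxError:
        return None


def select_patch(
    source: str,
    edits: List[Tuple[str, str]],
    passes_regression: Callable[[str], bool],
) -> Optional[str]:
    """Filter candidate edits, then majority-vote over their normalized forms."""
    votes: Counter = Counter()
    first_seen: Dict[str, str] = {}
    for search, replace in edits:
        patched = apply_search_replace(source, search, replace)
        if patched is None:
            continue  # search text not found, or ambiguous
        key = normalize(patched)
        if key is None:
            continue  # syntax check failed
        if not passes_regression(patched):
            continue  # breaks behavior that worked before the patch
        votes[key] += 1
        first_seen.setdefault(key, patched)
    if not votes:
        return None
    winner, _ = votes.most_common(1)[0]
    return first_seen[winner]


def regression_ok(code: str) -> bool:
    """Toy regression test: area(2, 2) == 4 already holds for the buggy original."""
    namespace: dict = {}
    exec(code, namespace)
    return namespace["area"](2, 2) == 4


ORIGINAL = "def area(w, h):\n    return w + h\n"
EDITS = [
    ("return w + h", "return w * h"),
    ("return w + h", "return w*h  # fix operator"),  # same patch after normalization
    ("return w + h", "return h * w"),                # valid, but a different form
    ("return w + h", "return w - h"),                # fails the regression test
    ("return w + h", "return (w * h"),               # fails the syntax check
    ("return w - h", "return w * h"),                # search text does not match
]
SKELETON = file_skeleton("class A:\n    def f(self, x):\n        return x\n\ndef g():\n    pass\n")
BEST = select_patch(ORIGINAL, EDITS, regression_ok)
print(SKELETON)
print(BEST)
```

Voting on the normalized form rather than the raw text lets two candidates that differ only in spacing or comments count as the same patch, which is the point of normalizing before the vote.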

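Experimental Results

Research Questions

RQ1: Can a two-phase approach without agents match or outperform complex autonomous agent systems at solving repository-level software engineering problems?
RQ2: What is the cost-performance trade-off between the agentless design and agent-based approaches on SWE-bench Lite?
RQ3: How does hierarchical localization affect the accuracy of edit locations and overall patch quality?
RQ4: What problems in SWE-bench Lite affect the evaluation of automated software engineering tools, and how does the refined benchmark (SWE-bench Lite-S) improve rigor?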

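Key Findings

Agentless achieves a 27.33% resolve rate on SWE-bench Lite (82 of 300 problems) at an average cost of $0.34 per problem, beating open-source agents on cost while remaining competitive on success rate.
Hierarchical localization shrinks the context while preserving localization accuracy: 77.7% of ground-truth files are localized, and the context is narrowed progressively in later steps.
The repair configurations show incremental gains: a single sampled patch yields 70 correct fixes at $0.11, multiple samples with majority voting yield 78 fixes at $0.34, and the full pipeline with test filtering yields 82 fixes (the reported Agentless result).
Proposes SWE-bench Lite-S, a 252-problem subset that excludes problems whose descriptions contain the exact ground-truth patch, are misleading, or provide insufficient information. Agentless remains competitive in the rankings on this subset.
Detailed analysis reveals issues in SWE-bench Lite related to description quality, solutions provided in the description, and location information, motivating improvements to benchmark design.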
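The percentages in these findings follow directly from the raw counts. The short check below recomputes them, assuming the 300-problem size of SWE-bench Lite given in the first finding; the helper name resolve_rate and the configuration labels are made up for this example.

```python
TOTAL_PROBLEMS = 300  # SWE-bench Lite size, per the "82 of 300" figure


def resolve_rate(solved: int, total: int = TOTAL_PROBLEMS) -> float:
    """Percentage of problems solved, rounded to two decimals."""
    return round(100 * solved / total, 2)


# Correct-fix counts for the three repair configurations in the findings.
FIXES = {
    "single sample": 70,
    "multiple samples + majority voting": 78,
    "full pipeline with test filtering": 82,
}
RATES = {name: resolve_rate(count) for name, count in FIXES.items()}

# Share of SWE-bench Lite problems retained in SWE-bench Lite-S.
LITE_S_RETAINED = resolve_rate(252)

for name, rate in RATES.items():
    print(f"{name}: {rate}%")
print(f"Lite-S keeps {LITE_S_RETAINED}% of the problems")
```

Going from one sample to the full pipeline adds 12 correct fixes, or four percentage points, and Lite-S drops 48 of the 300 problems.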


This review was generated by AI and checked by a human editor.