QUICK REVIEW

[Paper Review] Agentless: Demystifying LLM-based Software Engineering Agents

Chunqiu Steven Xia, Yinlin Deng|arXiv (Cornell University)|Jul 1, 2024

Multi-Agent Systems and Negotiation | Cited by 13

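One-Sentence Summary

Agentless is an agentless, two-phase approach (localization, then repair) that uses large language models to resolve SWE-bench Lite issues. It achieves competitive performance at low cost and exposes problems in the benchmark itself.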

ABSTRACT

Recent advancements in large language models (LLMs) have significantly advanced the automation of software development tasks, including code synthesis, program repair, and test generation. More recently, researchers and industry practitioners have developed various autonomous LLM agents to perform end-to-end software development tasks. These agents are equipped with the ability to use tools, run commands, observe feedback from the environment, and plan for future actions. However, the complexity of these agent-based approaches, together with the limited abilities of current LLMs, raises the following question: Do we really have to employ complex autonomous software agents? To attempt to answer this question, we build Agentless -- an agentless approach to automatically solve software development problems. Compared to the verbose and complex setup of agent-based approaches, Agentless employs a simplistic three-phase process of localization, repair, and patch validation, without letting the LLM decide future actions or operate with complex tools. Our results on the popular SWE-bench Lite benchmark show that surprisingly the simplistic Agentless is able to achieve both the highest performance (32.00%, 96 correct fixes) and low cost ($0.70) compared with all existing open-source software agents! Furthermore, we manually classified the problems in SWE-bench Lite and found problems with exact ground truth patch or insufficient/misleading issue descriptions. As such, we construct SWE-bench Lite-S by excluding such problematic issues to perform more rigorous evaluation and comparison. Our work highlights the current overlooked potential of a simple, interpretable technique in autonomous software development. We hope Agentless will help reset the baseline, starting point, and horizon for autonomous software agents, and inspire future work along this crucial direction.

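Research Motivation and Goals

Ask whether complex autonomous agents are actually necessary for LLM-based software engineering tasks.
Propose a simple, agentless two-phase framework (localization and repair) for end-to-end bug fixing and feature addition.
Evaluate the approach on SWE-bench Lite, comparing its performance and cost against existing open-source and commercial agents.
Analyze the limitations of SWE-bench Lite and propose SWE-bench Lite-S as a more rigorous benchmark.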

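Proposed Method

Two-phase workflow: localization first, then repair.
Localization: a hierarchical process that (a) builds a representation of the repository structure, (b) identifies the top N suspicious files, (c) derives a skeleton of each file containing its class and function declarations, and (d) narrows down to the precise edit locations.
Repair: for each edit location, build a context window around the code, use the LLM to generate multiple candidate patches, and filter them with syntax checks and regression tests.
Patches are generated in a simple search/replace diff format to minimize the scope of each edit and reduce the risk of hallucination.
Patch selection uses regression tests to filter out failing patches, then applies majority voting over the normalized patches to choose the final patch to submit.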
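
Step (c) of localization, deriving a file skeleton, can be sketched with Python's standard `ast` module. This is a minimal illustration rather than the authors' implementation: the function name `file_skeleton` and the output layout are assumptions, and the LLM-driven steps (ranking suspicious files, choosing edit locations) are not shown.

```python
import ast


def file_skeleton(source: str) -> str:
    """Return only class and function declarations, with bodies elided.

    Hypothetical helper: the skeleton layout is an assumption made for
    illustration, not the format used by Agentless.
    """
    lines = []

    def visit(body, depth):
        for node in body:
            indent = "    " * depth
            if isinstance(node, ast.ClassDef):
                lines.append(f"{indent}class {node.name}:")
                # Descend so that methods appear under their class.
                visit(node.body, depth + 1)
            elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                keyword = "async def" if isinstance(node, ast.AsyncFunctionDef) else "def"
                # Only plain positional parameters are listed, to keep the sketch short.
                params = ", ".join(arg.arg for arg in node.args.args)
                lines.append(f"{indent}{keyword} {node.name}({params}): ...")

    visit(ast.parse(source).body, 0)
    return "\n".join(lines)


example = '''
import os

class Repo:
    def __init__(self, root):
        self.root = root

    def files(self):
        return os.listdir(self.root)

def main(argv):
    print(argv)
'''

skeleton = file_skeleton(example)
print(skeleton)
```

The skeleton drops imports and function bodies, so a whole file can be shown to the model in far fewer tokens before it is asked to pick the classes or functions worth reading in full.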
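
The repair and selection steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the Agentless code: every function name is hypothetical, each candidate is reduced to a single `(search, replace)` pair, normalization is approximated by comparing `ast.dump` output (which ignores comments and whitespace), and the regression check is passed in as a callable.

```python
import ast
from collections import Counter


def apply_search_replace(source, search, replace):
    """Apply one search/replace edit; reject it unless `search` matches exactly once."""
    if source.count(search) != 1:
        return None
    return source.replace(search, replace, 1)


def is_valid_python(source):
    """Syntax check: the patched file must still parse."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


def normalize(source):
    """Assumed normalization: an AST dump, so comments and spacing do not split votes."""
    return ast.dump(ast.parse(source))


def select_patch(original, candidates, passes_regression_tests):
    """Filter candidates, then majority-vote over the normalized survivors."""
    survivors = []
    for search, replace in candidates:
        patched = apply_search_replace(original, search, replace)
        if patched is None or not is_valid_python(patched):
            continue
        if not passes_regression_tests(patched):
            continue
        survivors.append(patched)
    if not survivors:
        return None
    votes = Counter(normalize(p) for p in survivors)
    winner, _ = votes.most_common(1)[0]
    # Return the first surviving patch whose normalized form won the vote.
    return next(p for p in survivors if normalize(p) == winner)


def regression_ok(patched):
    """Toy regression test: behavior that already worked must keep working."""
    namespace = {}
    exec(patched, namespace)
    return namespace["add"](0, 0) == 0


original = "def add(a, b):\n    return a - b\n"
candidates = [
    ("return a - b", "return a + b"),
    ("return a - b", "return a+b  # fixed"),  # same fix, different surface form
    ("return a - b", "return b - a"),
    ("return a - b", "return a +"),           # dropped by the syntax check
    ("return a * b", "return a + b"),         # dropped: search text not found
]
chosen = select_patch(original, candidates, regression_ok)
print(chosen)
```

Requiring exactly one match for the search text is a design choice of this sketch: it keeps an edit from landing in an unintended place, which is in the spirit of the stated goal of minimizing edit scope.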

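Experimental Results

Research Questions

RQ1: Can a non-agent, two-phase approach match or even exceed complex autonomous agent systems at resolving repository-level software engineering issues?
RQ2: What is the cost-performance trade-off between the agentless design and agent-based approaches on SWE-bench Lite?
RQ3: How does hierarchical localization affect the precision of edit locations and overall patch quality?
RQ4: Which problems in SWE-bench Lite affect the evaluation of autonomous software engineering tools, and how can a revised benchmark (SWE-bench Lite-S) improve rigor?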

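Key Findings

Agentless resolves 27.33% of SWE-bench Lite issues (82/300) at an average cost of $0.34 per issue, which is cheaper than open-source agents and competitive in resolve rate.
Hierarchical localization shrinks the context while maintaining localization accuracy: 77.7% of the ground-truth files are localized, and the context narrows progressively at each subsequent step.
The repair settings show incremental gains: a single sampled patch yields 70 correct fixes at a cost of $0.11; multiple samples with majority voting yield 78 fixes at $0.34; and the full pipeline with test filtering yields 82 fixes (the reported Agentless result).
The paper proposes a subset, SWE-bench Lite-S (252 issues), obtained by removing issues that contain the exact ground-truth patch, have misleading descriptions, or provide insufficient information; on this subset, Agentless remains competitive in the rankings.
A detailed analysis reveals problems in SWE-bench Lite concerning description quality, solutions provided in the issue, and location information, motivating improved benchmark design.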
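
The percentages behind these findings follow from the reported counts. The short calculation below is a convenience sketch: the fix counts, the 300-issue total, the 252-issue subset, and the $0.34 figure come from the findings above, while the cost per resolved issue is a derived number that assumes $0.34 is the average over all 300 issues.

```python
TOTAL_ISSUES = 300    # size of SWE-bench Lite, as in the 82/300 figure
LITE_S_ISSUES = 252   # size of the filtered SWE-bench Lite-S subset

# Correct fixes reported for each repair setting.
fixes = {
    "single sample": 70,
    "multiple samples + majority voting": 78,
    "full pipeline with test filtering": 82,
}

# Resolve rate in percent for each setting.
rates = {name: round(100 * n / TOTAL_ISSUES, 2) for name, n in fixes.items()}

# Issues excluded when building Lite-S.
removed = TOTAL_ISSUES - LITE_S_ISSUES

# Derived, not reported: average cost per *resolved* issue,
# assuming $0.34 is the mean cost across all 300 issues.
cost_per_resolved = round(0.34 * TOTAL_ISSUES / fixes["full pipeline with test filtering"], 2)

for name, rate in rates.items():
    print(f"{name}: {rate}%")
print(f"issues removed for Lite-S: {removed}")
print(f"approx. cost per resolved issue: ${cost_per_resolved}")
```

Under these assumptions, each added stage raises the resolve rate by a few points (roughly 23.3%, 26.0%, 27.3%), and 48 of the 300 issues are excluded from Lite-S.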

This review was generated by AI and checked by human editors.