QUICK REVIEW

[Paper Review] SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering

John Yang, Carlos Jimenez-Gomez|arXiv (Cornell University)|May 6, 2024

Multi-Agent Systems and Negotiation | Cited by 24

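One-Sentence Summary

SWE-agent introduces an agent-computer interface (ACI) that lets language model agents carry out software engineering tasks autonomously, achieving state-of-the-art results on SWE-bench and HumanEvalFix.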

ABSTRACT

Language model (LM) agents are increasingly being used to automate complicated tasks in digital environments. Just as humans benefit from powerful software applications, such as integrated development environments, for complex tasks like software engineering, we posit that LM agents represent a new category of end users with their own needs and abilities, and would benefit from specially-built interfaces to the software they use. We investigate how interface design affects the performance of language model agents. As a result of this exploration, we introduce SWE-agent: a system that facilitates LM agents to autonomously use computers to solve software engineering tasks. SWE-agent's custom agent-computer interface (ACI) significantly enhances an agent's ability to create and edit code files, navigate entire repositories, and execute tests and other programs. We evaluate SWE-agent on SWE-bench and HumanEvalFix, achieving state-of-the-art performance on both with a pass@1 rate of 12.5% and 87.7%, respectively, far exceeding the previous state-of-the-art achieved with non-interactive LMs. Finally, we provide insight on how the design of the ACI can impact agents' behavior and performance.

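Motivation and Objectives

Motivate the need for interfaces tailored to LM agents that perform software engineering tasks.
Design and implement an agent-computer interface (ACI) that improves LM agents' performance at searching, navigating, editing, and executing code.
Evaluate SWE-agent on SWE-bench and HumanEvalFix to establish its gains over baselines.
Analyze how ACI design choices affect agent behavior and performance.
Open-source SWE-agent to improve reproducibility and support further research on LM-driven software engineering.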

Proposed Method

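Define a purpose-built ACI with a small set of simple, LM-friendly actions (search/navigation, file viewer, file editor, context management).
Add guardrails, such as syntax checking and static analysis (linting), to prevent editing errors and help the agent recover from them.
Use a ReAct-style loop (think aloud, then act): at each turn the agent produces one thought and one command, then processes the environment's feedback.
Provide informative but concise environment feedback and keep the history compact so that the context stays relevant.
Build the ACI on top of the Linux shell, keeping standard Linux utilities available when needed.
Run ablations on the editing interface, search strategy, file viewer window size, and context management to measure their effect on performance.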
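The interaction pattern described above can be illustrated with a minimal sketch. It is a toy, not SWE-agent's code: `MiniACI`, `run_episode`, `scripted_policy`, the `WINDOW` size, and the `keep_last` parameter are all hypothetical names chosen for this example. The file system is an in-memory dictionary, a fixed script stands in for the language model, and Python's `ast.parse` stands in for the linter as a bare syntax check.

```python
import ast
from dataclasses import dataclass
from typing import Optional

WINDOW = 100  # hypothetical viewer window size, in lines


@dataclass
class MiniACI:
    """Toy in-memory interface; an illustration, not SWE-agent's implementation."""
    files: dict
    current: Optional[str] = None
    top: int = 0  # zero-based index of the first visible line

    def open(self, path):
        self.current, self.top = path, 0
        return self._view()

    def scroll_down(self):
        last_start = max(0, len(self.files[self.current]) - WINDOW)
        self.top = min(self.top + WINDOW, last_start)
        return self._view()

    def edit(self, start, end, new_text):
        """Replace 1-indexed lines start..end; reject edits that break syntax."""
        lines = self.files[self.current]
        candidate = lines[: start - 1] + new_text.split("\n") + lines[end:]
        try:
            ast.parse("\n".join(candidate))  # guardrail: syntax check only
        except SyntaxError as err:
            return f"Edit rejected, file unchanged: {err.msg} (line {err.lineno})"
        self.files[self.current] = candidate
        return self._view()

    def _view(self):
        lines = self.files[self.current]
        shown = lines[self.top : self.top + WINDOW]
        body = "\n".join(f"{self.top + i + 1}:{text}" for i, text in enumerate(shown))
        return f"[File: {self.current} ({len(lines)} lines total)]\n{body}"


def run_episode(aci, policy, max_steps=10, keep_last=2):
    """ReAct-style loop: one thought and one command per turn, then feedback."""
    history = []  # (thought, command, observation) triples
    for _ in range(max_steps):
        thought, command = policy(history)
        name, *args = command
        if name == "submit":
            break
        observation = getattr(aci, name)(*args)
        history.append((thought, command, observation))
        # Context management: keep only the most recent observations in full.
        cutoff = len(history) - keep_last
        history = [
            (t, c, o if i >= cutoff else "(old output omitted)")
            for i, (t, c, o) in enumerate(history)
        ]
    return history


def scripted_policy(history):
    """Stand-in for the LM: replays a fixed script, one step per turn."""
    script = [
        ("Inspect the file first.", ("open", "calc.py")),
        ("Fix the operator.", ("edit", 2, 2, "    return a +")),  # malformed on purpose
        ("The edit was rejected; retry.", ("edit", 2, 2, "    return a + b")),
        ("The fix is in place.", ("submit",)),
    ]
    return script[len(history)]


if __name__ == "__main__":
    aci = MiniACI(files={"calc.py": ["def add(a, b):", "    return a - b"]})
    for thought, command, observation in run_episode(aci, scripted_policy):
        print(f"THOUGHT: {thought}\nCOMMAND: {command}\n{observation}\n")
```

In this sketch the malformed edit leaves the file untouched and returns a short error message, so the next turn can retry with a corrected edit. Older observations are collapsed to a single placeholder line so that the history passed back to the policy stays short.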

Experimental Results

Research Questions

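RQ1: How do LM agents equipped with a custom agent-computer interface (ACI) perform on software engineering tasks compared with shell-only and retrieval-based baselines?
RQ2: Which ACI design principles most affect LM agent performance on code search, navigation, editing, and execution?
RQ3: What quantitative gains does SWE-agent achieve on SWE-bench and HumanEvalFix, and how well does the approach transfer across LMs?
RQ4: Which failure modes do LM agents exhibit in this setting, and how do guardrails and interface design mitigate them?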

Key Findings

Model	% Resolved (SWE-bench)	$ Avg. Cost (SWE-bench)	% Resolved (SWE-bench Lite)	$ Avg. Cost (SWE-bench Lite)
RAG w/ GPT-4 Turbo	1.31	0.13	2.67	0.13
RAG w/ Claude 3 Opus	3.79	0.25	4.33	0.25
Shell-only agent w/ GPT-4 Turbo	-	-	11.00	1.46
Shell-only agent w/o Demonstration	-	-	7.33	0.79
SWE-agent w/ GPT-4 Turbo	12.47	1.59	18.00	1.67
SWE-agent w/ Claude 3 Opus	10.46	2.59	13.00	2.18

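With GPT-4 Turbo, SWE-agent resolves 12.47% of the full SWE-bench test set (2,294 instances), versus 3.79% for the strongest non-interactive baseline in the table (RAG w/ Claude 3 Opus).
On SWE-bench Lite, SWE-agent w/ GPT-4 Turbo resolves 18.00%, versus 2.67% for RAG w/ GPT-4 Turbo and 4.33% for RAG w/ Claude 3 Opus, at a higher average cost per instance ($1.67 versus $0.13 and $0.25).
On HumanEvalFix, SWE-agent reaches 87.7% pass@1, well above the previous non-interactive approaches.
A compact, LM-friendly set of search, navigation, and editing actions, combined with concise, informative feedback and guardrails (linting), outperforms the shell-only agent on SWE-bench Lite: 18.00% versus 11.00% with GPT-4 Turbo, and 7.33% for the shell-only agent without a demonstration.
Ablations indicate that iterative search, a compact file-editing command, and robust context management each contribute to performance; the file viewer's window size and the inclusion of a demonstration also affect results.
The approach transfers across models: SWE-agent w/ Claude 3 Opus resolves 10.46% of SWE-bench and 13.00% of SWE-bench Lite, suggesting that the ACI's benefit is not specific to a single LM.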
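To make the percentages above more concrete, the following sketch converts resolve rates into approximate instance counts and computes gaps in percentage points. The rates are copied from the table. The counts are derived here, not reported in this review, and the SWE-bench Lite size of 300 instances is an assumption that the review does not state. The names `RESOLVED`, `approx_resolved`, and `gain_pp` are hypothetical helpers for this example.

```python
FULL_SIZE = 2294  # SWE-bench test instances, as stated in the findings above
LITE_SIZE = 300   # assumption: SWE-bench Lite size, not stated in this review

# System -> (% resolved on SWE-bench, % resolved on SWE-bench Lite).
# None marks a value the table does not report.
RESOLVED = {
    "RAG w/ GPT-4 Turbo": (1.31, 2.67),
    "RAG w/ Claude 3 Opus": (3.79, 4.33),
    "Shell-only agent w/ GPT-4 Turbo": (None, 11.00),
    "Shell-only agent w/o Demonstration": (None, 7.33),
    "SWE-agent w/ GPT-4 Turbo": (12.47, 18.00),
    "SWE-agent w/ Claude 3 Opus": (10.46, 13.00),
}


def approx_resolved(rate_pct, n_instances):
    """Approximate number of resolved instances implied by a resolve rate."""
    return round(rate_pct / 100 * n_instances)


def gain_pp(system, baseline, column):
    """Gap in percentage points; column 0 is SWE-bench, column 1 is Lite."""
    return round(RESOLVED[system][column] - RESOLVED[baseline][column], 2)


if __name__ == "__main__":
    for name, (full, lite) in RESOLVED.items():
        full_count = "-" if full is None else approx_resolved(full, FULL_SIZE)
        print(f"{name}: ~{full_count} of {FULL_SIZE}, "
              f"~{approx_resolved(lite, LITE_SIZE)} of {LITE_SIZE}")
    print(gain_pp("SWE-agent w/ GPT-4 Turbo", "Shell-only agent w/ GPT-4 Turbo", 1))
```

Read this way, the ACI's advantage over the shell-only agent on Lite is a gap of 7 percentage points with the same model, which separates the interface's contribution from the gain that comes simply from letting the model interact with the environment.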


This review was generated by AI and checked by a human editor.