QUICK REVIEW

[Paper Review] MAGIS: LLM-Based Multi-Agent Framework for GitHub Issue Resolution

Wei Tao, Yucheng Zhou|arXiv (Cornell University)|Mar 26, 2024

Scientific Computing and Data Management | Cited by 8

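One-Sentence Summary

MAGIS coordinates LLMs through a four-agent framework (Manager, Repository Custodian, Developer, QA Engineer) to resolve GitHub issues, reaching a 13.94% resolved ratio on SWE-bench, well above strong baselines such as GPT-4 (1.74%) and Claude-2 (4.88%).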

ABSTRACT

In software development, resolving the emergent issues within GitHub repositories is a complex challenge that involves not only the incorporation of new code but also the maintenance of existing code. Large Language Models (LLMs) have shown promise in code generation but face difficulties in resolving Github issues, particularly at the repository level. To overcome this challenge, we empirically study the reason why LLMs fail to resolve GitHub issues and analyze the major factors. Motivated by the empirical findings, we propose a novel LLM-based Multi-Agent framework for GitHub Issue reSolution, MAGIS, consisting of four agents customized for software evolution: Manager, Repository Custodian, Developer, and Quality Assurance Engineer agents. This framework leverages the collaboration of various agents in the planning and coding process to unlock the potential of LLMs to resolve GitHub issues. In experiments, we employ the SWE-bench benchmark to compare MAGIS with popular LLMs, including GPT-3.5, GPT-4, and Claude-2. MAGIS can resolve 13.94% GitHub issues, significantly outperforming the baselines. Specifically, MAGIS achieves an eight-fold increase in resolved ratio over the direct application of GPT-4, the advanced LLM.

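Motivation and Objectives

Investigate why LLMs struggle with repository-level GitHub issue resolution and identify the contributing factors.
Propose MAGIS, an LLM-based multi-agent framework that coordinates issue resolution within a software repository.
Evaluate MAGIS against strong baselines on SWE-bench under the with-Oracle and without-Oracle settings.
Analyze the planning and coding factors that affect whether a GitHub issue is resolved.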

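Proposed Method

Empirically analyze LLM performance under the with-Oracle and without-Oracle settings to identify factors such as line location and code change complexity.
Design MAGIS with four agent types and a collaborative planning and coding workflow.
Develop algorithms for locating relevant files, building the team, and kicking off planning.
Iterate on code modification and QA review, guided by LLM prompts, to produce repository-level changes.
Evaluate on SWE-bench against GPT-3.5, GPT-4, Claude-2, and SWE-Llama baselines using the applied and resolved ratios.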
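
The workflow described above (locate files, plan tasks, then alternate between code modification and QA review) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: every name here (`Task`, `locate_files`, `plan_tasks`, `resolve_issue`) is hypothetical, file location uses a simple word-overlap score as a stand-in for whatever retrieval the paper uses, and the Developer and QA Engineer are passed in as plain callables where a real system would call an LLM.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Task:
    """One planned change to one file (hypothetical structure)."""
    file_path: str
    description: str
    patch: str = ""
    approved: bool = False


def locate_files(issue: str, repo: Dict[str, str], top_k: int = 2) -> List[str]:
    """Repository Custodian stand-in: rank files by word overlap with the issue.

    Assumption: plain word overlap replaces the paper's retrieval step.
    """
    issue_words = set(issue.lower().split())
    scores = {
        path: len(issue_words & set(text.lower().split()))
        for path, text in repo.items()
    }
    ranked = sorted(scores, key=lambda p: (-scores[p], p))
    return [p for p in ranked if scores[p] > 0][:top_k]


def plan_tasks(issue: str, files: List[str]) -> List[Task]:
    """Manager stand-in: create one task per candidate file."""
    return [Task(file_path=f, description=f"Address issue in {f}: {issue}") for f in files]


def resolve_issue(
    issue: str,
    repo: Dict[str, str],
    developer: Callable[[Task, str], str],
    qa_review: Callable[[Task], str],
    max_rounds: int = 3,
) -> List[Task]:
    """Run the plan, then loop Developer -> QA until approval or max_rounds.

    `developer` returns a patch given the task and the last QA feedback.
    `qa_review` returns feedback text, or an empty string to approve.
    """
    tasks = plan_tasks(issue, locate_files(issue, repo))
    for task in tasks:
        feedback = ""
        for _ in range(max_rounds):
            task.patch = developer(task, feedback)
            feedback = qa_review(task)
            if not feedback:
                task.approved = True
                break
    return tasks
```

Passing the agents in as callables keeps the control flow testable without a model: a stub Developer that only produces an acceptable patch after receiving feedback is enough to exercise the review loop.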

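Experimental Results

Research Questions

RQ1: In with-Oracle GitHub issue resolution, which factors affect LLM performance (for example, line location and code change complexity)?
RQ2: In without-Oracle resolution, which factors affect LLM performance (for example, file retrieval and recall)?
RQ3: How does the MAGIS multi-agent framework improve GitHub issue resolution compared with single-LLM baselines?
RQ4: How do the planning and QA processes affect resolution?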

Key Findings

Method	% Applied	% Resolved
GPT-3.5	11.67	0.84
Claude-2	49.36	4.88
GPT-4	13.24	1.74
SWE-Llama 7b	51.56	2.12
SWE-Llama 13b	49.13	4.36
Devin [34]	-	13.86
MAGIS	97.39	13.94
MAGIS (w/o QA)	92.71	10.63
MAGIS (w/o hints)	94.25	10.28
MAGIS (w/o hints, w/o QA)	91.99	8.71

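MAGIS achieves a resolved ratio of 13.94%, about eight times that of the GPT-4 baseline (1.74%).
MAGIS clearly outperforms GPT-4 and Claude-2 on both the applied and resolved metrics on SWE-bench.
Line location accuracy is positively correlated with the probability of resolution, especially for Claude-2.
Code change complexity is negatively correlated with issue resolution across GPT-3.5, GPT-4, and Claude-2.
The MAGIS variants without QA (10.63%), without hints (10.28%), or without both (8.71%) still resolve more issues than GPT-4 but fewer than the full MAGIS.
The ablation study indicates that both QA and hints contribute further gains.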
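
The "eight-fold" claim and the ablation gaps can be checked directly from the resolved ratios in the table above. The snippet below is a small arithmetic sketch; the helper names `fold_over` and `drop_from_full` are made up for illustration, and the numbers are copied from the table rather than recomputed from SWE-bench.

```python
# Resolved ratios (%) copied from the results table.
resolved = {
    "GPT-3.5": 0.84,
    "Claude-2": 4.88,
    "GPT-4": 1.74,
    "MAGIS": 13.94,
    "MAGIS (w/o QA)": 10.63,
    "MAGIS (w/o hints)": 10.28,
    "MAGIS (w/o hints, w/o QA)": 8.71,
}


def fold_over(method: str, baseline: str) -> float:
    """How many times higher `method`'s resolved ratio is than `baseline`'s."""
    return resolved[method] / resolved[baseline]


def drop_from_full(variant: str) -> float:
    """Percentage-point loss of an ablated variant relative to full MAGIS."""
    return resolved["MAGIS"] - resolved[variant]


print(round(fold_over("MAGIS", "GPT-4"), 2))  # 8.01
for name in ("MAGIS (w/o QA)", "MAGIS (w/o hints)", "MAGIS (w/o hints, w/o QA)"):
    print(name, round(drop_from_full(name), 2), "points below full MAGIS")
```

Note that the ablation figures are differences in percentage points, while the GPT-4 comparison is a ratio, so the two should not be read on the same scale.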


This review was generated by AI and checked by a human editor.