QUICK REVIEW

[Paper Review] MAGIS: LLM-Based Multi-Agent Framework for GitHub Issue Resolution

Wei Tao, Yucheng Zhou | arXiv (Cornell University) | 2024-03-26

Scientific Computing and Data Management | Citations: 8

One-Line Summary

MAGIS coordinates LLMs through a framework of four agents (Manager, Repository Custodian, Developer, and QA Engineer) to resolve GitHub issues, and on SWE-bench it resolves a substantially larger share of issues than strong baselines.

ABSTRACT

In software development, resolving the emergent issues within GitHub repositories is a complex challenge that involves not only the incorporation of new code but also the maintenance of existing code. Large Language Models (LLMs) have shown promise in code generation but face difficulties in resolving Github issues, particularly at the repository level. To overcome this challenge, we empirically study the reason why LLMs fail to resolve GitHub issues and analyze the major factors. Motivated by the empirical findings, we propose a novel LLM-based Multi-Agent framework for GitHub Issue reSolution, MAGIS, consisting of four agents customized for software evolution: Manager, Repository Custodian, Developer, and Quality Assurance Engineer agents. This framework leverages the collaboration of various agents in the planning and coding process to unlock the potential of LLMs to resolve GitHub issues. In experiments, we employ the SWE-bench benchmark to compare MAGIS with popular LLMs, including GPT-3.5, GPT-4, and Claude-2. MAGIS can resolve 13.94% GitHub issues, significantly outperforming the baselines. Specifically, MAGIS achieves an eight-fold increase in resolved ratio over the direct application of GPT-4, the advanced LLM.

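Motivation and Objectives

Investigate why LLMs struggle to resolve GitHub issues at the repository level and identify the contributing factors.
Propose MAGIS, an LLM-based multi-agent framework that coordinates issue resolution in software repositories.
Evaluate MAGIS against strong LLM baselines on SWE-bench in both the with-Oracle and without-Oracle settings.
Analyze the planning and coding factors that affect whether a GitHub issue is resolved successfully.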

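Proposed Method

Conduct an empirical analysis of LLM performance in the with-Oracle and without-Oracle settings to identify factors such as line location and code-change complexity.
Design MAGIS around four agent types and their collaborative planning and coding workflow.
Develop algorithms for locating relevant files, building the team, and planning the kickoff.
Generate repository-level changes through iterative code modification and QA review, both guided by LLM prompts.
Evaluate against GPT-3.5, GPT-4, Claude-2, and SWE-Llama baselines on SWE-bench using the applied and resolved ratios.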
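The two evaluation metrics named above can be made concrete with a small sketch. It assumes that "applied" means the generated patch applies cleanly to the repository and that "resolved" means the patch applies and the instance's tests then pass; the review does not spell out these definitions, so treat them as assumptions. The `InstanceOutcome` record and the `applied_and_resolved` helper are hypothetical names introduced for illustration and are not part of MAGIS or SWE-bench.

```python
from dataclasses import dataclass


@dataclass
class InstanceOutcome:
    """Outcome of one benchmark instance (hypothetical record layout)."""
    instance_id: str
    patch_applied: bool  # assumed meaning: the patch applied cleanly
    tests_passed: bool   # assumed meaning: tests passed after applying it


def applied_and_resolved(outcomes):
    """Return (applied %, resolved %) over all instances."""
    total = len(outcomes)
    if total == 0:
        raise ValueError("no outcomes to score")
    applied = sum(o.patch_applied for o in outcomes)
    # A patch that does not apply cannot count as resolving the issue.
    resolved = sum(o.patch_applied and o.tests_passed for o in outcomes)
    return 100.0 * applied / total, 100.0 * resolved / total


# Made-up outcomes purely to exercise the function; not real benchmark data.
outcomes = [
    InstanceOutcome("demo-1", True, True),
    InstanceOutcome("demo-2", True, False),
    InstanceOutcome("demo-3", False, False),
    InstanceOutcome("demo-4", True, False),
]
applied_pct, resolved_pct = applied_and_resolved(outcomes)
print(f"applied: {applied_pct:.2f}%  resolved: {resolved_pct:.2f}%")
```

Under these assumed definitions the resolved ratio can never exceed the applied ratio, which is why the two numbers are usually read together.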

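Experimental Results

Research Questions

RQ1: Which factors affect LLM performance when resolving GitHub issues in the with-Oracle setting (e.g., line location, code-change complexity)?
RQ2: Which factors affect LLM performance when resolving issues in the without-Oracle setting (e.g., file retrieval and recall)?
RQ3: How does the multi-agent MAGIS framework improve GitHub issue resolution compared with single-LLM baselines?
RQ4: How do the planning and QA processes affect resolution effectiveness?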

Key Results

Method	% Applied	% Resolved
GPT-3.5	11.67	0.84
Claude-2	49.36	4.88
GPT-4	13.24	1.74
SWE-Llama 7b	51.56	2.12
SWE-Llama 13b	49.13	4.36
Devin [34]	-	13.86
MAGIS	97.39	13.94
MAGIS (w/o QA)	92.71	10.63
MAGIS (w/o hints)	94.25	10.28
MAGIS (w/o hints, w/o QA)	91.99	8.71

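MAGIS achieves a resolved ratio of 13.94%, about eight times that of the GPT-4 baseline (1.74%).
On SWE-bench, MAGIS substantially outperforms GPT-4 and Claude-2 on both the applied and resolved ratios.
Line-location accuracy correlates positively with the probability of resolution, most clearly for Claude-2.
Code-change complexity correlates negatively with resolution across GPT-3.5, GPT-4, and Claude-2.
The full MAGIS configuration shows a clear advantage: without QA or without hints, the resolved ratio (10.63% and 10.28%) remains above GPT-4's but below that of full MAGIS.
The ablation study suggests that both QA and human hints further improve performance.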
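As a quick check of the arithmetic behind these findings, the sketch below recomputes the improvement over GPT-4 and the drop for each ablated variant, using only the resolved ratios listed in the results table. The helper names `fold_over` and `drop_from_full` are illustrative.

```python
# Resolved ratios (%) copied from the results table in this review.
resolved = {
    "GPT-4": 1.74,
    "MAGIS": 13.94,
    "MAGIS (w/o QA)": 10.63,
    "MAGIS (w/o hints)": 10.28,
    "MAGIS (w/o hints, w/o QA)": 8.71,
}


def fold_over(baseline, method, table=resolved):
    """How many times larger the method's resolved ratio is than the baseline's."""
    return table[method] / table[baseline]


def drop_from_full(variant, table=resolved):
    """Percentage points lost relative to full MAGIS, rounded to 2 decimals."""
    return round(table["MAGIS"] - table[variant], 2)


fold = fold_over("GPT-4", "MAGIS")
print(f"MAGIS vs GPT-4: {fold:.2f}x")

ablations = [name for name in resolved if name.startswith("MAGIS (")]
for name in ablations:
    print(f"{name}: -{drop_from_full(name):.2f} points, "
          f"{fold_over('GPT-4', name):.2f}x GPT-4")
```

The ratio comes out at roughly 8.01, consistent with the eight-fold claim. Removing QA costs 3.31 points, removing hints costs 3.66 points, and removing both costs 5.23 points, yet even the weakest variant stays at about five times GPT-4's resolved ratio.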


This review was generated by AI and reviewed by a human editor.