QUICK REVIEW

[Paper Review] Self-collaboration Code Generation via ChatGPT

Yihong Dong, Xue Jiang|arXiv (Cornell University)|2023. 04. 15.

Software Engineering Research | Citations: 26

One-Line Summary

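This paper builds a self-collaboration framework in which ChatGPT plays three roles (analyst, coder, tester) as a virtual team for code generation, and shows that it achieves state-of-the-art results on code generation benchmarks and outperforms GPT-4 in some settings.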

ABSTRACT

Although Large Language Models (LLMs) have demonstrated remarkable code-generation ability, they still struggle with complex tasks. In real-world software development, humans usually tackle complex tasks through collaborative teamwork, a strategy that significantly controls development complexity and enhances software quality. Inspired by this, we present a self-collaboration framework for code generation employing LLMs, exemplified by ChatGPT. Specifically, through role instructions, 1) Multiple LLM agents act as distinct `experts', each responsible for a specific subtask within a complex task; 2) Specify the way to collaborate and interact, so that different roles form a virtual team to facilitate each other's work, ultimately the virtual team addresses code generation tasks collaboratively without the need for human intervention. To effectively organize and manage this virtual team, we incorporate software-development methodology into the framework. Thus, we assemble an elementary team consisting of three LLM roles (i.e., analyst, coder, and tester) responsible for software development's analysis, coding, and testing stages. We conduct comprehensive experiments on various code-generation benchmarks. Experimental results indicate that self-collaboration code generation relatively improves 29.9%-47.1% Pass@1 compared to the base LLM agent. Moreover, we showcase that self-collaboration could potentially enable LLMs to efficiently handle complex repository-level tasks that are not readily solved by the single LLM agent.

Motivation and Objectives

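Address the difficulty LLMs have with complex code generation tasks, which motivates this work, through collaborative LLM teamwork.
Propose a self-collaboration framework that solves problems by assigning roles and defining how the agents collaborate.
Instantiate an elementary three-role team (analyst, coder, tester) that follows software-development methodology (SDM).
Demonstrate robustness and generality across multiple benchmarks and realistic tasks.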

Proposed Method

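Define a division of labor (DOL) through role instructions, creating specialized LLM experts.
Implement collaboration through a shared blackboard and formalize the coordination between roles (Eq. 1 and Eq. 2).
Instantiate the elementary team (analyst, coder, tester) with three ChatGPT roles, following a waterfall-style SDM of analysis, coding, and testing.
Use role instructions to fix each agent's role once at initialization, so that later interactions proceed without re-prompting.
Evaluate with Pass@k (emphasizing Pass@1) on MBPP, HumanEval, MBPP-ET, and HumanEval-ET, under both an NL-only prompt setting and an NL + signature + public test case setting.
Explore the impact of role-playing versus non-role prompts and the effect of the number of interaction rounds (MI).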
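The method steps above can be sketched as a small loop. This is a minimal illustration, not the paper's implementation: the role instruction strings, the `self_collaborate` function, the `PASS` convention for the tester, and the `fake_llm` stand-in are all hypothetical names introduced here. A real setup would replace `fake_llm` with a call to a chat model.

```python
from typing import Callable, Dict, List

# Hypothetical role instructions; the paper's actual prompts are not reproduced here.
ROLE_INSTRUCTIONS: Dict[str, str] = {
    "analyst": "You are a requirements analyst. Decompose the task into a plan.",
    "coder": "You are a developer. Write or repair code following the plan and any test report.",
    "tester": "You are a tester. Review the code and report problems, or reply PASS.",
}

# An LLM is modelled as any function (role_instruction, message) -> reply.
LLM = Callable[[str, str], str]


def self_collaborate(task: str, llm: LLM, max_interactions: int = 4) -> Dict[str, object]:
    """Run analyst -> coder -> tester, repeating the coder/tester exchange
    for at most max_interactions rounds (assumed meaning of MI)."""
    blackboard: List[Dict[str, str]] = []  # shared log that every role can read

    def post(role: str, message: str) -> str:
        # The role instruction is fixed per role; only the message changes.
        reply = llm(ROLE_INSTRUCTIONS[role], message)
        blackboard.append({"role": role, "content": reply})
        return reply

    plan = post("analyst", f"Task:\n{task}")
    code = post("coder", f"Task:\n{task}\n\nPlan:\n{plan}")
    for _ in range(max_interactions):
        report = post("tester", f"Task:\n{task}\n\nCode:\n{code}")
        if report.strip().upper().startswith("PASS"):
            break  # tester is satisfied, stop iterating
        code = post("coder", f"Plan:\n{plan}\n\nCode:\n{code}\n\nTest report:\n{report}")
    return {"code": code, "blackboard": blackboard}


def fake_llm(role_instruction: str, message: str) -> str:
    """Deterministic stand-in for a chat model, used only for this demo."""
    if "analyst" in role_instruction:
        return "1. Add the two numbers and return the result."
    if "tester" in role_instruction:
        return "PASS" if "return a + b" in message else "Bug: the function subtracts."
    # Coder: the first draft is buggy, the repair after a test report is correct.
    if "Test report" in message:
        return "def add(a, b):\n    return a + b"
    return "def add(a, b):\n    return a - b"


result = self_collaborate("Write add(a, b) that returns the sum.", fake_llm)
print(result["code"])
```

Passing the model in as a plain callable keeps the sketch runnable offline and makes the round limit easy to vary when studying how many interactions are worthwhile.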

Experimental Results

Main Results

Approach	HumanEval	HumanEval-ET	MBPP	MBPP-ET
Direct	57.3	42.7	52.2	36.8
Self-collaboration (Virtual Team)	74.4	56.1	68.2	49.5
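As a quick check on the table, the relative gain of the virtual team over direct generation can be computed per benchmark. The sketch below assumes the table values are Pass@1 percentages; the helper name `relative_gain` is introduced here for illustration. The rows in this table alone give gains of roughly 30% to 35%, so the wider range quoted from the abstract presumably covers settings beyond this table.

```python
# Scores copied from the table above (assumed to be Pass@1, in percent).
direct = {"HumanEval": 57.3, "HumanEval-ET": 42.7, "MBPP": 52.2, "MBPP-ET": 36.8}
team = {"HumanEval": 74.4, "HumanEval-ET": 56.1, "MBPP": 68.2, "MBPP-ET": 49.5}


def relative_gain(base: float, new: float) -> float:
    """Relative improvement of new over base, in percent."""
    return (new - base) / base * 100.0


gains = {name: relative_gain(direct[name], team[name]) for name in direct}
for name, gain in gains.items():
    print(f"{name}: +{gain:.1f}% relative")
```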

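Self-collaboration yields a relative Pass@1 improvement of 29.9%–47.1% over direct generation.
The elementary three-role team (analyst, coder, tester) achieves state-of-the-art results on four code generation benchmarks and occasionally outperforms GPT-4.
Role-playing substantially outperforms the non-role prompt baselines under NL-driven prompts.
Additional interaction (more feedback rounds) shows diminishing returns after the initial rounds but still provides consistent benefits on complex tasks.
The approach is particularly beneficial on the extended-test benchmarks (HumanEval-ET and MBPP-ET), indicating better handling of edge cases and bugs.
Case studies show the framework autonomously solving complex real-world tasks (e.g., a Python game).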
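For readers unfamiliar with the Pass@k metric used in these results, here is a minimal sketch of the commonly used unbiased estimator, 1 - C(n-c, k) / C(n, k), where n samples are generated per problem and c of them pass the tests. Whether the paper computed its numbers exactly this way is an assumption; the function names are illustrative. With a single sample per problem (n = k = 1), Pass@1 reduces to the fraction of problems solved.

```python
from math import comb
from typing import List, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples passes,
    given that c of n generated samples passed."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # too few failing samples to fill a draw of size k
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: List[Tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; each entry is (n_samples, n_correct)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical outcomes for four problems with one sample each: three solved.
score = benchmark_pass_at_k([(1, 1), (1, 0), (1, 1), (1, 1)], k=1)
print(f"Pass@1 = {score:.2f}")
```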


This review was generated by AI and reviewed by a human editor.