[Paper Review] Self-collaboration Code Generation via ChatGPT
The paper introduces a self-collaboration framework where ChatGPT roles (analyst, coder, tester) work as a virtual team to generate code, achieving state-of-the-art results on code-generation benchmarks and even surpassing GPT-4 in some settings.
Although Large Language Models (LLMs) have demonstrated remarkable code-generation ability, they still struggle with complex tasks. In real-world software development, humans usually tackle complex tasks through collaborative teamwork, a strategy that significantly controls development complexity and enhances software quality. Inspired by this, we present a self-collaboration framework for code generation employing LLMs, exemplified by ChatGPT. Specifically, through role instructions, 1) multiple LLM agents act as distinct 'experts', each responsible for a specific subtask within a complex task; and 2) the way to collaborate and interact is specified, so that different roles form a virtual team to facilitate each other's work. Ultimately, the virtual team addresses code-generation tasks collaboratively without the need for human intervention. To effectively organize and manage this virtual team, we incorporate software-development methodology into the framework. Thus, we assemble an elementary team consisting of three LLM roles (i.e., analyst, coder, and tester) responsible for software development's analysis, coding, and testing stages. We conduct comprehensive experiments on various code-generation benchmarks. Experimental results indicate that self-collaboration code generation yields a relative improvement of 29.9%-47.1% in Pass@1 compared to the base LLM agent. Moreover, we showcase that self-collaboration could potentially enable LLMs to efficiently handle complex repository-level tasks that are not readily solved by the single LLM agent.
Motivation & Objective
- Address the difficulty of complex code-generation tasks by leveraging collaborative LLM teamwork.
- Propose a self-collaboration framework that assigns roles and defines inter-agent collaboration to solve tasks.
- Instantiate an elementary three-role team (analyst, coder, tester) following software-development methodology (SDM).
- Demonstrate robustness and generality across multiple benchmarks and real-world-like tasks.
Proposed method
- Define division of labor (DOL) via role instructions to create specialized LLM experts.
- Implement collaboration by sharing a blackboard and formalizing inter-role coordination (Eq. 1 and Eq. 2).
- Instantiate an elementary team (analyst, coder, tester) using three ChatGPT roles to follow a waterfall-like SDM (analysis, coding, testing).
- Use role instructions to fix roles once per agent initialization, enabling subsequent interaction without re-prompts.
- Evaluate using Pass@k (Pass@1 emphasized) on MBPP, HumanEval, MBPP-ET, and HumanEval-ET with NL-only prompts and NL+signature+public test cases settings.
- Explore the impact of role-playing versus non-role prompts and measure the effect of interaction rounds (MI).
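The method above can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the role-instruction wording, the `Blackboard` class, the `self_collaborate` function, the "PASS" convention for the tester, and the `max_rounds` parameter (standing in for the interaction limit) are all hypothetical, and the LLM is modelled as a plain callable so the example runs without any API. `fake_llm` is a scripted stand-in used only to exercise the loop.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical role instructions; the paper's actual prompts are not reproduced here.
ROLE_INSTRUCTIONS: Dict[str, str] = {
    "analyst": "You are an analyst. Decompose the task into a high-level plan.",
    "coder": "You are a coder. Write or repair code using the plan and any feedback.",
    "tester": "You are a tester. Reply 'PASS' or describe the defects you find.",
}

# Assumption: an LLM is any callable (role_instruction, message) -> reply.
LLM = Callable[[str, str], str]


@dataclass
class Blackboard:
    """Shared store that every role writes its output to."""
    entries: List[Dict[str, str]] = field(default_factory=list)

    def post(self, role: str, content: str) -> None:
        self.entries.append({"role": role, "content": content})

    def latest(self, role: str) -> str:
        for entry in reversed(self.entries):
            if entry["role"] == role:
                return entry["content"]
        return ""


def self_collaborate(task: str, llm: LLM, max_rounds: int = 4) -> str:
    """Waterfall-like pipeline: analysis, coding, then test/repair rounds."""
    board = Blackboard()
    board.post("user", task)

    # Analysis stage: the analyst turns the requirement into a plan.
    plan = llm(ROLE_INSTRUCTIONS["analyst"], task)
    board.post("analyst", plan)

    # Coding stage: the coder implements the plan.
    code = llm(ROLE_INSTRUCTIONS["coder"], f"Task:\n{task}\n\nPlan:\n{plan}")
    board.post("coder", code)

    # Testing stage: the tester reviews; the coder repairs until PASS or the limit.
    for _ in range(max_rounds):
        report = llm(ROLE_INSTRUCTIONS["tester"], f"Task:\n{task}\n\nCode:\n{code}")
        board.post("tester", report)
        if report.strip().upper().startswith("PASS"):
            break
        code = llm(
            ROLE_INSTRUCTIONS["coder"],
            f"Task:\n{task}\n\nCode:\n{code}\n\nTest report:\n{report}",
        )
        board.post("coder", code)

    return board.latest("coder")


def fake_llm(role_instruction: str, message: str) -> str:
    """Scripted stand-in for a chat model, used only to exercise the loop."""
    if role_instruction.startswith("You are an analyst"):
        return "1. Return the sum of the two arguments."
    if role_instruction.startswith("You are a tester"):
        return "PASS" if "a + b" in message else "Defect: the function subtracts."
    # Coder: the first draft is buggy; the repair after a test report is correct.
    if "Test report" in message:
        return "def add(a, b):\n    return a + b"
    return "def add(a, b):\n    return a - b"
```

Calling `self_collaborate("Write add(a, b).", fake_llm)` walks through one failed test round and one repair before the scripted tester answers PASS. In a real setup the callable would wrap a chat model whose role is fixed once at initialization, as the review describes.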
Experimental results
Key findings
- Self-collaboration improves code generation performance by 29.9%–47.1% Pass@1 over direct generation.
- An elementary three-role team (analyst, coder, tester) achieves state-of-the-art results on four code-generation benchmarks, sometimes surpassing GPT-4.
- Role-playing significantly outperforms non-role-playing baselines with NL-driven prompts.
- Interaction (more rounds of feedback) yields diminishing returns beyond the initial round but still provides consistent gains for complex tasks.
- The approach is especially beneficial on extended-test benchmarks (HumanEval-ET and MBPP-ET), indicating better handling of boundary cases and bugs.
- Case studies demonstrate the framework solving complex real-world tasks (e.g., a Python game) autonomously.
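To make the metrics behind these findings concrete, here is a small Python sketch. The review does not say how Pass@k is computed, so this assumes the unbiased estimator commonly used with HumanEval (Chen et al., 2021): with n samples per problem of which c pass, pass@k = 1 - C(n-c, k) / C(n, k). The function names and the numbers in the comments are hypothetical and are not results from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset contains at least one correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def relative_improvement(baseline: float, improved: float) -> float:
    """Relative gain in percent, e.g. of Pass@1 over a direct-generation baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline * 100.0


# Hypothetical illustration: a baseline Pass@1 of 50.0 rising to 65.0
# is a 15-point absolute gain but a 30% relative improvement.
example_gain = relative_improvement(50.0, 65.0)
```

The distinction matters when reading the headline figure: 29.9%-47.1% is a relative improvement over the base agent, not an absolute increase in Pass@1 points.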
This review was created by AI and reviewed by human editors.