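[Paper Review] AutoDev: Automated AI-Driven Development
AutoDev is a fully automated, AI-driven software development framework that autonomously plans and executes tasks in a Docker-isolated environment, achieving user-defined objectives through code editing, building, testing, and git operations. Without additional training, it shows strong code generation and test generation performance on HumanEval.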
The landscape of software development has witnessed a paradigm shift with the advent of AI-powered assistants, exemplified by GitHub Copilot. However, existing solutions are not leveraging all the potential capabilities available in an IDE such as building, testing, executing code, git operations, etc. Therefore, they are constrained by their limited capabilities, primarily focusing on suggesting code snippets and file manipulation within a chat-based interface. To fill this gap, we present AutoDev, a fully automated AI-driven software development framework, designed for autonomous planning and execution of intricate software engineering tasks. AutoDev enables users to define complex software engineering objectives, which are assigned to AutoDev's autonomous AI Agents to achieve. These AI agents can perform diverse operations on a codebase, including file editing, retrieval, build processes, execution, testing, and git operations. They also have access to files, compiler output, build and testing logs, static analysis tools, and more. This enables the AI Agents to execute tasks in a fully automated manner with a comprehensive understanding of the contextual information required. Furthermore, AutoDev establishes a secure development environment by confining all operations within Docker containers. This framework incorporates guardrails to ensure user privacy and file security, allowing users to define specific permitted or restricted commands and operations within AutoDev. In our evaluation, we tested AutoDev on the HumanEval dataset, obtaining promising results with 91.5% and 87.8% of Pass@1 for code generation and test generation respectively, demonstrating its effectiveness in automating software engineering tasks while maintaining a secure and user-controlled development environment.
Research Motivation and Goals
- Motivate autonomous AI-driven software development beyond code snippet suggestions.
- Enable complex SE tasks to be executed by AI agents with full repository access.
- Provide a secure, configurable environment with guardrails and permissions.
- Demonstrate effectiveness on code generation and test generation benchmarks.
Proposed Method
- Four-component architecture: Conversation Manager, Tools Library, Agent Scheduler, and Evaluation Environment.
- Rule and action configuration via YAML to customize agent permissions and capabilities.
- Agents (LLMs/SLMs) propose repository actions and are orchestrated by the Agent Scheduler.
- Secure execution of actions inside Docker-based Evaluation Environment with outputs fed back into conversations.
- Command tools include file editing, retrieval, build/execution, testing, and git operations; actions are parsed and validated before execution.
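
The parse-and-validate step described above can be illustrated with a minimal sketch. It is not AutoDev's actual implementation: the `RULES` dictionary stands in for whatever a parsed YAML configuration might contain, and every name here (`RULES`, `Action`, `parse_action`, `validate`, `handle`, the command names, and the restricted `push` subcommand) is a hypothetical chosen for illustration.

```python
from __future__ import annotations

import shlex
from dataclasses import dataclass

# Hypothetical rule set, standing in for a parsed YAML configuration.
RULES = {
    "enabled_commands": {"write", "retrieve", "build", "test", "git", "stop"},
    "disabled_git_subcommands": {"push"},  # illustrative restriction only
}


@dataclass
class Action:
    name: str
    args: list[str]


def parse_action(raw: str) -> Action:
    """Split an agent's proposed command line into a name and its arguments."""
    tokens = shlex.split(raw)
    if not tokens:
        raise ValueError("empty action")
    return Action(name=tokens[0], args=tokens[1:])


def validate(action: Action, rules: dict) -> tuple[bool, str]:
    """Return (allowed, reason) according to the configured permissions."""
    if action.name not in rules["enabled_commands"]:
        return False, f"command '{action.name}' is not enabled"
    if (
        action.name == "git"
        and action.args
        and action.args[0] in rules["disabled_git_subcommands"]
    ):
        return False, f"git subcommand '{action.args[0]}' is restricted"
    return True, "ok"


def handle(raw: str, rules: dict = RULES) -> str:
    """Parse and validate; the returned message is what would go back to the agent."""
    try:
        action = parse_action(raw)
    except ValueError as exc:
        return f"rejected: {exc}"
    allowed, reason = validate(action, rules)
    if not allowed:
        return f"rejected: {reason}"
    # A real system would execute the action inside the Docker container here.
    return f"accepted: {action.name} {action.args}"
```

Rejecting a disallowed action before it ever reaches the container, and returning the reason as text, lets the agent see why a command failed and propose a different one in the next turn.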

Experimental Results
Research Questions
- RQ1: How effective is AutoDev in code generation on the HumanEval dataset (Pass@1)?
- RQ2: How effective is AutoDev in test generation on HumanEval (Pass@1 and coverage)?
- RQ3: How efficient is AutoDev in completing tasks (number of steps, tokens, and command distribution)?
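
For reference, Pass@1 in RQ1 and RQ2 is the share of problems solved by a single attempt. The sketch below assumes one attempt per problem and uses a hypothetical helper name, `pass_at_1`. The example input is also an assumption: HumanEval contains 164 problems, and 150 solved out of 164 is one count consistent with a reported 91.5%, not a figure stated in the review.

```python
def pass_at_1(results: list[bool]) -> float:
    """Percentage of problems whose single attempt passed all tests."""
    if not results:
        raise ValueError("no results")
    return 100.0 * sum(results) / len(results)


# Assumed example: 150 of 164 problems solved on the first attempt.
example = [True] * 150 + [False] * 14
score = round(pass_at_1(example), 1)
```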
Key Results
Code generation on HumanEval:

| Approach | Model | Extra training | Pass@1 (%) |
|---|---|---|---|
| Language Agent Tree Search | GPT-4 | ✓ | 94.4 |
| AutoDev | GPT-4 | × | 91.5 |
| Reflexion | GPT-4 | ✓ | 91.0 |
| Zero-shot (baseline) | GPT-4 | × | 67.0 |

Test generation on modified HumanEval:

| Approach | Model | Pass@1 (%) | Passing coverage (%) | Overall coverage (%) |
|---|---|---|---|---|
| Human | - | 100 | 99.4 | 99.4 |
| AutoDev | GPT-4 | 87.8 | 99.3 | 88.8 |
| Zero-shot (baseline) | GPT-4 | 75 | 99.3 | 74 |

- AutoDev achieves Pass@1 of 91.5% for code generation on HumanEval, placing second on the leaderboard without extra training data.
- AutoDev achieves Pass@1 of 87.8% for test generation on modified HumanEval, with coverage comparable to human-written tests (99.3% vs 99.4%).
- AutoDev raises GPT-4's Pass@1 on code generation from 67% to 91.5%, a gain of 24.5 percentage points (about 37% relative to the baseline).
- For code generation, AutoDev uses an average of 5.5 commands per task (including 1.8 write, 1.7 test, and 0.92 stop); for test generation, it uses about 6.5 commands on average.
- The approach executes tasks through iterative, autonomous cycles within a secure Docker-based Evaluation Environment, with guardrails for permissions and reproducibility.
- AutoDev demonstrates multi-agent collaboration potential and human-in-the-loop capabilities (talk/ask) for future enhancements.
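
The iterative cycle described above (the agent proposes a command, the environment runs it, and the output is fed back into the conversation until the agent stops) can be sketched roughly as follows. This is a toy illustration under stated assumptions, not AutoDev's code: `run_task`, `scripted_agent`, and `fake_execute` are hypothetical names, the agent is a fixed script rather than an LLM, and execution is stubbed instead of running in a container.

```python
from typing import Callable


def run_task(
    agent: Callable[[list[str]], str],
    execute: Callable[[str], str],
    objective: str,
    max_steps: int = 10,
) -> list[str]:
    """Ask the agent for a command, run it, and append the output to the conversation."""
    conversation = [f"objective: {objective}"]
    for _ in range(max_steps):
        command = agent(conversation)
        conversation.append(f"agent: {command}")
        if command == "stop":
            break
        conversation.append(f"env: {execute(command)}")
    return conversation


def scripted_agent(conversation: list[str]) -> str:
    """Stub agent: issues a fixed write, test, stop sequence based on turns taken so far."""
    script = ["write solution.py", "test test_solution.py", "stop"]
    issued = sum(1 for message in conversation if message.startswith("agent: "))
    return script[min(issued, len(script) - 1)]


def fake_execute(command: str) -> str:
    """Stub environment; a real run would happen inside an isolated container."""
    return f"ran '{command}' (stubbed)"


log = run_task(scripted_agent, fake_execute, "implement and test a function")
```

The `max_steps` cap is a design choice of this sketch: it bounds the loop even if the agent never issues `stop`.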

This review was generated by AI and reviewed by a human editor.