QUICK REVIEW

[Paper Review] AutoDev: Automated AI-Driven Development

Michele Tufano, Anisha Agarwal | arXiv (Cornell University) | 2024-03-13

Scientific Computing and Data Management | Citations: 12

One-Line Summary

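AutoDev is a fully automated, AI-driven software development framework that autonomously plans and executes tasks inside a Docker-isolated environment, achieving user-defined objectives through code editing, building, testing, and git operations. Without any additional training, it shows strong code generation and test generation performance on HumanEval.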

ABSTRACT

The landscape of software development has witnessed a paradigm shift with the advent of AI-powered assistants, exemplified by GitHub Copilot. However, existing solutions are not leveraging all the potential capabilities available in an IDE such as building, testing, executing code, git operations, etc. Therefore, they are constrained by their limited capabilities, primarily focusing on suggesting code snippets and file manipulation within a chat-based interface. To fill this gap, we present AutoDev, a fully automated AI-driven software development framework, designed for autonomous planning and execution of intricate software engineering tasks. AutoDev enables users to define complex software engineering objectives, which are assigned to AutoDev's autonomous AI Agents to achieve. These AI agents can perform diverse operations on a codebase, including file editing, retrieval, build processes, execution, testing, and git operations. They also have access to files, compiler output, build and testing logs, static analysis tools, and more. This enables the AI Agents to execute tasks in a fully automated manner with a comprehensive understanding of the contextual information required. Furthermore, AutoDev establishes a secure development environment by confining all operations within Docker containers. This framework incorporates guardrails to ensure user privacy and file security, allowing users to define specific permitted or restricted commands and operations within AutoDev. In our evaluation, we tested AutoDev on the HumanEval dataset, obtaining promising results with 91.5% and 87.8% of Pass@1 for code generation and test generation respectively, demonstrating its effectiveness in automating software engineering tasks while maintaining a secure and user-controlled development environment.

Research Motivation and Goals

Move AI-assisted software development beyond code snippet suggestions toward autonomous planning and execution.
Enable AI agents with full repository access to carry out complex software engineering (SE) tasks.
Provide a secure, configurable environment with guardrails and permissions.
Demonstrate effectiveness on code generation and test generation benchmarks.

Proposed Method

Four-component architecture: Conversation Manager, Tools Library, Agent Scheduler, and Evaluation Environment.
Rule and action configuration via YAML to customize agent permissions and capabilities.
Agents (LLMs/SLMs) propose repository actions and are orchestrated by the Agent Scheduler.
Secure execution of actions inside Docker-based Evaluation Environment with outputs fed back into conversations.
Command tools include file editing, retrieval, build/execution, testing, and git operations; actions are parsed and validated before execution.
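
The parse-and-validate step in the last item can be pictured roughly as follows. This is a minimal sketch, not AutoDev's actual implementation: the command syntax, the `PERMITTED` allow-list, and the names `Command`, `parse_command`, and `validate` are all hypothetical, and a plain dictionary stands in for the YAML configuration mentioned above.

```python
import shlex
from dataclasses import dataclass

# Hypothetical allow-list standing in for the YAML rules file.
# Keys are command names; values are the number of arguments each expects.
PERMITTED = {"write": 2, "retrieve": 1, "build": 0, "test": 0, "git": 1, "stop": 0}


@dataclass
class Command:
    name: str
    args: list[str]


def parse_command(text: str) -> Command:
    """Split an agent message such as 'write app.py "print(1)"' into name and args."""
    tokens = shlex.split(text)
    if not tokens:
        raise ValueError("empty command")
    return Command(name=tokens[0], args=tokens[1:])


def validate(cmd: Command, permitted: dict[str, int] = PERMITTED) -> None:
    """Raise if the command is not on the allow-list or has the wrong argument count."""
    if cmd.name not in permitted:
        raise PermissionError(f"command not permitted: {cmd.name}")
    expected = permitted[cmd.name]
    if len(cmd.args) != expected:
        raise ValueError(
            f"{cmd.name} expects {expected} argument(s), got {len(cmd.args)}"
        )
```

In this sketch, anything not named in the allow-list is rejected before it reaches the execution environment, which is one simple way to realize user-defined permitted and restricted commands.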

Figure 1. AutoDev enables an AI Agent to achieve a given objective by performing several actions within the repository. The Eval Environment executes the suggested operations, providing the AI Agent with the resulting outcome. In the conversation, purple messages are from the AI agent, while blue me

Experimental Results

Research Questions

RQ1: How effective is AutoDev at code generation on the HumanEval dataset (Pass@1)?
RQ2: How effective is AutoDev at test generation on HumanEval (Pass@1 and coverage)?
RQ3: How efficient is AutoDev at completing tasks (number of steps, tokens, and command distribution)?
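
Pass@1, the metric named in RQ1 and RQ2, is the fraction of problems solved by a single generated attempt. The sketch below uses the general unbiased pass@k estimator that was introduced alongside HumanEval, which reduces to c/n when k = 1. The function names and the sample counts in the usage note are illustrative and are not taken from the AutoDev paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        # Every size-k subset of the samples contains at least one correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; each entry is (n_samples, n_correct)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For example, with one attempt per problem and three of four problems solved, `benchmark_pass_at_k([(1, 1), (1, 0), (1, 1), (1, 1)], k=1)` gives 0.75.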

Key Results

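Code generation (HumanEval):
Approach	Model	Extra Training	Pass@1 (%)
Language Agent Tree Search	GPT-4	✓	94.4
AutoDev	GPT-4	×	91.5
Reflexion	GPT-4	✓	91.0
Zero-shot (baseline)	GPT-4	×	67.0

Test generation (HumanEval):
Approach	Model	Pass@1 (%)	Passing Coverage (%)	Overall Coverage (%)
Human	-	100	99.4	99.4
AutoDev	GPT-4	87.8	99.3	88.8
Zero-shot (baseline)	GPT-4	75	99.3	74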

AutoDev achieves Pass@1 of 91.5% for code generation on HumanEval, placing second on the leaderboard without extra training data.
AutoDev achieves Pass@1 of 87.8% for test generation on modified HumanEval, with coverage comparable to human-written tests (99.3% vs 99.4%).
AutoDev raises GPT-4's Pass@1 on code generation from 67% to 91.5%, a gain of 24.5 percentage points (about 37% relative to the zero-shot baseline).
For code generation, AutoDev uses an average of 5.5 commands per task (including 1.8 write, 1.7 test, and 0.92 stop); for test generation, it uses about 6.5 commands on average.
The approach executes tasks through iterative, autonomous cycles within a secure Docker-based Evaluation Environment, with guardrails for permissions and reproducibility.
AutoDev demonstrates multi-agent collaboration potential and human-in-the-loop capabilities (talk/ask) for future enhancements.
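
The iterative cycle described in these results can be sketched as a loop in which an agent proposes a command, an environment runs it, and the output is appended to the conversation. This is only an illustration under stated assumptions: `FakeEnvironment`, `run_task`, and `scripted_agent` are hypothetical names, the agent is a fixed script rather than an LLM, and the environment is an in-memory stand-in rather than a Docker container.

```python
from typing import Callable


class FakeEnvironment:
    """In-memory stand-in for the Docker-based Evaluation Environment."""

    def __init__(self) -> None:
        self.files: dict[str, str] = {}

    def execute(self, command: str) -> str:
        name, _, rest = command.partition(" ")
        if name == "write":
            path, _, content = rest.partition(" ")
            self.files[path] = content
            return f"wrote {path}"
        if name == "test":
            # Toy check: the "tests" pass once solution.py exists.
            if "solution.py" in self.files:
                return "PASS"
            return "FAIL: solution.py missing"
        return f"unknown command: {name}"


def run_task(
    agent: Callable[[list[str]], str],
    env: FakeEnvironment,
    max_steps: int = 10,
) -> list[str]:
    """Alternate agent proposals and environment feedback until 'stop' or the step limit."""
    conversation: list[str] = []
    for _ in range(max_steps):
        command = agent(conversation)
        conversation.append(f"agent: {command}")
        if command == "stop":
            break
        conversation.append(f"env: {env.execute(command)}")
    return conversation


def scripted_agent(conversation: list[str]) -> str:
    """Hypothetical agent: run tests, write a fix after a failure, stop after a pass."""
    if not conversation:
        return "test"
    last = conversation[-1]
    if last.startswith("env: FAIL"):
        return "write solution.py def f(): return 1"
    if last == "env: PASS":
        return "stop"
    return "test"
```

Calling `run_task(scripted_agent, FakeEnvironment())` with this script issues four commands (test, write, test, stop). The step limit in `run_task` is a design choice of the sketch that keeps a misbehaving agent from looping indefinitely.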

Figure 2. Overview of the AutoDev Framework: The user initiates the process by defining the objective to be achieved. The Conversation Manager initializes the conversation and settings. The Agent Scheduler orchestrates AI agents to collaborate on the task and forwards their commands to the Conversat

This review was generated by AI and checked by a human editor.