QUICK REVIEW

[Paper Review] SEMAG: Self-Evolutionary Multi-Agent Code Generation

Yulin Peng, Haowen Hou | arXiv (Cornell University) | 2026. 03. 16.

Software Engineering Research | Citations: 0

One-Line Summary

SEMAG is a self-evolutionary multi-agent code generation framework that adapts its planning, debugging, and backbone model in real time, achieving state-of-the-art Pass@1 on seven benchmarks.

ABSTRACT

Large Language Models (LLMs) have made significant progress in handling complex programming tasks. However, current methods rely on manual model selection and fixed workflows, which limit their ability to adapt to changing task complexities. To address this, we propose SEMAG, a Self-Evolutionary Multi-Agent code Generation framework that mimics human coding practices. It decomposes programming tasks into stages, including planning, coding, debugging, and discussion, while adapting workflows to task difficulty. Its self-evolutionary agents can access the latest models in real time and automatically upgrade the backbone model. SEMAG sets new state-of-the-art Pass@1 accuracy across benchmarks. Using identical backbone models, SEMAG outperforms prior methods by 3.3% on CodeContests. When augmented with self-evolutionary model selection that automatically identifies optimal backbones, SEMAG reaches 52.6%, showcasing both framework effectiveness and adaptability to evolving LLM capabilities.

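Motivation and Objectives

Motivates the need for adaptive, dynamic workflows in LLM-based code generation.
Proposes a hierarchical multi-agent framework that adjusts reasoning depth and workflow to task complexity.
Introduces a self-evolution mechanism that automatically selects and upgrades the backbone model in real time.
Demonstrates state-of-the-art Pass@1 accuracy on seven text-to-code benchmarks and analyzes the efficiency gains.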

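Proposed Method

Introduces a four-level hierarchical code synthesis framework that progresses from direct generation to multi-agent refinement.
Includes an adaptive level-transition mechanism in which trace similarity dynamically triggers moves between levels.
Implements self-evolution through parallel model-selector agents that explore, filter, and vote to choose the best backbone model in real time.
Uses planning, verification, debugging, and discussion agents, with a discuss-and-decide module that helps avoid local optima and refine solutions.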
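The level-transition idea above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names `solve_with_escalation`, `trace_similarity`, `stall_threshold`, and `max_tries_per_level` are hypothetical, and `difflib.SequenceMatcher` is only a stand-in for whatever similarity measure the paper actually uses. The sketch tries cheaper levels first and moves to the next level when two consecutive failure traces look nearly identical, which suggests the current level has stalled.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces; none of these names come from the paper.
Solver = Callable[[str], str]    # task description -> candidate code
Checker = Callable[[str], str]   # candidate code -> failure trace ("" means tests passed)


def trace_similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; the paper's real measure is not specified here."""
    return SequenceMatcher(None, a, b).ratio()


def solve_with_escalation(
    task: str,
    levels: List[Solver],
    run_tests: Checker,
    max_tries_per_level: int = 3,
    stall_threshold: float = 0.9,
) -> Tuple[Optional[str], List[int]]:
    """Try levels in order; escalate early when failure traces stop changing."""
    history: List[int] = []  # index of the level used for each attempt
    for idx, solver in enumerate(levels):
        prev_trace: Optional[str] = None
        for _ in range(max_tries_per_level):
            history.append(idx)
            code = solver(task)
            trace = run_tests(code)
            if trace == "":
                return code, history  # all tests passed
            stalled = (
                prev_trace is not None
                and trace_similarity(prev_trace, trace) >= stall_threshold
            )
            if stalled:
                break  # same failure again: move to the next level
            prev_trace = trace
    return None, history  # every level was exhausted
```

In this sketch each entry of `levels` would wrap one workflow tier (for example direct generation first and multi-agent refinement last), and `history` shows how many attempts each tier consumed before success.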

Figure 1: Overview workflow of Self-Evolution Agents. Agents integrate insights from recent research, news, and community discussions, dynamically identify and deploy the most suitable models.
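As a rough illustration of the explore, filter, and vote steps performed by the parallel selector agents, the sketch below aggregates several ranked candidate lists into one choice. This is not the paper's procedure: the voting rule used here is a plain Borda count chosen for simplicity, and the function names `filter_candidates` and `vote_backbone`, as well as the placeholder model names in the usage note, are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Optional, Set


def filter_candidates(ranked: Iterable[str], available: Set[str]) -> List[str]:
    """Keep only models that can actually be deployed, preserving rank order."""
    return [model for model in ranked if model in available]


def vote_backbone(
    selector_rankings: List[List[str]], available: Set[str]
) -> Optional[str]:
    """Borda-style vote over the selectors' filtered rankings (assumed rule)."""
    scores: Counter = Counter()
    for ranking in selector_rankings:
        kept = filter_candidates(ranking, available)
        for position, model in enumerate(kept):
            # The top of a list of length n earns n points, the last earns 1.
            scores[model] += len(kept) - position
    if not scores:
        return None  # no selector proposed a deployable model
    # Highest score wins; ties are broken alphabetically for determinism.
    return min(scores, key=lambda model: (-scores[model], model))
```

For example, with rankings `[["model-a", "model-b", "model-x"], ["model-b", "model-a"], ["model-b", "model-c"]]` and available models `{"model-a", "model-b", "model-c"}`, the unavailable `model-x` is filtered out and `model-b` wins with 5 points against 3 for `model-a` and 1 for `model-c`.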

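Experimental Results

Research Questions

RQ1: Can a self-evolutionary multi-agent workflow improve code generation performance across diverse benchmarks?
RQ2: Can adaptive planning depth and collaborative debugging raise accuracy while reducing token usage?
RQ3: Can automatic backbone model switching sustain high performance as task difficulty and model capabilities evolve?
RQ4: How do tool use in planning and the various ablations affect overall performance?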

Key Results

Model/Method	HumanEval (GPT-3.5)	MBPP (GPT-3.5)	HumanEval-ET (GPT-3.5)	MBPP-ET (GPT-3.5)
SEMAG (ours)	91.5%	76.2%	79.9%	64.4%
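For readers unfamiliar with the metric, the sketch below computes Pass@k with the standard unbiased estimator introduced alongside HumanEval, 1 - C(n-c, k) / C(n, k), where n is the number of samples per problem and c the number that pass all tests. For k = 1 this reduces to c / n. Whether SEMAG is scored with a single generation per problem or with multiple samples is not stated in this review, so the sampling setup, the function names, and the counts in the usage note are assumptions for illustration only, not results from the paper.

```python
from math import comb
from typing import Sequence, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k for one problem: n samples drawn, c of them correct."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: Sequence[Tuple[int, int]], k: int) -> float:
    """Average Pass@k over problems; each entry is (n_samples, n_correct)."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With one generation per problem, `benchmark_pass_at_k([(1, 1), (1, 0), (1, 1), (1, 1)], k=1)` is simply the fraction of problems solved, 0.75 in this made-up case.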

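With GPT-4o as the backbone, SEMAG sets a new state of the art in Pass@1 on seven benchmarks (e.g., 98.8% on HumanEval, 87.6% on MBPP).
On CodeContests, SEMAG reaches 38.0% Pass@1, a 3.3% gain over the baseline using the same fixed backbone (LPW), and self-evolution raises this to 52.6%.
Compared with fixed-depth baselines, adaptive hierarchical prompting improves accuracy while reducing token consumption across datasets.
Ablation studies confirm that the full SEMAG, with all of its components (SMC: Plan-Verifier-Discuss-Decide), outperforms partial configurations (e.g., 91.5% Pass@1 on HumanEval with GPT-3.5).
Self-evolution with parallel selectors is shown to identify strong backbones (e.g., Claude-3.7-Sonnet reaches 52.6% on CodeContests; other backbones reach 48.7-48.7%).
Planning with tool use provides a measurable benefit (e.g., a 3.7% increase in Pass@1 on HumanEval with GPT-3.5).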

Figure 2: Overview of SEMAG. (1) Self-Evolve: Agents dynamically select optimal backbone LLMs per task requirements. (2) Plan: Planning Agent creates solution plans validated by Plan Verifying Agent through I/O simulation. (3) Debug: Coding Agent generates code; upon failure, specialized agents (Emb


This review was generated by AI and checked by a human editor.