QUICK REVIEW

[Paper Review] SEMAG: Self-Evolutionary Multi-Agent Code Generation

Yulin Peng, Haowen Hou | arXiv (Cornell University) | Mar 16, 2026

Software Engineering Research | Citations: 0

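One-Sentence Summary

SEMAG generates code through a self-evolutionary multi-agent framework that adaptively adjusts planning, debugging, and the backbone model in real time, achieving state-of-the-art Pass@1 on seven benchmarks.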

ABSTRACT

Large Language Models (LLMs) have made significant progress in handling complex programming tasks. However, current methods rely on manual model selection and fixed workflows, which limit their ability to adapt to changing task complexities. To address this, we propose SEMAG, a Self-Evolutionary Multi-Agent code Generation framework that mimics human coding practices. It decomposes programming tasks into stages, including planning, coding, debugging, and discussion, while adapting workflows to task difficulty. Its self-evolutionary agents can access the latest models in real time and automatically upgrade the backbone model. SEMAG sets new state-of-the-art Pass@1 accuracy across benchmarks. Using identical backbone models, SEMAG outperforms prior methods by 3.3% on CodeContests. When augmented with self-evolutionary model selection that automatically identifies optimal backbones, SEMAG reaches 52.6%, showcasing both framework effectiveness and adaptability to evolving LLM capabilities.

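Research Motivation and Objectives

Motivates the need for adaptive, dynamic workflows in LLM-based code generation.
Proposes a hierarchical multi-agent framework that adjusts reasoning depth and workflow to task complexity.
Introduces a self-evolution mechanism that automatically selects and upgrades the backbone model in real time.
Demonstrates state-of-the-art Pass@1 accuracy on seven text-to-code benchmarks and analyzes the efficiency gains.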

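Proposed Method

Introduces a four-level hierarchical code synthesis framework that progresses from direct generation to multi-agent refinement.
Incorporates an adaptive level-transition mechanism, based on trace similarity, that switches levels dynamically.
Implements self-evolution: parallel model selector agents search, filter, and vote to choose the best backbone model in real time.
Uses planning, verification, debugging, and discussion agents together with a discussion-decision module to avoid local optima and refine solutions.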
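The level-transition idea can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the level names, the `solve` and `trace_similarity` helpers, the retry limit, and the 0.9 threshold are hypothetical, and `difflib.SequenceMatcher` is only a stand-in for whatever trace-similarity measure the paper actually uses. The sketch escalates to the next level when two consecutive failing traces are nearly identical, which suggests that retrying at the current level is no longer making progress.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Optional, Tuple

# Hypothetical level names; the review only states that there are four levels,
# running from direct generation to multi-agent refinement.
LEVELS = ["direct", "planned", "debugged", "multi_agent"]

# An attempt takes (level, try_index) and returns (passed, trace).
Attempt = Callable[[str, int], Tuple[bool, str]]


def trace_similarity(a: str, b: str) -> float:
    """Stand-in similarity score in [0, 1]; not the paper's measure."""
    return SequenceMatcher(None, a, b).ratio()


def solve(attempt: Attempt, max_tries_per_level: int = 3,
          stall_threshold: float = 0.9) -> Tuple[Optional[str], List[str]]:
    """Return (level that solved the task or None, levels visited)."""
    visited: List[str] = []
    for level in LEVELS:
        visited.append(level)
        previous_trace: Optional[str] = None
        for try_index in range(max_tries_per_level):
            passed, trace = attempt(level, try_index)
            if passed:
                return level, visited
            stalled = (previous_trace is not None and
                       trace_similarity(previous_trace, trace) >= stall_threshold)
            if stalled:
                break  # retries are not changing behaviour; move up a level
            previous_trace = trace
    return None, visited


def fake_attempt(level: str, try_index: int) -> Tuple[bool, str]:
    # Toy stand-in for code generation plus test execution:
    # pretend only the "debugged" level can solve the task.
    if level == "debugged":
        return True, "all tests passed"
    return False, "AssertionError: expected 3, got 2"


solved_at, visited = solve(fake_attempt)
print(solved_at, visited)
```

In this toy run the first two levels each fail twice with identical traces, so the loop escalates early instead of spending the full retry budget, and the task is solved at the third level.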

Figure 1: Overview workflow of Self-Evolution Agents. Agents integrate insights from recent research, news, and community discussions, dynamically identify and deploy the most suitable models.
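The search, filter, and vote steps of the selector agents could look roughly like the sketch below. This is an assumption-laden illustration rather than the authors' implementation: `pick_backbone`, the selector callables, the `available` set, the one-vote-per-selector rule, and the placeholder model names are all hypothetical. Each selector proposes a ranked list of models, unavailable models are filtered out, and the remaining top choice of each selector counts as one vote.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Optional, Set

# A selector is any callable that returns ranked candidate model names for a task.
Selector = Callable[[str], List[str]]


def pick_backbone(task: str, selectors: Iterable[Selector],
                  available: Set[str]) -> Optional[str]:
    """Run selectors in parallel, drop unavailable models, return the majority pick."""
    selector_list = list(selectors)
    if not selector_list:
        return None
    with ThreadPoolExecutor(max_workers=len(selector_list)) as pool:
        proposals = list(pool.map(lambda select: select(task), selector_list))

    votes: Counter = Counter()
    for ranked in proposals:
        usable = [model for model in ranked if model in available]  # filter step
        if usable:
            votes[usable[0]] += 1  # each selector casts one vote
    if not votes:
        return None
    # Ties are broken alphabetically so the result is deterministic.
    best = max(votes.values())
    return min(model for model, count in votes.items() if count == best)


# Placeholder selectors and model names, for illustration only.
selectors = [
    lambda task: ["model-a", "model-b"],
    lambda task: ["model-c", "model-b"],  # model-c is not available and is filtered out
    lambda task: ["model-b"],
]
choice = pick_backbone("competitive programming", selectors,
                       available={"model-a", "model-b"})
print(choice)  # prints "model-b"
```

A simple majority vote keeps the sketch short; a real system might weight votes by evidence quality or benchmark scores, which the review does not specify.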

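Experimental Results

Research Questions

RQ1: Does the self-evolutionary multi-agent workflow improve code generation performance across diverse benchmarks?
RQ2: Do adaptive planning depth and collaborative debugging raise accuracy while reducing token usage?
RQ3: Can automatic backbone model switching sustain high performance as task difficulty and model capabilities evolve?
RQ4: How do the inclusion of tool use in planning and the various ablations affect overall performance?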

Key Findings

Model/Method	HumanEval (GPT-3.5)	MBPP (GPT-3.5)	HumanEval-ET (GPT-3.5)	MBPP-ET (GPT-3.5)
SEMAG (Ours)	91.5%	76.2%	79.9%	64.4%

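SEMAG achieves state-of-the-art Pass@1 on seven benchmarks with GPT-4o as the backbone (e.g., HumanEval 98.8%, MBPP 87.6%).
On CodeContests, SEMAG reaches 38.0% Pass@1, 3.3% above the fixed-backbone baseline (LPW); self-evolution raises this to 52.6%.
Adaptive hierarchical prompting improves accuracy while consuming fewer tokens than fixed-depth baselines.
In the ablation study, the full SEMAG with Plan-Verifier-Discuss-Decide outperforms partial configurations (e.g., 91.5% Pass@1 on HumanEval with GPT-3.5).
The parallel selectors used for self-evolution can identify strong backbones (e.g., Claude-3.7-Sonnet reaches 52.6% on CodeContests, versus 48.7-48.7% for the others).
Planning with tool use yields a measurable gain (a 3.7% Pass@1 improvement on HumanEval with GPT-3.5).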
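Since every result above is reported as Pass@1, a short sketch of the metric may help. It uses the commonly cited unbiased pass@k estimator introduced alongside the HumanEval benchmark, which for one sample per problem reduces to the fraction of problems whose solution passes all tests. The review does not say how SEMAG's scores were computed, so treat this as an assumption; the function names and the toy data are hypothetical.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: Iterable[Tuple[int, int]], k: int = 1) -> float:
    """Average pass@k over problems; each result is an (n, c) pair."""
    result_list = list(results)
    return sum(pass_at_k(n, c, k) for n, c in result_list) / len(result_list)


# Toy data: four problems, one sample each, three of them solved.
toy_results = [(1, 1), (1, 0), (1, 1), (1, 1)]
score = benchmark_pass_at_k(toy_results, k=1)
print(f"{score:.1%}")  # prints "75.0%"
```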

Figure 2: Overview of SEMAG. (1) Self-Evolve: Agents dynamically select optimal backbone LLMs per task requirements. (2) Plan: Planning Agent creates solution plans validated by Plan Verifying Agent through I/O simulation. (3) Debug: Coding Agent generates code; upon failure, specialized agents (Emb


This review was generated by AI and checked by a human editor.