QUICK REVIEW

[Paper Review] AI-for-Science Low-code Platform with Bayesian Adversarial Multi-Agent Framework

Zihang Zeng, Jiaquan Zhang|arXiv (Cornell University)|Mar 3, 2026

Scientific Computing and Data Management|Citations: 0

One-Sentence Summary

This paper proposes a Bayesian adversarial multi-agent framework (Task Manager, Solution Generator, Evaluator), implemented as a low-code platform, that co-evolves code, tests, and prompts to make AI-for-Science code generation more robust across diverse LLMs.

ABSTRACT

Large Language Models (LLMs) demonstrate potentials for automating scientific code generation but face challenges in reliability, error propagation in multi-agent workflows, and evaluation in domains with ill-defined success metrics. We present a Bayesian adversarial multi-agent framework specifically designed for AI for Science (AI4S) tasks in the form of a Low-code Platform (LCP). Three LLM-based agents are coordinated under the Bayesian framework: a Task Manager that structures user inputs into actionable plans and adaptive test cases, a Code Generator that produces candidate solutions, and an Evaluator providing comprehensive feedback. The framework employs an adversarial loop where the Task Manager iteratively refines test cases to challenge the Code Generator, while prompt distributions are dynamically updated using Bayesian principles by integrating code quality metrics: functional correctness, structural alignment, and static analysis. This co-optimization of tests and code reduces dependence on LLM reliability and addresses evaluation uncertainty inherent to scientific tasks. LCP also streamlines human-AI collaboration by translating non-expert prompts into domain-specific requirements, bypassing the need for manual prompt engineering by practitioners without coding backgrounds. Benchmark evaluations demonstrate LCP's effectiveness in generating robust code while minimizing error propagation. The proposed platform is also tested on an Earth Science cross-disciplinary task and demonstrates strong reliability, outperforming competing models.

Motivation and Objectives

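Address the reliability and error-propagation risks of multi-agent LLM code generation for scientific tasks.
Enable non-experts to turn vague domain prompts into executable, domain-consistent requirements.
Co-evolve code, test cases, and prompts using a Bayesian update rule that reduces dependence on any single model.
Demonstrate robustness and cross-domain effectiveness on Earth science and AI-for-Science benchmarks across multiple base models.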

Proposed Method

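Three-agent architecture: a Task Manager (Challenger), a Solution Generator (Solver), and an Evaluator iteratively co-optimize plans, test cases, and code.
Bayesian prompt update: $p(\mathrm{Prompt}^{t+1}_{ij} \mid S_3^t) \propto p(S_3^t \mid \mathrm{Prompt}^{t}_{ij})\, p(\mathrm{Prompt}^{t}_{ij})$, which enables recursive refinement without relying on a single LLM.
Prior estimation via Bayesian optimization: generated code is embedded using AST/code embeddings, and its performance is predicted from its structural similarity to already-tested codes, which steers the expensive evaluations.
Adversarial test case generation (ATC): the TM creates hard but solvable test cases that push the SG, improving robustness and reducing error propagation.
Iterative evaluation framework: a test case score S1, a code score S2, and a prompt score S3 are computed to drive the Bayesian prompt update and the selection of candidate prompts.
Sample code pool management: a pool of sample codes with high instruction quality is maintained and expanded by incorporating new high-performing code from the SG.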
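The prior estimation and the Bayesian prompt update listed above can be illustrated with a minimal Python sketch. This is not the paper's implementation: the review does not describe the authors' embedding model or likelihood, so the sketch substitutes a simple count of AST node types for the AST/code embeddings and a similarity-weighted average for the performance predictor. All function names, prompt labels, and score values are hypothetical.

```python
from __future__ import annotations

import ast
import math
from collections import Counter


def ast_node_counts(source: str) -> Counter:
    """Crude structural fingerprint: counts of AST node types.
    Assumption: stands in for the AST/code embeddings named in the review."""
    return Counter(type(node).__name__ for node in ast.walk(ast.parse(source)))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def predict_score(candidate: str, tested: list[tuple[str, float]]) -> float:
    """Predict a candidate's score as the similarity-weighted average of the
    scores of already-tested codes (a simplification, not Bayesian optimization)."""
    cand = ast_node_counts(candidate)
    weights = [cosine_similarity(cand, ast_node_counts(src)) for src, _ in tested]
    total = sum(weights)
    if total == 0:
        return sum(score for _, score in tested) / len(tested)
    return sum(w * score for w, (_, score) in zip(weights, tested)) / total


def bayes_update(prior: dict[str, float], likelihood: dict[str, float]) -> dict[str, float]:
    """posterior(prompt) is proportional to likelihood(S3 | prompt) * prior(prompt)."""
    unnormalized = {p: likelihood[p] * prior[p] for p in prior}
    z = sum(unnormalized.values())
    if z == 0:
        return dict(prior)  # no evidence at all: keep the prior unchanged
    return {p: v / z for p, v in unnormalized.items()}


# Hypothetical tested codes with made-up scores.
tested = [
    ("def f(x):\n    return x + 1\n", 0.9),
    ("for i in range(3):\n    print(i)\n", 0.4),
]
candidate = "def g(y):\n    return y * 2\n"
predicted = predict_score(candidate, tested)

# Hypothetical prompt candidates; the likelihood values stand in for p(S3 | prompt).
prior = {"prompt_a": 0.5, "prompt_b": 0.5}
likelihood = {"prompt_a": 0.9, "prompt_b": 0.3}
posterior = bayes_update(prior, likelihood)
```

Applying `bayes_update` again with the previous posterior as the new prior gives the recursive refinement that the update rule describes: prompts that repeatedly receive high scores accumulate probability mass, whichever LLM produced the code.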

Figure 1: Comparison between three code generation paradigms: Single LLM generator, multi-agent role playing and the proposed Bayesian adversarial multi-agent framework.

Experimental Results

Research Questions

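RQ1: Does the Bayesian adversarial multi-agent framework improve the reliability and robustness of AI-for-Science code generation across diverse LLMs?
RQ2: Can the adversarial test case generation mechanism suppress error propagation in multi-agent code generation pipelines?
RQ3: How does the framework perform against state-of-the-art baselines on AI-for-Science benchmarks and on general code generation benchmarks?
RQ4: Can non-expert domain users use the low-code platform to turn vague prompts into executable scientific workflows without expert prompt design?
RQ5: How much do multiple rounds of iterative prompt updates affect solution quality?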

Key Findings

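The framework produces robust solutions, reduces error propagation across base models ranging from 1.7B to 235B parameters, and shows marked improvement on the Earth science benchmark.
On SciCode, smaller open-source models equipped with the framework approach or surpass larger models in some settings (e.g., in some cases Qwen3-14b with the framework matches large-scale baselines).
On ScienceAgentBench with GPT-4o, the framework achieves a state-of-the-art Valid Execution Rate (VER) and remains competitive on SR/CBS scores.
Iterative Bayesian co-updating improves performance with each iteration, and ATC yields additional gains in later iterations.
The framework is robust to prompt quality: it narrows the gap between basic and expert prompts, so non-experts can also achieve strong results.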

Figure 2: Overview of the Bayesian adversarial multi-agent framework. The three red arrows indicate fusion of the user-approved plan, test cases, and codes into prompts, the distribution of which is recursively updated under the Bayesian framework. $S_{1}$ , $S_{2}$ and $S_{3}$ are the scores comput


This review was generated by AI and checked by a human editor.