QUICK REVIEW

[Paper Digest] ROMA: Recursive Open Meta-Agent Framework for Long-Horizon Multi-Agent Systems

Salaheddin Alzu'bi, Baran Nama|arXiv (Cornell University)|Feb 2, 2026

Multi-Agent Systems and Negotiation | Cited by 0

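ONE-SENTENCE SUMMARY

ROMA introduces a recursive, modular meta-agent framework built around four roles (Atomizer, Planner, Executor, Aggregator) and uses GEPA+ prompt optimization to improve long-horizon reasoning and output quality across multiple tasks.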

ABSTRACT

Current agentic frameworks underperform on long-horizon tasks. As reasoning depth increases, sequential orchestration becomes brittle, context windows impose hard limits that degrade performance, and opaque execution traces make failures difficult to localize or debug. We introduce ROMA (Recursive Open Meta-Agents), a domain-agnostic framework that addresses these limitations through recursive task decomposition and structured aggregation. ROMA decomposes goals into dependency-aware subtask trees that can be executed in parallel, while aggregation compresses and validates intermediate results to control context growth. Our framework standardizes agent construction around four modular roles -- Atomizer (which decides whether a task should be decomposed), Planner, Executor, and Aggregator -- which cleanly separate orchestration from model selection and enable transparent, hierarchical execution traces. This design supports heterogeneous multi-agent systems that mix models and tools according to cost, latency, and capability. To adapt ROMA to specific tasks without fine-tuning, we further introduce GEPA+, an improved Genetic-Pareto prompt proposer that searches over prompts within ROMA's component hierarchy while preserving interface contracts. We show that ROMA, combined with GEPA+, delivers leading system-level performance on reasoning and long-form generation benchmarks. On SEAL-0, which evaluates reasoning over conflicting web evidence, ROMA instantiated with GLM-4.6 improves accuracy by 9.9% over Kimi-Researcher. On EQ-Bench, a long-form writing benchmark, ROMA enables DeepSeek-V3 to match the performance of leading closed-source models such as Claude Sonnet 4.5. Our results demonstrate that recursive, modular agent architectures can scale reasoning depth while remaining interpretable, flexible, and model-agnostic.

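RESEARCH MOTIVATION AND OBJECTIVES

Address the brittleness of long-horizon agent systems and the limits imposed by context windows.
Provide a domain-agnostic, interpretable architecture that standardizes task decomposition and aggregation.
Enable parallel execution across heterogeneous models and tools while keeping context growth under control.
Introduce GEPA+, which adapts ROMA's prompts automatically and improves performance without fine-tuning.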

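PROPOSED METHOD

Define a recursive control loop built from four modular roles: Atomizer, Planner, Executor, and Aggregator.
Decompose non-atomic tasks into a graph of MECE (mutually exclusive, collectively exhaustive) subtasks that respects dependencies and supports parallel execution.
Aggregate and compress intermediate results into higher-level artifacts, which keeps context growth in check.
Decouple orchestration from model selection so that heterogeneous models and tools can work together.
Introduce GEPA+, which jointly optimizes the prompts of all components through multi-proposal generation, evaluation, validation, and contract-preserving merging.
Evaluate ROMA on reasoning and long-form generation benchmarks, comparing against baselines and reporting the improvements.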
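
The recursive control loop listed above can be illustrated with a minimal Python sketch. It is not ROMA's actual implementation or API: the `solve` function, the `Trace` class, the `max_depth` safeguard, and the four toy role functions are all hypothetical names chosen for illustration. Each role is passed in as a plain callable, which mirrors the idea of separating orchestration from whatever model or tool fills a role.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Trace:
    """One node of a hierarchical execution trace (hypothetical structure)."""
    task: str
    result: str = ""
    children: List["Trace"] = field(default_factory=list)


def solve(task, is_atomic, plan, execute, aggregate, depth=0, max_depth=3):
    """Recursively solve `task` and return the trace tree rooted at it."""
    node = Trace(task)
    # Atomizer role: decide whether the task should be decomposed.
    # The depth cap is an assumption added here to guarantee termination.
    if depth >= max_depth or is_atomic(task):
        node.result = execute(task)  # Executor role handles atomic tasks
        return node
    # Planner role: split the task into subtasks, each solved recursively.
    for sub in plan(task):
        node.children.append(
            solve(sub, is_atomic, plan, execute, aggregate, depth + 1, max_depth)
        )
    # Aggregator role: merge child results into one higher-level result.
    node.result = aggregate(task, [child.result for child in node.children])
    return node


# Toy stand-ins for the four roles; a real system would call models or tools.
def is_atomic(task):
    return "+" not in task


def plan(task):
    return [part.strip() for part in task.split("+")]


def execute(task):
    return task.upper()


def aggregate(task, results):
    return " | ".join(results)


trace = solve("find sources + draft summary", is_atomic, plan, execute, aggregate)
print(trace.result)  # FIND SOURCES | DRAFT SUMMARY
```

Because every call returns a `Trace` node that holds its children, the whole run can be inspected as a tree afterwards, which is one simple way to get the kind of hierarchical execution trace the digest describes.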

Figure 1: Overview of ROMA’s recursive architecture. An Atomizer determines whether a task is atomic. Atomic tasks are executed directly, while non-atomic tasks are decomposed into subtasks by a Planner. Each subtask is executed recursively by Executors, after which an Aggregator merges the output

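EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

RQ1: How does ROMA perform on long-horizon reasoning tasks that involve recursive task decomposition?
RQ2: Can ROMA's interpretability and traceability be preserved as reasoning depth grows?
RQ3: Does GEPA+ prompt optimization improve cross-domain task adaptation without fine-tuning?
RQ4: How does ROMA compare with open-source and closed-source baselines on SEAL-0, FRAMES, SimpleQA, and EQ-Bench?
RQ5: What are the computational cost and efficiency characteristics of ROMA and its GEPA+ variant during long-form generation?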

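KEY FINDINGS

ROMA with GLM-4.6 reaches 45.9% accuracy on SEAL-0, an improvement of 9.9 percentage points over Kimi-Researcher.
ROMA with GLM-4.6 reaches 82.3% accuracy on FRAMES, the highest among open-source systems.
On SimpleQA, ROMA with GLM-4.6 reaches 93.9%, the best open-source result and close to the top closed-source systems.
On EQ-Bench long-form writing, ROMA + GEPA+ scores 79.8%, on par with top models such as Claude Sonnet 4.5.
GEPA+ consistently yields absolute accuracy gains of 2–6 points while using fewer evaluations than standard GEPA, which makes task adaptation more efficient.
Combined with GEPA+, ROMA enables DeepSeek-V3 to match leading closed-source models on EQ-Bench.
The architecture shows that recursive, modular agents can scale reasoning depth while remaining interpretable and model-agnostic.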
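
The digest does not describe how GEPA+ works internally, so the following Python sketch only illustrates the generic Pareto-front idea that the "Genetic-Pareto" name points to: among several candidate prompts scored on several tasks, keep those that no other candidate beats everywhere. The function name `pareto_front`, the candidate names, and the scores are all hypothetical, and none of this is GEPA+ itself.

```python
def pareto_front(scores):
    """Return the names of non-dominated candidates, sorted alphabetically.

    `scores` maps a candidate name to a tuple of per-task scores,
    where higher is better on every task.
    """
    def dominates(a, b):
        # a dominates b if it is at least as good everywhere and better somewhere.
        at_least_as_good = all(x >= y for x, y in zip(a, b))
        strictly_better = any(x > y for x, y in zip(a, b))
        return at_least_as_good and strictly_better

    return sorted(
        name
        for name, s in scores.items()
        if not any(dominates(other, s) for o, other in scores.items() if o != name)
    )


# Made-up scores for three hypothetical prompt candidates on two tasks.
candidates = {
    "prompt_a": (0.6, 0.4),
    "prompt_b": (0.5, 0.7),
    "prompt_c": (0.4, 0.3),
}
print(pareto_front(candidates))  # ['prompt_a', 'prompt_b']
```

Here `prompt_c` is dropped because `prompt_a` is better on both tasks, while `prompt_a` and `prompt_b` both survive because each wins on a different task. Keeping such a front, rather than a single best candidate, is one way a search can retain prompts with complementary strengths.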

Figure 2: ROMA’s hierarchical execution flow. Non-atomic tasks are decomposed top-down through planning, with left-to-right dependencies guiding execution, while results are combined bottom-up through aggregation. Executors operate on atomic subtasks, producing intermediate outputs that are aggregat
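
The dependency-guided execution described in the Figure 2 caption can be sketched in Python with the standard-library `graphlib.TopologicalSorter` and a thread pool. This is a simplified assumption of how such scheduling might look, not ROMA's scheduler: the function `run_subtasks`, the subtask names, and the toy `run` callable are hypothetical. Subtasks whose dependencies are all finished are grouped into a wave and run in parallel.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter


def run_subtasks(deps, run):
    """Run subtasks in dependency order, parallelizing independent ones.

    `deps` maps each subtask name to the set of subtasks it depends on.
    `run(name, inputs)` receives the results of those dependencies.
    Returns (results, waves), where each wave lists tasks run together.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    results, waves = {}, []
    with ThreadPoolExecutor() as pool:
        while sorter.is_active():
            ready = sorted(sorter.get_ready())  # all dependencies are done
            waves.append(ready)
            futures = {
                name: pool.submit(
                    run, name, {d: results[d] for d in deps.get(name, ())}
                )
                for name in ready
            }
            # Wait for the whole wave before unlocking dependents.
            for name, future in futures.items():
                results[name] = future.result()
                sorter.done(name)
    return results, waves


# Hypothetical subtask graph: two independent searches feed a comparison.
deps = {
    "search_a": set(),
    "search_b": set(),
    "compare": {"search_a", "search_b"},
    "write": {"compare"},
}


def run(name, inputs):
    return f"{name}({','.join(sorted(inputs))})"


results, waves = run_subtasks(deps, run)
print(waves)             # [['search_a', 'search_b'], ['compare'], ['write']]
print(results["write"])  # write(compare)
```

Running in waves is the simplest choice and keeps the sketch short; a scheduler that starts each subtask the moment its own dependencies finish would reduce idle time but needs more bookkeeping.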

This digest was generated by AI and reviewed by human editors.