QUICK REVIEW

[Paper Review] ROMA: Recursive Open Meta-Agent Framework for Long-Horizon Multi-Agent Systems

Salaheddin Alzu'bi, Baran Nama|arXiv (Cornell University)|2026. 02. 02.
Multi-Agent Systems and Negotiation|Citations: 0
One-Line Summary

ROMA introduces a recursive, modular meta-agent framework that improves long-horizon reasoning and generation across diverse tasks through four roles (Atomizer, Planner, Executor, Aggregator) and GEPA+ prompt optimization.

ABSTRACT

Current agentic frameworks underperform on long-horizon tasks. As reasoning depth increases, sequential orchestration becomes brittle, context windows impose hard limits that degrade performance, and opaque execution traces make failures difficult to localize or debug. We introduce ROMA (Recursive Open Meta-Agents), a domain-agnostic framework that addresses these limitations through recursive task decomposition and structured aggregation. ROMA decomposes goals into dependency-aware subtask trees that can be executed in parallel, while aggregation compresses and validates intermediate results to control context growth. Our framework standardizes agent construction around four modular roles -- Atomizer (which decides whether a task should be decomposed), Planner, Executor, and Aggregator -- which cleanly separate orchestration from model selection and enable transparent, hierarchical execution traces. This design supports heterogeneous multi-agent systems that mix models and tools according to cost, latency, and capability. To adapt ROMA to specific tasks without fine-tuning, we further introduce GEPA+, an improved Genetic-Pareto prompt proposer that searches over prompts within ROMA's component hierarchy while preserving interface contracts. We show that ROMA, combined with GEPA+, delivers leading system-level performance on reasoning and long-form generation benchmarks. On SEAL-0, which evaluates reasoning over conflicting web evidence, ROMA instantiated with GLM-4.6 improves accuracy by 9.9% over Kimi-Researcher. On EQ-Bench, a long-form writing benchmark, ROMA enables DeepSeek-V3 to match the performance of leading closed-source models such as Claude Sonnet 4.5. Our results demonstrate that recursive, modular agent architectures can scale reasoning depth while remaining interpretable, flexible, and model-agnostic.

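Research Motivation and Goals

  • Address the brittleness and context-window limits of long-horizon agent systems.
  • Provide a domain-agnostic, interpretable architecture through standardized task decomposition and aggregation.
  • Enable heterogeneous models and tools and parallel execution while keeping context growth under control.
  • Introduce GEPA+ to adapt ROMA's prompts automatically and improve performance without fine-tuning.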

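Proposed Method

  • Define a recursive control loop with four modular roles (Atomizer, Planner, Executor, Aggregator).
  • Decompose non-atomic tasks into MECE (mutually exclusive, collectively exhaustive) subtask graphs that respect dependencies and allow parallel execution.
  • Aggregate and compress intermediate results to produce higher-level outputs and control context growth.
  • Separate orchestration from model selection to support heterogeneous models and tools.
  • Introduce GEPA+, which jointly optimizes prompts across components through multi-proposal generation, judging, verification, and contract-preserving merging.
  • Evaluate ROMA on reasoning and long-form generation benchmarks and report improvements over baselines.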
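
The recursive control loop described in the list above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Roles` container, the `solve` function, the `max_depth` cap, and the toy string-based roles are all hypothetical names and assumptions introduced here, and a real system would back each role with a model or tool call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Roles:
    """Hypothetical container for the four roles; each is a plain callable."""
    atomizer: Callable[[str], bool]              # True if the task is atomic
    planner: Callable[[str], List[str]]          # splits a task into subtasks
    executor: Callable[[str], str]               # solves an atomic task
    aggregator: Callable[[str, List[str]], str]  # merges child results


def solve(task: str, roles: Roles, depth: int = 0, max_depth: int = 3) -> str:
    """Execute the task if it is atomic; otherwise plan, recurse, and aggregate."""
    # The depth cap is an assumption added here to guarantee termination.
    if depth >= max_depth or roles.atomizer(task):
        return roles.executor(task)
    subtasks = roles.planner(task)
    results = [solve(sub, roles, depth + 1, max_depth) for sub in subtasks]
    return roles.aggregator(task, results)


# Toy roles: a task counts as atomic when it contains no " and ".
toy = Roles(
    atomizer=lambda t: " and " not in t,
    planner=lambda t: t.split(" and "),
    executor=lambda t: f"done({t})",
    aggregator=lambda t, rs: " + ".join(rs),
)

result = solve("research topic and draft outline and write summary", toy)
print(result)
```

Because each role is just a callable, different models or tools could be swapped in per role without touching `solve`, which is one way to read the separation of orchestration from model selection.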
Figure 1: Overview of ROMA’s recursive architecture. An Atomizer determines whether a task is atomic. Atomic tasks are executed directly, while non-atomic tasks are decomposed into subtasks by a Planner. Each subtask is executed recursively by Executors, after which an Aggregator merges the output

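Experimental Results

Research Questions

  • RQ1: How does ROMA perform on long-horizon reasoning tasks through recursive task decomposition?
  • RQ2: Does the ROMA architecture preserve interpretability and traceability as reasoning depth scales?
  • RQ3: Does GEPA+ prompt optimization improve ROMA's task adaptation across domains without fine-tuning?
  • RQ4: How does ROMA perform against open-source and closed-source baselines on SEAL-0, FRAMES, SimpleQA, and EQ-Bench?
  • RQ5: What are the computational cost and efficiency characteristics of ROMA and its GEPA+ variants during long-form generation?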

Key Results

  • ROMA with GLM-4.6 achieves 45.9% accuracy on SEAL-0, a 9.9 percentage point improvement over Kimi-Researcher.
  • ROMA with GLM-4.6 attains 82.3% on FRAMES, the highest among open-source systems.
  • On SimpleQA, ROMA with GLM-4.6 reaches 93.9%, the best open-source result and near top closed-source levels.
  • With ROMA + GEPA+, the EQ-Bench long-form writing score reaches 79.8%, on par with Claude Sonnet 4.5 and among the top models.
  • GEPA+ consistently yields 2–6 point absolute accuracy gains with fewer evaluations than standard GEPA, improving task adaptation efficiency.
  • ROMA enables DeepSeek-V3 to match leading closed-source models on EQ-Bench when combined with GEPA+.
  • The architecture demonstrates that recursive, modular agents can scale reasoning depth while remaining interpretable and model-agnostic.
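
The abstract describes GEPA+ as a Genetic-Pareto prompt proposer. As a rough illustration of the Pareto part only, the sketch below keeps the prompt candidates that no other candidate beats on every objective. It is a generic Pareto filter, not GEPA+ itself: the function names, the two objectives (accuracy and negated cost), and all scores are hypothetical values chosen for the example.

```python
from typing import Dict, List, Tuple


def dominates(a: Tuple[float, ...], b: Tuple[float, ...]) -> bool:
    """a dominates b if it is no worse on every objective and better on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores: Dict[str, Tuple[float, ...]]) -> List[str]:
    """Return the names of candidates that no other candidate dominates."""
    return sorted(
        name
        for name, s in scores.items()
        if not any(dominates(other, s) for o, other in scores.items() if o != name)
    )


# Hypothetical prompt candidates scored on (accuracy, negated cost); higher is better.
candidates = {
    "prompt_a": (0.70, -3.0),
    "prompt_b": (0.75, -5.0),
    "prompt_c": (0.68, -4.0),  # worse than prompt_a on both objectives
    "prompt_d": (0.75, -6.0),  # same accuracy as prompt_b but more costly
}

front = pareto_front(candidates)
print(front)  # ['prompt_a', 'prompt_b']
```

Keeping the whole front rather than a single best candidate preserves prompts that trade accuracy against cost differently, which a later selection or merging step can then choose among.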
Figure 2: ROMA’s hierarchical execution flow. Non-atomic tasks are decomposed top-down through planning, with left-to-right dependencies guiding execution, while results are combined bottom-up through aggregation. Executors operate on atomic subtasks, producing intermediate outputs that are aggregat
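
One way to picture the dependency-guided execution in Figure 2 is to group subtasks into waves, where every subtask in a wave has all of its dependencies finished and could run in parallel. The sketch below uses Python's standard `graphlib.TopologicalSorter`. The `parallel_waves` helper and the subtask names are hypothetical, and the paper's actual scheduler may work differently.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def parallel_waves(deps: Dict[str, Set[str]]) -> List[List[str]]:
    """Group subtasks into waves; tasks in one wave have no unmet dependencies."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose dependencies are all done
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical subtask graph: each key maps to the set of subtasks it depends on.
plan = {
    "search_sources": set(),
    "search_counterevidence": set(),
    "compare_claims": {"search_sources", "search_counterevidence"},
    "write_answer": {"compare_claims"},
}

waves = parallel_waves(plan)
print(waves)
# [['search_counterevidence', 'search_sources'], ['compare_claims'], ['write_answer']]
```

Here the two search subtasks form the first wave and can run concurrently, while the comparison and the final answer wait for their inputs.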

This review was generated by AI and reviewed by a human editor.