QUICK REVIEW

[Paper Review] Experience-Driven Multi-Agent Systems Are Training-free Context-aware Earth Observers

Pengyu Dai, Weihao Xuan|arXiv (Cornell University)|Jan 30, 2026
Multimodal Machine Learning Applications | Cited by 0
One-Sentence Summary

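GeoEvolver is a training-free, experience-driven multi-agent system that accumulates fine-grained Earth Observation (EO) tool-execution priors in a memory bank, improving end-to-end EO task performance without updating model parameters. It decomposes queries, explores tool configurations, and distills failures into reusable memories.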

ABSTRACT

Recent advances have enabled large language model (LLM) agents to solve complex tasks by orchestrating external tools. However, these agents often struggle in specialized, tool-intensive domains that demand long-horizon execution, tight coordination across modalities, and strict adherence to implicit tool constraints. Earth Observation (EO) tasks exemplify this challenge due to the multi-modal and multi-temporal data inputs, as well as the requirements of geo-knowledge constraints (spectrum library, spatial reasoning, etc.): many high-level plans can be derailed by subtle execution errors that propagate through a pipeline and invalidate final results. A core difficulty is that existing agents lack a mechanism to learn fine-grained, tool-level expertise from interaction. Without such expertise, they cannot reliably configure tool parameters or recover from mid-execution failures, limiting their effectiveness in complex EO workflows. To address this, we introduce GeoEvolver, a self-evolving multi-agent system (MAS) that enables LLM agents to acquire EO expertise through structured interaction without any parameter updates. GeoEvolver decomposes each query into independent sub-goals via a retrieval-augmented multi-agent orchestrator, then explores diverse tool-parameter configurations at the sub-goal level. Successful patterns and root-cause attribution from failures are then distilled in an evolving memory bank that provides in-context demonstrations for future queries. Experiments on three tool-integrated EO benchmarks show that GeoEvolver consistently improves end-to-end task success, with an average gain of 12% across multiple LLM backbones, demonstrating that EO expertise can emerge progressively from efficient, fine-grained interactions with the environment.

Research Motivation and Goals

  • Identify why EO failures stem from weak execution grounding (misconfigured tool parameters, unrecovered mid-run errors) rather than from planning alone.
  • Propose GeoEvolver to acquire EO expertise through structured interaction without updating model parameters.
  • Show that memory of execution experiences improves end-to-end EO task success across multiple LLM backbones.

Proposed Method

  • Decompose each EO query into modular sub-goals assigned to specialized executors.
  • Use a retrieval-augmented orchestrator to assemble sub-goals from a memory bank of patterns and failures.
  • Allow parallel exploration with multiple variants and retries to find robust tool configurations.
  • Judge and validate sub-goal trajectories and propagate success/failure signals to memory.
  • Maintain a two-tier memory system: a global Memory Bank and a local Working Memory.
  • Iteratively distill successful patterns and failure attributions into the memory bank through single-variant extraction and contrastive distillation.
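
The loop described above (retrieve, explore variants, judge, distill) can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: `MemoryItem`, `MemoryBank`, `explore`, and `toy_ndvi` are hypothetical names, retrieval is naive keyword overlap rather than whatever retriever the paper uses, and variants are tried sequentially instead of in parallel.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryItem:
    """One distilled experience (hypothetical schema)."""
    sub_goal: str
    lesson: str
    kind: str  # "success", "failure", or "contrast"


@dataclass
class MemoryBank:
    """Global store. Retrieval is keyword overlap, an assumption for this sketch."""
    items: list = field(default_factory=list)

    def retrieve(self, sub_goal, k=3):
        words = set(sub_goal.lower().split())
        scored = [(len(words & set(m.sub_goal.lower().split())), m) for m in self.items]
        ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: -s[0])
        return [m for _, m in ranked[:k]]


def explore(sub_goal, variants, run_tool, judge, bank):
    """Try each tool configuration, judge the result, and distill into the bank."""
    working_memory = bank.retrieve(sub_goal)  # local, per-query context
    successes, failures = [], []
    for config in variants:  # sequential here; the paper explores in parallel
        try:
            result = run_tool(config, working_memory)
        except Exception as err:
            failures.append((config, str(err)))
            continue
        if judge(result):
            successes.append((config, result))
        else:
            failures.append((config, f"rejected result: {result!r}"))

    if successes and failures:
        # Contrastive distillation: record what separated a good run from a bad one.
        good, (bad, reason) = successes[0][0], failures[0]
        bank.items.append(MemoryItem(sub_goal, f"prefer {good} over {bad}: {reason}", "contrast"))
    elif successes:
        # Single-variant extraction: only a success pattern is available.
        bank.items.append(MemoryItem(sub_goal, f"{successes[0][0]} worked", "success"))
    else:
        for config, reason in failures:
            bank.items.append(MemoryItem(sub_goal, f"{config} failed: {reason}", "failure"))
    return successes[0][1] if successes else None


def toy_ndvi(config, _memory):
    """Stand-in tool: NDVI from two reflectance values; fails on a zero denominator."""
    red, nir = config["red"], config["nir"]
    if red + nir == 0:
        raise ValueError("zero denominator")
    return (nir - red) / (nir + red)


bank = MemoryBank()
variants = [{"red": 0.0, "nir": 0.0}, {"red": 0.1, "nir": 0.5}]
best = explore("compute ndvi for scene", variants, toy_ndvi, lambda r: -1 <= r <= 1, bank)
```

After this run the bank holds one contrastive item, which a later query with overlapping wording (for example "ndvi scene") would retrieve into its working memory. No model weights change at any point; only the bank grows.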

Experimental Results

Research Questions

  • RQ1: Does GeoEvolver improve end-to-end EO task performance across diverse LLM backbones?
  • RQ2: How does model capacity affect GeoEvolver's gains across EO benchmarks?
  • RQ3: Is GeoEvolver robust across EO benchmarks with different tool–modality couplings?
  • RQ4: How does GeoEvolver compare to existing memory-based and multi-agent EO methods?
  • RQ5: What is the impact of the number of executors, reasoning variants, and memory items on performance?

Main Findings

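Method                           | Tool-A-O ↑ | Tool-I-O ↑ | Tool-E-M ↑ | Efficiency ↓ | Accuracy ↑
---------------------------------|------------|------------|------------|--------------|-----------
Expel                            | 32.72      | 25.94      | 22.48      | 1.79         | 22.58
Zhao et al. (Training-free GRPO) | 57.24      | 44.36      | 36.44      | 1.36         | 31.25
Chase (DeepAgents)               | 41.67      | 33.98      | 25.45      | 1.06         | 29.69
Earth-Agent-MAS                  | 32.28      | 26.96      | 20.91      | 1.47         | 15.87
Ours (GeoEvolver)                | 57.66      | 44.66      | 39.06      | 1.47         | 76.56
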
  • GeoEvolver achieves an average end-to-end accuracy gain of 12.56 percentage points across multiple backbones on EO benchmarks.
  • Smaller models gain disproportionately from memory-augmented experience, e.g., Qwen3-32B improves from 24.80% to 46.96% (+22.16 pp).
  • End-to-end accuracy can improve even when step-level scores decrease, indicating trajectories that are functionally correct but differ from the human reference.
  • GeoEvolver outperforms memory-based methods and fixed-workflow MAS on Earth-Agent benchmarks (e.g., 76.56% accuracy vs. 15.87% for Earth-Agent-MAS).
  • Ablations show self-contrast and parallel exploration contribute the largest gains, with notable drops when removed.
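
The gains quoted in this section are differences in percentage points (pp), not relative improvements. As a quick check, the arithmetic can be reproduced from the reported accuracy figures; the dictionary layout and the helper `pp_gain` are illustrative choices, and the numbers are copied from the table and bullets above.

```python
# Accuracy (%) per method, as reported in the comparison table.
accuracy = {
    "Expel": 22.58,
    "Zhao et al. (Training-free GRPO)": 31.25,
    "Chase (DeepAgents)": 29.69,
    "Earth-Agent-MAS": 15.87,
    "Ours (GeoEvolver)": 76.56,
}


def pp_gain(after, before):
    """Absolute difference in percentage points, rounded to two decimals."""
    return round(after - before, 2)


ours = accuracy["Ours (GeoEvolver)"]
gains = {
    name: pp_gain(ours, acc)
    for name, acc in accuracy.items()
    if name != "Ours (GeoEvolver)"
}

# Qwen3-32B figures from the bullet on smaller models.
qwen_gain = pp_gain(46.96, 24.80)

for name, gain in gains.items():
    print(f"GeoEvolver vs. {name}: +{gain:.2f} pp")
print(f"Qwen3-32B with GeoEvolver: +{qwen_gain:.2f} pp")
```

The Qwen3-32B figure matches the +22.16 pp stated above. In relative terms the same change is close to a doubling of the baseline (46.96 / 24.80 ≈ 1.89), which is why the distinction between pp and relative gain matters when comparing backbones.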


This review was generated by AI and reviewed by a human editor.