QUICK REVIEW

[Paper Review] Experience-Driven Multi-Agent Systems Are Training-free Context-aware Earth Observers

Pengyu Dai, Weihao Xuan|arXiv (Cornell University)|Jan 30, 2026

Multimodal Machine Learning Applications | Cited by 0

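One-Sentence Summary

GeoEvolver is a training-free, experience-driven multi-agent system that accumulates fine-grained priors about Earth Observation (EO) tool execution in a memory bank, improving end-to-end EO task performance without any parameter updates. It decomposes queries, explores tool configurations, and distills failures into reusable memories.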

ABSTRACT

Recent advances have enabled large language model (LLM) agents to solve complex tasks by orchestrating external tools. However, these agents often struggle in specialized, tool-intensive domains that demand long-horizon execution, tight coordination across modalities, and strict adherence to implicit tool constraints. Earth Observation (EO) tasks exemplify this challenge due to the multi-modal and multi-temporal data inputs, as well as the requirements of geo-knowledge constraints (spectrum library, spatial reasoning, etc.): many high-level plans can be derailed by subtle execution errors that propagate through a pipeline and invalidate final results. A core difficulty is that existing agents lack a mechanism to learn fine-grained, tool-level expertise from interaction. Without such expertise, they cannot reliably configure tool parameters or recover from mid-execution failures, limiting their effectiveness in complex EO workflows. To address this, we introduce GeoEvolver, a self-evolving multi-agent system (MAS) that enables LLM agents to acquire EO expertise through structured interaction without any parameter updates. GeoEvolver decomposes each query into independent sub-goals via a retrieval-augmented multi-agent orchestrator, then explores diverse tool-parameter configurations at the sub-goal level. Successful patterns and root-cause attribution from failures are then distilled in an evolving memory bank that provides in-context demonstrations for future queries. Experiments on three tool-integrated EO benchmarks show that GeoEvolver consistently improves end-to-end task success, with an average gain of 12% across multiple LLM backbones, demonstrating that EO expertise can emerge progressively from efficient, fine-grained interactions with the environment.

Research Motivation and Objectives

Identify why EO failures stem from execution-groundedness rather than planning alone.
Propose GeoEvolver to acquire EO expertise through structured interaction without updating model parameters.
Show that memory of execution experiences improves end-to-end EO task success across multiple LLM backbones.

Proposed Method

Decompose each EO query into modular sub-goals assigned to specialized executors.
Use a retrieval-augmented orchestrator to assemble sub-goals from a memory bank of patterns and failures.
Allow parallel exploration with multiple variants and retries to find robust tool configurations.
Judge and validate sub-goal trajectories and propagate success/failure signals to memory.
Maintain a two-tier memory system: a global Memory Bank and a local Working Memory.
Iteratively distill successful patterns and failure attributions into the memory bank through single-variant extraction and contrastive distillation.
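The loop described above (retrieve from memory, try several tool configurations for a sub-goal, then distill the outcome) can be illustrated with a toy sketch. This is not the authors' implementation: `MemoryItem`, `MemoryBank`, `explore_sub_goal`, `toy_index_tool`, the token-overlap retrieval, and the band names are all hypothetical stand-ins chosen for illustration, and variants are tried sequentially rather than in parallel.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Params = Dict[str, str]
Tool = Callable[[Params], Tuple[bool, str]]  # returns (ok, message)


@dataclass
class MemoryItem:
    kind: str      # "success" pattern or "failure" attribution
    sub_goal: str  # sub-goal the lesson was learned on
    lesson: str    # distilled, reusable note


@dataclass
class MemoryBank:
    """Global memory shared across queries (hypothetical structure)."""
    items: List[MemoryItem] = field(default_factory=list)

    def retrieve(self, sub_goal: str, k: int = 2) -> List[MemoryItem]:
        # Toy relevance: count of shared lowercase tokens (an assumption).
        query = set(sub_goal.lower().split())
        scored = [(len(query & set(m.sub_goal.lower().split())), m) for m in self.items]
        hits = sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)
        return [m for _, m in hits[:k]]


def explore_sub_goal(sub_goal: str, tool: Tool, variants: List[Params],
                     bank: MemoryBank) -> Optional[Params]:
    """Try each variant, then distill the outcome into the bank."""
    # Local working memory for this sub-goal. A real executor would place it
    # in the prompt; this sketch only fetches it.
    working_memory = bank.retrieve(sub_goal)
    trials = [(params, tool(params)) for params in variants]
    wins = [p for p, (ok, _) in trials if ok]
    fails = [(p, msg) for p, (ok, msg) in trials if not ok]

    if wins and fails:
        # Contrastive distillation: what separated success from failure.
        bad, msg = fails[0]
        lesson = f"{wins[0]} worked where {bad} failed ({msg})"
        bank.items.append(MemoryItem("success", sub_goal, lesson))
    elif wins:
        # Single-variant extraction: record the working configuration.
        bank.items.append(MemoryItem("success", sub_goal, f"{wins[0]} worked"))
    elif fails:
        # Nothing worked, so keep only the failure attribution.
        bad, msg = fails[0]
        bank.items.append(MemoryItem("failure", sub_goal, f"{bad} failed ({msg})"))
    return wins[0] if wins else None


def toy_index_tool(params: Params) -> Tuple[bool, str]:
    """Stand-in tool with an implicit constraint: the two bands must differ."""
    if params["nir"] == params["red"]:
        return False, "nir and red bands must differ"
    return True, "ok"


bank = MemoryBank()
variants = [{"red": "B4", "nir": "B4"}, {"red": "B4", "nir": "B8"}]
chosen = explore_sub_goal("compute vegetation index for scene A", toy_index_tool, variants, bank)
recalled = bank.retrieve("compute vegetation index for scene B")
print(chosen)
print(recalled[0].lesson)
```

The point of the sketch is that the lesson learned on scene A is retrievable for a similar sub-goal on scene B without any model update; only the memory bank changes.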

Experimental Results

Research Questions

RQ1: Does GeoEvolver improve end-to-end EO task performance across diverse LLM backbones?
RQ2: How does model capacity affect GeoEvolver's gains across EO benchmarks?
RQ3: Is GeoEvolver robust across EO benchmarks with different tool–modality couplings?
RQ4: How does GeoEvolver compare to existing memory-based and multi-agent EO methods?
RQ5: What is the impact of the number of executors, reasoning variants, and memory items on performance?

Key Findings

Method	Tool-A-O ↑	Tool-I-O ↑	Tool-E-M ↑	Efficiency ↓	Accuracy ↑
Expel	32.72	25.94	22.48	1.79	22.58
Zhao et al. (Training-free GRPO)	57.24	44.36	36.44	1.36	31.25
Chase (DeepAgents)	41.67	33.98	25.45	1.06	29.69
Earth-Agent-MAS	32.28	26.96	20.91	1.47	15.87
Ours (GeoEvolver)	57.66	44.66	39.06	1.47	76.56

GeoEvolver achieves an average end-to-end accuracy gain of 12.56 percentage points across multiple backbones on EO benchmarks.
Smaller models gain disproportionately from memory-augmented experience, e.g., Qwen3-32B improves from 24.80% to 46.96% (+22.16 pp).
End-to-end accuracy can improve even when step-level scores decrease, indicating trajectories that are functionally correct but differ from human-annotated ones.
GeoEvolver outperforms memory-based methods and fixed-workflow MAS on Earth-Agent benchmarks (e.g., 76.56% accuracy vs. 15.87% for Earth-Agent-MAS).
Ablations show self-contrast and parallel exploration contribute the largest gains, with notable drops when removed.
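The gains above are reported in percentage points (pp), which are absolute differences and not relative improvements. A short check using only the figures quoted in this review makes the distinction concrete; the helper name `gain_pp` is an illustrative assumption.

```python
def gain_pp(before: float, after: float) -> float:
    """Absolute gain in percentage points, rounded to two decimals."""
    return round(after - before, 2)


# Qwen3-32B: 24.80% -> 46.96% end-to-end accuracy.
qwen_gain = gain_pp(24.80, 46.96)

# GeoEvolver (76.56%) vs. Earth-Agent-MAS (15.87%) accuracy.
mas_gap = gain_pp(15.87, 76.56)

# The same Qwen3-32B change expressed as a relative improvement, in percent.
qwen_relative = round(100 * (46.96 - 24.80) / 24.80, 1)

print(qwen_gain, mas_gap, qwen_relative)
```

The first value reproduces the +22.16 pp quoted above; the relative figure is much larger because the baseline accuracy is low.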

This review was generated by AI and reviewed by a human editor.