[Paper Explained] Experience-Driven Multi-Agent Systems Are Training-free Context-aware Earth Observers
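GeoEvolver is a training-free, experience-driven multi-agent system that accumulates fine-grained priors on EO tool execution in a memory bank, improving end-to-end Earth Observation task performance without any parameter updates. It decomposes queries, explores tool configurations, and distills failures into reusable memory.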
Recent advances have enabled large language model (LLM) agents to solve complex tasks by orchestrating external tools. However, these agents often struggle in specialized, tool-intensive domains that demand long-horizon execution, tight coordination across modalities, and strict adherence to implicit tool constraints. Earth Observation (EO) tasks exemplify this challenge due to the multi-modal and multi-temporal data inputs, as well as the requirements of geo-knowledge constraints (spectrum library, spatial reasoning, etc.): many high-level plans can be derailed by subtle execution errors that propagate through a pipeline and invalidate final results. A core difficulty is that existing agents lack a mechanism to learn fine-grained, tool-level expertise from interaction. Without such expertise, they cannot reliably configure tool parameters or recover from mid-execution failures, limiting their effectiveness in complex EO workflows. To address this, we introduce GeoEvolver, a self-evolving multi-agent system (MAS) that enables LLM agents to acquire EO expertise through structured interaction without any parameter updates. GeoEvolver decomposes each query into independent sub-goals via a retrieval-augmented multi-agent orchestrator, then explores diverse tool-parameter configurations at the sub-goal level. Successful patterns and root-cause attribution from failures are then distilled in an evolving memory bank that provides in-context demonstrations for future queries. Experiments on three tool-integrated EO benchmarks show that GeoEvolver consistently improves end-to-end task success, with an average gain of 12% across multiple LLM backbones, demonstrating that EO expertise can emerge progressively from efficient, fine-grained interactions with the environment.
Research Motivation and Objectives
- Identify why EO failures stem from execution-groundedness rather than planning alone.
- Propose GeoEvolver to acquire EO expertise through structured interaction without updating model parameters.
- Show that memory of execution experiences improves end-to-end EO task success across multiple LLM backbones.
Proposed Method
- Decompose each EO query into modular sub-goals assigned to specialized executors.
- Use a retrieval-augmented orchestrator to assemble sub-goals from a memory bank of patterns and failures.
- Allow parallel exploration with multiple variants and retries to find robust tool configurations.
- Judge and validate sub-goal trajectories and propagate success/failure signals to memory.
- Maintain a two-tier memory system: a global Memory Bank and a local Working Memory.
- Iteratively distill successful patterns and failure attributions into the memory bank through single-variant extraction and contrastive distillation.
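The steps above can be pictured with a small, runnable sketch. It is not the authors' implementation: every name here (`MemoryBank`, `Trajectory`, `attempt`, `distill`, `solve`, `toy_ndvi_tool`) is hypothetical, retrieval is a toy word-overlap ranking, and the judge is reduced to "did the tool raise an error". The reading of single-variant extraction (store one successful configuration) and contrastive distillation (store what differs between a failed and a successful configuration) is one plausible interpretation of the summary, not a confirmed detail of the paper.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class MemoryItem:
    sub_goal: str
    kind: str  # "success_pattern" or "failure_attribution"
    note: str


@dataclass
class Trajectory:
    sub_goal: str
    config: dict
    ok: bool
    detail: str


@dataclass
class MemoryBank:
    """Global memory that persists across queries (toy version)."""
    items: list = field(default_factory=list)

    def retrieve(self, sub_goal: str, k: int = 3) -> list:
        # Toy retrieval: rank stored items by word overlap with the sub-goal.
        words = set(sub_goal.lower().split())
        scored = [(len(words & set(m.sub_goal.lower().split())), m) for m in self.items]
        hits = sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)
        return [m for _, m in hits[:k]]


def attempt(sub_goal: str, config: dict, run_tool) -> Trajectory:
    """Run one tool configuration; the 'judge' here is just success vs. ValueError."""
    try:
        return Trajectory(sub_goal, config, True, str(run_tool(sub_goal, config)))
    except ValueError as err:
        return Trajectory(sub_goal, config, False, str(err))


def distill(trajs: list, bank: MemoryBank) -> None:
    wins = [t for t in trajs if t.ok]
    fails = [t for t in trajs if not t.ok]
    if wins:  # single-variant extraction (assumed reading): keep one working config
        bank.items.append(MemoryItem(wins[0].sub_goal, "success_pattern", f"worked: {wins[0].config}"))
    if wins and fails:  # contrastive distillation (assumed reading): what changed?
        w, f = wins[0], fails[0]
        diff = {k: (f.config.get(k), v) for k, v in w.config.items() if f.config.get(k) != v}
        bank.items.append(MemoryItem(f.sub_goal, "failure_attribution", f"{f.detail}; fixed by {diff}"))


def solve(sub_goals: list, variants: list, run_tool, bank: MemoryBank) -> dict:
    working_memory = {}  # local, per-query state
    for sg in sub_goals:
        # In a real system the hints would be in-context demonstrations for the LLM;
        # this sketch only records them and does not use them to steer exploration.
        hints = bank.retrieve(sg)
        with ThreadPoolExecutor() as pool:  # parallel exploration of variants
            trajs = list(pool.map(lambda c: attempt(sg, c, run_tool), variants))
        working_memory[sg] = {"hints": hints, "trajectories": trajs}
        distill(trajs, bank)
    return working_memory


def toy_ndvi_tool(sub_goal: str, config: dict) -> str:
    """Hypothetical tool with an implicit constraint on band assignment."""
    if (config["red_band"], config["nir_band"]) != (4, 8):
        raise ValueError("band assignment does not match sensor")
    return "ndvi computed"


bank = MemoryBank()
variants = [{"red_band": 3, "nir_band": 8}, {"red_band": 4, "nir_band": 8}]
first = solve(["compute ndvi for scene"], variants, toy_ndvi_tool, bank)
second = solve(["compute ndvi for another scene"], variants, toy_ndvi_tool, bank)
print(len(first["compute ndvi for scene"]["hints"]))
print(len(second["compute ndvi for another scene"]["hints"]))
```

The first query starts with an empty memory bank, so it retrieves no hints; the second query retrieves the success pattern and the failure attribution that the first one left behind. No model weights change at any point, which is the sense in which the approach is training-free.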
Experimental Results
Research Questions
- RQ1: Does GeoEvolver improve end-to-end EO task performance across diverse LLM backbones?
- RQ2: How does model capacity affect GeoEvolver's gains across EO benchmarks?
- RQ3: Is GeoEvolver robust across EO benchmarks with different tool–modality couplings?
- RQ4: How does GeoEvolver compare to existing memory-based and multi-agent EO methods?
- RQ5: What is the impact of the number of executors, reasoning variants, and memory items on performance?
Key Findings
| Method | Tool-A-O ↑ | Tool-I-O ↑ | Tool-E-M ↑ | Efficiency ↓ | Accuracy ↑ |
|---|---|---|---|---|---|
| Expel | 32.72 | 25.94 | 22.48 | 1.79 | 22.58 |
| Zhao et al. (Training-free GRPO) | 57.24 | 44.36 | 36.44 | 1.36 | 31.25 |
| Chase (DeepAgents) | 41.67 | 33.98 | 25.45 | 1.06 | 29.69 |
| Earth-Agent-MAS | 32.28 | 26.96 | 20.91 | 1.47 | 15.87 |
| Ours (GeoEvolver) | 57.66 | 44.66 | 39.06 | 1.47 | 76.56 |
- GeoEvolver achieves an average end-to-end accuracy gain of 12.56 percentage points across multiple backbones on EO benchmarks.
- Smaller models gain disproportionately from memory-augmented experience, e.g., Qwen3-32B improves from 24.80% to 46.96% (+22.16 pp).
- End-to-end accuracy improvements can accompany decreased step-level scores, indicating functionally correct but non-human trajectories.
- GeoEvolver outperforms memory-based methods and fixed-workflow MAS on Earth-Agent benchmarks (e.g., 76.56% accuracy vs. 15.87% for Earth-Agent-MAS).
- Ablations show self-contrast and parallel exploration contribute the largest gains, with notable drops when removed.
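The gains above are quoted in percentage points (pp), which is an absolute difference between two accuracies and not a relative improvement. A minimal sketch of the distinction, using only the Qwen3-32B figures quoted above (the helper name `gain` is hypothetical):

```python
def gain(before_pct: float, after_pct: float) -> tuple:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    pp = after_pct - before_pct
    relative = pp / before_pct * 100
    return round(pp, 2), round(relative, 2)


# Qwen3-32B figures from the findings above: 24.80% -> 46.96%
pp, relative = gain(24.80, 46.96)
print(f"+{pp} pp absolute, +{relative}% relative")
```

The same change therefore reads as +22.16 pp in absolute terms and roughly +89% relative to the baseline, so it matters which of the two a reported gain refers to.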
This explainer was generated by AI and reviewed by a human editor.