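[Paper Explainer] Complementary Reinforcement Learning
Complementary RL co-evolves a policy actor and an experience extractor so that distilled experience is used efficiently, yielding significant gains in single-task settings and robust gains in multi-task settings. It uses an asynchronous framework with a centralized MemoryManager to maintain and query an evolving experience bank.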
Reinforcement Learning (RL) has emerged as a powerful paradigm for training LLM-based agents, yet remains limited by low sample efficiency, stemming not only from sparse outcome feedback but also from the agent's inability to leverage prior experience across episodes. While augmenting agents with historical experience offers a promising remedy, existing approaches suffer from a critical weakness: the experience distilled from history is either stored statically or fails to co-evolve with the improving actor, causing a progressive misalignment between the experience and the actor's evolving capability that diminishes its utility over the course of training. Inspired by complementary learning systems in neuroscience, we present Complementary RL to achieve seamless co-evolution of an experience extractor and a policy actor within the RL optimization loop. Specifically, the actor is optimized via sparse outcome-based rewards, while the experience extractor is optimized according to whether its distilled experiences demonstrably contribute to the actor's success, thereby evolving its experience management strategy in lockstep with the actor's growing capabilities. Empirically, Complementary RL outperforms outcome-based agentic RL baselines that do not learn from experience, achieving a 10% performance improvement in single-task scenarios and exhibiting robust scalability in multi-task settings. These results establish Complementary RL as a paradigm for efficient experience-driven agent learning.
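Motivation and Goals
- Address the sample-efficiency problem of RL for LLM-based agents by making use of past experience.
- Achieve closed-loop co-evolution between the policy actor and the experience extractor.
- Maintain and distill a dynamic experience bank that stays aligned with the actor's evolving capability.
- Design an asynchronous training framework that scales experience management without blocking actor updates.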
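Proposed Method
- Formalize an actor π_theta and an experience extractor π_phi that share an experience bank M.
- Distill experience m from trajectories with π_phi, assign it a binary reward based on its contribution to success, and optimize π_phi with the CISPO objective.
- Train the actor π_theta with the GRPO objective on outcome-based rewards, using a split-advantage scheme that separates experience-guided rollouts from experience-free rollouts.
- Implement a fully asynchronous training framework in which a centralized ExperienceManager handles experience consolidation, retrieval, and the co-evolution of π_theta and π_phi.
- Add a Merge operation to reduce redundancy in M, and a search_and_ask tool that improves targeted retrieval at decision points.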
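The split-advantage idea and the extractor's binary reward can be illustrated with a minimal sketch. This is one plausible reading of the summary above, not the paper's implementation: the function names `group_advantages`, `split_advantages`, and `extractor_reward` are hypothetical, and the sketch assumes that "split" means normalizing rewards separately within the experience-guided and experience-free subgroups of a rollout group, and that the extractor is rewarded when a rollout that used its experience succeeds.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """GRPO-style group-relative advantages: (r - mean) / (std + eps)."""
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def split_advantages(rollouts, eps=1e-6):
    """Normalize experience-guided and experience-free rollouts separately.

    `rollouts` is a list of (reward, used_experience) pairs for one task.
    Advantages are returned in the same order as the input.
    Assumption: each subgroup gets its own mean/std baseline.
    """
    advantages = [0.0] * len(rollouts)
    for flag in (True, False):
        idx = [i for i, (_, used) in enumerate(rollouts) if used == flag]
        sub = group_advantages([rollouts[i][0] for i in idx], eps)
        for i, adv in zip(idx, sub):
            advantages[i] = adv
    return advantages


def extractor_reward(actor_succeeded_with_experience: bool) -> float:
    """Binary reward for the extractor (assumed: 1 if the guided rollout succeeded)."""
    return 1.0 if actor_succeeded_with_experience else 0.0


# Two guided rollouts (one success) and two experience-free rollouts (both fail).
rollouts = [(1.0, True), (0.0, True), (0.0, False), (0.0, False)]
split = split_advantages(rollouts)
pooled = group_advantages([r for r, _ in rollouts])
```

In this toy group, pooled normalization gives the experience-free rollouts a negative advantage simply because a guided rollout succeeded, whereas the split version compares each rollout only against rollouts of the same kind.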
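Experimental Results
Research Questions
- RQ1: Can a co-evolving actor and experience extractor surpass static or offline experience baselines in learning efficiency?
- RQ2: How should the evolving experience extractor be designed so that it keeps pace with the actor's growing capability?
- RQ3: Can an asynchronous, centralized training framework scale experience management while maintaining throughput?
- RQ4: How do co-evolution and experience-based retrieval affect single-task and multi-task performance?
Key Findings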
| Method | MiniHack Room | WebShop | ALFWorld | Avg. |
|---|---|---|---|---|
| Baseline | 0.68 | 0.81 | 0.72 | 0.75 |
| Static Online Exp. (eval w/ exp.) | 0.41 | 0.67 | 0.69 | 0.59 |
| Static Online Exp. (eval w/o exp.) | 0.39 | 0.59 | 0.64 | 0.54 |
| Exp. Only | 0.49 | 0.37 | 0.13 | 0.33 |
| Comp. RL (eval w/ exp.) | 0.78 | 0.87 | 0.82 | 0.82 |
| Comp. RL (eval w/o exp.) | 0.75 | 0.84 | 0.74 | 0.78 |
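- Complementary RL consistently outperforms the baseline across four single-task environments, with a gain of about 10% in single-task scenarios.
- In multi-task settings, Complementary RL scales robustly, with stable learning dynamics.
- Test-time retrieval from the experience bank helps performance, but purely static online experience falls behind the baseline because of misalignment.
- A larger experience extractor adds roughly another 5% on average across all tasks.
- Self-distillation can yield early gains, but it may collapse later in training if not handled carefully.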
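As a quick check on the size of the gains, the snippet below computes absolute and relative differences against the baseline from the Avg. column of the results table. It is only arithmetic on the numbers shown here; the helper name `gain_over_baseline` is hypothetical, and this summary does not state whether the reported "about 10%" is an absolute or a relative figure.

```python
# Avg. column values copied from the results table above.
AVG = {
    "Baseline": 0.75,
    "Static Online Exp. (eval w/ exp.)": 0.59,
    "Comp. RL (eval w/ exp.)": 0.82,
    "Comp. RL (eval w/o exp.)": 0.78,
}


def gain_over_baseline(method: str) -> tuple[float, float]:
    """Return (absolute, relative) gain of `method` over the baseline average."""
    base = AVG["Baseline"]
    absolute = AVG[method] - base
    return absolute, absolute / base


for name in AVG:
    if name != "Baseline":
        absolute, relative = gain_over_baseline(name)
        print(f"{name}: {absolute:+.2f} absolute, {relative:+.1%} relative")
```

On these averages, Complementary RL evaluated with experience is about 0.07 above the baseline (roughly 9% relative), and it remains above the baseline even when evaluated without experience.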
This explainer was generated by AI and reviewed by a human editor.