QUICK REVIEW

[Paper Review] Counterfactual Conditional Likelihood Rewards for Multiagent Exploration

Ayhan Alp Aydeniz, Robert Loftin | arXiv (Cornell University) | Feb 12, 2026

Reinforcement Learning in Robotics | Cited by: 0

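One-Sentence Summary

The paper proposes Counterfactual Conditional Likelihood (CCL) rewards, which measure and increase each agent's unique contribution to joint exploration in cooperative multiagent environments with sparse rewards, improving coordination and learning efficiency.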

ABSTRACT

Efficient exploration is critical for multiagent systems to discover coordinated strategies, particularly in open-ended domains such as search and rescue or planetary surveying. However, when exploration is encouraged only at the individual agent level, it often leads to redundancy, as agents act without awareness of how their teammates are exploring. In this work, we introduce Counterfactual Conditional Likelihood (CCL) rewards, which score each agent's exploration by isolating its unique contribution to team exploration. Unlike prior methods that reward agents solely for the novelty of their individual observations, CCL emphasizes observations that are informative with respect to the joint exploration of the team. Experiments in continuous multiagent domains show that CCL rewards accelerate learning for domains with sparse team rewards, where most joint actions yield zero rewards, and are particularly effective in tasks that require tight coordination among agents.

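Motivation and Objectives

Promote coordinated exploration in multiagent systems with sparse team rewards.
Isolate each agent's marginal contribution to joint exploration rather than rewarding local observations alone.
Avoid redundant exploration by focusing on observations that are informative about the team's joint coverage of the state space.
Achieve scalable estimation through random local encoders and counterfactual conditioning.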

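Proposed Method

Define the CCL reward as the difference between the log-likelihoods of the actual observation and a counterfactual observation, each conditioned on the other agents.
Encode each agent's observation with a fixed random encoder, and form a joint embedding from these local embeddings.
Estimate likelihoods in the joint embedding space by nearest-neighbor density estimation, using a fixed radius for stability.
Compute the CCL reward from a digamma-based proxy for the conditional log-likelihood, and apply softplus-based shaping to improve stability.
Optionally combine CCL with local observation entropy maximization (OEM) through a mixture reward to balance joint and local exploration.
Train with MAPPO under the centralized training, decentralized execution (CTDE) framework, using LSTM-based agent architectures.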
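The estimation steps listed above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. The following are assumptions made for illustration: the encoder form (a random linear map with tanh), joint embeddings built by concatenation, the max-norm distance, the exact proxy psi(n_joint + 1) - psi(n_others + 1), the sign convention (a larger reward when the actual observation is less likely than the counterfactual one), and every function name. How the counterfactual observation is chosen is not stated in this summary, so the caller supplies it.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329


def digamma_int(n):
    """Digamma at a positive integer: psi(n) = -gamma + sum_{k=1}^{n-1} 1/k."""
    return -EULER_GAMMA + sum(1.0 / k for k in range(1, n))


def make_random_encoder(obs_dim, emb_dim, seed):
    """Fixed (never trained) random linear map followed by tanh. Assumed form."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(obs_dim)
    weights = [[rng.gauss(0.0, scale) for _ in range(obs_dim)]
               for _ in range(emb_dim)]

    def encode(obs):
        return [math.tanh(sum(w * x for w, x in zip(row, obs)))
                for row in weights]

    return encode


def count_within(query, points, radius):
    """Count points whose max-norm distance to `query` is at most `radius`."""
    return sum(1 for p in points
               if max(abs(a - b) for a, b in zip(query, p)) <= radius)


def cond_log_lik_proxy(own, others, buffer, radius):
    """Assumed proxy for log p(own | others) from fixed-radius neighbor counts.

    `buffer` is a list of past (own_embedding, others_embedding) pairs.
    The joint embedding is the concatenation of the two parts (assumption).
    """
    n_joint = count_within(own + others, [o + t for o, t in buffer], radius)
    n_others = count_within(others, [t for _, t in buffer], radius)
    return digamma_int(n_joint + 1) - digamma_int(n_others + 1)


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def ccl_reward(actual, counterfactual, others, buffer, radius):
    """Softplus-shaped log-likelihood difference (sign convention assumed)."""
    ll_actual = cond_log_lik_proxy(actual, others, buffer, radius)
    ll_counterfactual = cond_log_lik_proxy(counterfactual, others, buffer, radius)
    return softplus(ll_counterfactual - ll_actual)


# Toy usage: two agents with 2-D observations and separate fixed encoders.
encode_a = make_random_encoder(obs_dim=2, emb_dim=3, seed=0)
encode_b = make_random_encoder(obs_dim=2, emb_dim=3, seed=1)
# Past experience: agent A stayed at the origin while agent B sat at (1, 1).
history = [(encode_a([0.0, 0.0]), encode_b([1.0, 1.0])) for _ in range(5)]
reward = ccl_reward(
    actual=encode_a([3.0, -3.0]),         # where agent A actually went
    counterfactual=encode_a([0.0, 0.0]),  # hypothetical alternative observation
    others=encode_b([1.0, 1.0]),
    buffer=history,
    radius=0.1,
)
print(reward)
```

Because the counts are integers, the digamma function reduces to a shifted harmonic sum, so no special-function library is needed. With a max-norm radius the joint count can never exceed the teammates-only count, which keeps the proxy non-positive.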

Figure 1: Heat maps of agent trajectories in the multi-rover domain for coupling factor 5 with 2 POIs and 10 agents (Figure 4). Maps show how agents under different exploration strategies (CCL, Mixture, and Local Entropy) distribute their movements in the environment. CCL encourages more coordinate

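Experimental Results

Research Questions

RQ1: Compared with local OEM, do CCL rewards improve exploration efficiency in sparse-reward multiagent tasks?
RQ2: Does CCL improve coordination quality by reducing redundant exploration and promoting complementary behaviors?
RQ3: Does combining CCL with local OEM through a mixture reward bring additional gains?
RQ4: How robust is CCL to changes in task complexity, number of agents, and reward sparsity?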

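Key Findings

Compared with local OEM, CCL markedly improves exploration in the sparse-reward multi-rover domain.
CCL yields more coordinated and complementary agent trajectories, along with higher team rewards.
Mixture rewards achieve faster early convergence and higher peak performance in simpler settings, but the gains diminish on harder coordination tasks.
CCL generalizes across domains, including adversarial particle environments, and is robust to changes in the number of agents and in coupling requirements.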

Figure 2: Comparison of exploration strategies in the multi-rover domain across different coupling factors 3, 4, and 5, with teams of 6, 8, and 10 agents, respectively. The environment has two distantly placed POIs. Results show that CCL improves coordinated behaviors and achieves higher performance

This review was generated by AI and reviewed by human editors.