QUICK REVIEW

[Paper Review] Counterfactual Conditional Likelihood Rewards for Multiagent Exploration

Ayhan Alp Aydeniz, Robert Loftin|arXiv (Cornell University)|2026. 02. 12.

Reinforcement Learning in Robotics | Citations: 0

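One-Line Summary

This paper introduces Counterfactual Conditional Likelihood (CCL) rewards, which measure and encourage each agent's unique contribution to joint exploration in cooperative multiagent environments with sparse rewards, improving coordination and learning efficiency.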

ABSTRACT

Efficient exploration is critical for multiagent systems to discover coordinated strategies, particularly in open-ended domains such as search and rescue or planetary surveying. However, when exploration is encouraged only at the individual agent level, it often leads to redundancy, as agents act without awareness of how their teammates are exploring. In this work, we introduce Counterfactual Conditional Likelihood (CCL) rewards, which score each agent's exploration by isolating its unique contribution to team exploration. Unlike prior methods that reward agents solely for the novelty of their individual observations, CCL emphasizes observations that are informative with respect to the joint exploration of the team. Experiments in continuous multiagent domains show that CCL rewards accelerate learning for domains with sparse team rewards, where most joint actions yield zero rewards, and are particularly effective in tasks that require tight coordination among agents.

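Motivation and Objectives

Promote coordinated exploration in multiagent systems with sparse team rewards.
Isolate each agent's marginal contribution to joint exploration instead of rewarding only the novelty of its local observations.
Avoid redundant exploration by focusing on observations that are informative about the team's overall coverage of the state space.
Enable scalable estimation by using random local encoders and counterfactual conditioning.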

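Proposed Method

Define the Counterfactual Conditional Likelihood (CCL) reward as the difference in log-likelihood between an agent's actual observation and a counterfactual observation, each conditioned on the other agents' observations.
Embed each agent's observation with a fixed random encoder and form a joint embedding from these local embeddings.
Estimate likelihoods with k-NN density estimation in the embedded joint space, using a shared radius for stability.
Compute the CCL reward from a digamma-based surrogate of the conditional log-likelihood and apply Softplus-based shaping for stability.
Optionally combine CCL with local observation entropy maximization (OEM) in a mixture reward to balance joint and local exploration.
Train with MAPPO under CTDE (centralized training with decentralized execution), using an LSTM-based architecture for the agents.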
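
The estimation steps listed above can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the tanh random projection, the KSG-style form of the surrogate, the sign of the reward, the way Softplus is applied, and the convex mixing weight are all hypothetical choices made here to make the idea concrete. The counterfactual observation is supplied by the caller because this review does not say how the paper constructs it.

```python
import math
import random


def make_random_encoder(obs_dim, embed_dim, seed=0):
    """Return a frozen random projection with tanh squashing (never trained)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(obs_dim)
    weights = [[rng.gauss(0.0, scale) for _ in range(obs_dim)]
               for _ in range(embed_dim)]

    def encode(obs):
        return [math.tanh(sum(w * o for w, o in zip(row, obs))) for row in weights]

    return encode


def digamma(x):
    """Digamma for x > 0 via the recurrence plus an asymptotic series."""
    shift = 0.0
    while x < 6.0:
        shift -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return shift + math.log(x) - 0.5 / x - f * (1 / 12 - f * (1 / 120 - f / 252))


def _chebyshev(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


def _flatten(joint, skip=None):
    """Concatenate per-agent embeddings, optionally leaving one agent out."""
    return [v for j, z in enumerate(joint) if j != skip for v in z]


def conditional_loglik(query, buffer, agent, k=3):
    """Assumed KSG-style surrogate for log p(z_agent | z_others), up to a constant.

    Needs at least two agents and at least k stored joint samples.
    """
    q_joint, q_rest = _flatten(query), _flatten(query, skip=agent)
    # Shared radius: distance to the k-th nearest stored sample in joint space.
    radius = sorted(_chebyshev(q_joint, _flatten(s)) for s in buffer)[k - 1]
    # Count stored samples inside the same radius in the teammates' subspace.
    n_rest = sum(_chebyshev(q_rest, _flatten(s, skip=agent)) <= radius for s in buffer)
    dim = len(query[agent])
    return digamma(k) - digamma(n_rest) - dim * math.log(2.0 * radius + 1e-12)


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def ccl_reward(actual, counterfactual, buffer, agent, k=3):
    """Assumed sign: pay more when the actual observation is rarer than the
    counterfactual one, given what the teammates observed."""
    gap = (conditional_loglik(counterfactual, buffer, agent, k)
           - conditional_loglik(actual, buffer, agent, k))
    return softplus(gap)


def mixture_reward(ccl, local_bonus, weight=0.5):
    """Hypothetical convex mix of the CCL term and a local OEM-style bonus."""
    return weight * ccl + (1.0 - weight) * local_bonus


# Toy usage: two agents with 1-D embeddings; stored samples cluster near the origin.
buffer = [[[0.0], [0.0]], [[0.1], [0.0]], [[0.0], [0.1]], [[0.1], [0.1]], [[0.05], [0.05]]]
novel = [[0.9], [0.05]]      # agent 0 far from anything stored
familiar = [[0.05], [0.05]]  # agent 0 inside the well-visited region
reward_for_novel = ccl_reward(novel, familiar, buffer, agent=0)
reward_for_familiar = ccl_reward(familiar, novel, buffer, agent=0)
```

In this toy setup the reward is larger when agent 0's actual embedding lies in a region the team has not yet covered, which is the qualitative behavior the method description calls for. Reusing the joint-space radius for the teammates' subspace is what "shared radius" is taken to mean here.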

Figure 1: Heat maps of agent trajectories in the multi-rover domain for coupling factor 5 with 2 POIs and 10 agents (Figure 4). Maps show how agents under different exploration strategies (CCL, Mixture, and Local Entropy) distribute their movements in the environment. CCL encourages more coordinate

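Experimental Results

Research Questions

RQ1: Do CCL rewards improve exploration efficiency over local OEM in sparse-reward multiagent tasks?
RQ2: Does CCL improve coordination quality by reducing redundant exploration and encouraging complementary behaviors?
RQ3: Does combining CCL with local OEM through a mixture reward provide additional benefits?
RQ4: How robust is CCL to changes in task difficulty, number of agents, and reward sparsity?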

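Key Results

Compared with local OEM, CCL substantially improves exploration in the sparse-reward multi-rover domain.
CCL leads to more coordinated, complementary agent trajectories and higher team rewards.
The mixture reward yields faster initial convergence and higher peak performance in simpler settings, but its benefit diminishes in tasks that require tighter coordination.
CCL generalizes across domains, including an adversarial particle environment, and is robust to changes in the number of agents and coupling requirements.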

Figure 2: Comparison of exploration strategies in the multi-rover domain across different coupling factors 3, 4, and 5, with teams of 6, 8, and 10 agents, respectively. The environment has two distantly placed POIs. Results show that CCL improves coordinated behaviors and achieve higher performance


This review was generated by AI and reviewed by a human editor.