QUICK REVIEW

[Paper Review] LEARNING TO SCHEDULE COMMUNICATION IN MULTI-AGENT REINFORCEMENT LEARNING

Daewoo Kim, Sangwoo Moon|arXiv (Cornell University)|2019. 02. 05.
Energy Harvesting in Wireless Networks | References: 29 | Citations: 59
One-line summary

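SchedNet trains a centralized critic together with distributed agents so that, under limited bandwidth and a shared communication medium, the agents learn when and how to communicate. This improves cooperative MARL performance over baselines that do not communicate or that use simple scheduling.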

ABSTRACT

Many real-world reinforcement learning tasks require multiple agents to make sequential decisions under the agents' interaction, where well-coordinated actions among the agents are crucial to achieve the target goal better at these tasks. One way to accelerate the coordination effect is to enable multiple agents to communicate with each other in a distributed manner and behave as a group. In this paper, we study a practical scenario when (i) the communication bandwidth is limited and (ii) the agents share the communication medium so that only a restricted number of agents are able to simultaneously use the medium, as in the state-of-the-art wireless networking standards. This calls for a certain form of communication scheduling. In that regard, we propose a multi-agent deep reinforcement learning framework, called SchedNet, in which agents learn how to schedule themselves, how to encode the messages, and how to select actions based on received messages. SchedNet is capable of deciding which agents should be entitled to broadcasting their (encoded) messages, by learning the importance of each agent's partially observed information. We evaluate SchedNet against multiple baselines under two different applications, namely, cooperative communication and navigation, and predator-prey. Our experiments show a non-negligible performance gap between SchedNet and other mechanisms such as the ones without communication and with vanilla scheduling methods, e.g., round robin, ranging from 32% to 43%.

Research motivation and objectives

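  • Address how to coordinate multiple agents that operate under partial observability and must communicate to achieve a common goal.
  • Handle practical constraints such as limited bandwidth and a shared communication medium that requires MAC-style scheduling.
  • Learn which agents should broadcast, how to encode messages, and how to select actions based on received messages.
  • Use centralized training with distributed execution to improve cooperative performance.

Method
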
  • Propose SchedNet, a deep MARL framework with three components per agent: a message encoder, an action selector, and a weight generator.
  • Introduce a weight-based scheduling algorithm (WSA) to select which K schedulable agents broadcast messages under limited bandwidth.
  • Use a centralized critic during training to estimate the state value V(s) and the value Q(s, w) of the scheduling weights w, which guide the actor updates.
  • Train weight generators with DDPG to optimize scheduling weights given observations.
  • Implement two WSA variants: Top(k) and Softmax(k), realizable in a distributed manner via CSMA-like mechanisms.
  • Adopt an integrated architecture where encoders, action selectors, and weight generators are trained jointly under a common critic.
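
The two WSA variants named above can be illustrated with a small sketch. It is not the authors' implementation: the function names `top_k_schedule` and `softmax_k_schedule` are hypothetical, the weights are made-up numbers, and Softmax(k) is read here as sampling k distinct agents with probability proportional to the softmax of their weights, which is an assumption. The selection also runs in one place for clarity, whereas the review notes that it can be realized in a distributed way via CSMA-like mechanisms.

```python
import math
import random


def top_k_schedule(weights, k):
    """Return the indices of the k agents with the largest weights.

    Ties are broken in favor of the lower agent index (an assumption).
    """
    order = sorted(range(len(weights)), key=lambda i: (-weights[i], i))
    return sorted(order[:k])


def softmax_k_schedule(weights, k, rng=None):
    """Sample k distinct agents, each draw proportional to exp(weight).

    Sampling without replacement is an assumed reading of Softmax(k).
    """
    rng = rng or random.Random()
    remaining = list(range(len(weights)))
    chosen = []
    for _ in range(min(k, len(remaining))):
        # Subtract the max weight before exponentiating for numerical stability.
        m = max(weights[i] for i in remaining)
        probs = [math.exp(weights[i] - m) for i in remaining]
        pick = rng.choices(remaining, weights=probs, k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return sorted(chosen)


# Hypothetical weights produced by four agents' weight generators.
weights = [0.2, 0.9, 0.5, 0.1]
print(top_k_schedule(weights, 2))  # [1, 2]
print(softmax_k_schedule(weights, 2, random.Random(0)))
```

Top(k) always schedules the same agents for the same weights, while Softmax(k) occasionally gives low-weight agents a turn, which is one way to read the deterministic versus probabilistic contrast in the findings.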

Research questions

  • Can intelligent, learned scheduling of inter-agent communications improve cooperative MARL under bandwidth and MAC constraints?
  • How should agents encode messages and allocate broadcast opportunities to maximize collective reward?
  • Does centralized training with distributed execution enable effective coordination with scheduled communications?
  • How do scheduling policies (Top(k) vs Softmax(k)) affect performance and learned communication strategies in MARL tasks?
  • What levels of performance gains exist over non-communicative baselines and simple scheduling schemes?

Key findings

  • SchedNet outperforms baselines that do not communicate (IDQN, COMA) and baselines that use simple scheduling (round robin).
  • In Predator-Prey, SchedNet with Top(1) yields up to 43% improvement over Round Robin scheduling.
  • In Cooperative Communication and Navigation, SchedNet significantly surpasses baselines, with Top(1) slightly outperforming Softmax(1).
  • The learned scheduling weights prioritize agents with greater observation horizons, demonstrating adaptive importance-based scheduling.
  • Messages from scheduled agents become more informative when the observed state includes exploitable information (e.g., prey location).
  • Deterministic Top(k) scheduling often provides larger gains than probabilistic Softmax(k) scheduling.


This review was generated by AI and reviewed by a human editor.