QUICK REVIEW

[Paper Review] A Survey of Learning in Multiagent Environments: Dealing with Non-Stationarity

Pablo Hernández-Leal, Michael Kaisers|arXiv (Cornell University)|Jul 28, 2017

Advanced Bandit Algorithms Research | References: 156 | Cited by: 198

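One-sentence summary

This survey reviews how learning in multiagent environments copes with non-stationarity and introduces a five-category framework for classifying the approaches.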

ABSTRACT

The key challenge in multiagent learning is learning a best response to the behaviour of other agents, which may be non-stationary: if the other agents adapt their strategy as well, the learning target moves. Disparate streams of research have approached non-stationarity from several angles, which make a variety of implicit assumptions that make it hard to keep an overview of the state of the art and to validate the innovation and significance of new works. This survey presents a coherent overview of work that addresses opponent-induced non-stationarity with tools from game theory, reinforcement learning and multi-armed bandits. Further, we reflect on the principle approaches how algorithms model and cope with this non-stationarity, arriving at a new framework and five categories (in increasing order of sophistication): ignore, forget, respond to target models, learn models, and theory of mind. A wide range of state-of-the-art algorithms is classified into a taxonomy, using these categories and key characteristics of the environment (e.g., observability) and adaptation behaviour of the opponents (e.g., smooth, abrupt). To clarify even further we present illustrative variations of one domain, contrasting the strengths and limitations of each category. Finally, we discuss in which environments the different approaches yield most merit, and point to promising avenues of future research.

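Motivation and Objectives

Synthesize how opponent-induced non-stationarity is handled across multi-armed bandits, reinforcement learning, and game theory.
Propose a coherent framework for classifying how multiagent learning algorithms handle non-stationarity.
Classify state-of-the-art algorithms by characteristics of the environment and of the opponents' adaptation behaviour.
Discuss the strengths, limitations, and future research directions of non-stationary multiagent learning.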

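Proposed Approach

Review formal models from multi-armed bandits, reinforcement learning, and game theory to frame the problem of non-stationarity.
Propose a framework with five categories for handling non-stationarity: ignore, forget, respond to target models, learn models, theory of mind.
Illustrate each category with domain examples to highlight strengths and limitations.
Provide a taxonomy of algorithms organized by category and by environment and adaptation characteristics.
Discuss open problems and future research directions.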
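
To make the difference between the first two categories more concrete, here is a minimal Python sketch. It is not taken from the survey: the reward sequence, the function names, and the step size of 0.1 are all assumptions chosen for illustration. A plain sample average weights every past observation equally, which loosely corresponds to ignoring non-stationarity, while a constant step-size (recency-weighted) average lets old observations decay, which loosely corresponds to forgetting.

```python
def sample_average(rewards):
    """Running mean over all rewards; every past observation has equal weight."""
    estimate, n = 0.0, 0
    for r in rewards:
        n += 1
        estimate += (r - estimate) / n
    return estimate


def recency_weighted(rewards, step_size=0.1):
    """Constant step-size average; older observations decay geometrically."""
    estimate = 0.0
    for r in rewards:
        estimate += step_size * (r - estimate)
    return estimate


# Hypothetical abrupt change: an arm pays 1.0 for 100 rounds, then 0.0 for 100 rounds.
rewards = [1.0] * 100 + [0.0] * 100

ignore_estimate = sample_average(rewards)    # still dominated by the old regime
forget_estimate = recency_weighted(rewards)  # has tracked the new regime

print(f"sample average:   {ignore_estimate:.3f}")
print(f"recency weighted: {forget_estimate:.3f}")
```

Under these assumptions the sample average ends near 0.5, halfway between the two regimes, while the recency-weighted estimate ends close to 0, the current payoff. The step size trades responsiveness against noise sensitivity, so a value that suits an abrupt change may not suit a smooth one.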

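Results

Research Questions

RQ1: How does non-stationarity arise in multiagent learning across different fields (bandits, RL, game theory)?
RQ2: Which framework best captures the increasing sophistication of approaches to handling non-stationarity?
RQ3: Which algorithms align with which categories under different assumptions about observability and opponent adaptation?
RQ4: What are the key open problems and promising research directions in non-stationary multiagent learning?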

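Key Findings

Proposes a five-category framework for coping with non-stationarity: ignore, forget, respond to target opponents, learn opponent models, and theory of mind.
Provides a taxonomy that classifies state-of-the-art algorithms from MABs, RL, and game theory by category and by environment and opponent-adaptation characteristics.
Uses illustrative variations of one domain to contrast the strengths and limitations of each category.
Analyzes in which environments the different approaches yield the most merit and outlines promising directions for future research.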
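
As a rough picture of why the opponent's adaptation behaviour matters, the Python sketch below builds a simple frequency model of an opponent in rock-paper-scissors and best-responds to it. This is an illustration only, not an algorithm from the survey, and it makes no claim about which category the survey would assign it to. The opponent's move sequence, the window length of 5, and all function names are hypothetical.

```python
from collections import Counter, deque

ACTIONS = ("rock", "paper", "scissors")
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}  # key beats value


def payoff(mine, theirs):
    """+1 for a win, -1 for a loss, 0 for a tie."""
    if mine == theirs:
        return 0
    return 1 if BEATS[mine] == theirs else -1


def best_response(history):
    """Best response to the empirical frequencies of the observed opponent moves."""
    counts = Counter(history)
    total = sum(counts.values())
    if total == 0:
        return ACTIONS[0]  # arbitrary default before any observation

    def expected(mine):
        return sum(payoff(mine, a) * c / total for a, c in counts.items())

    return max(ACTIONS, key=expected)


# Hypothetical opponent: rock for 50 rounds, then an abrupt switch to paper.
opponent_moves = ["rock"] * 50 + ["paper"] * 10

full_history = list(opponent_moves)              # model built from every observation
recent_window = deque(opponent_moves, maxlen=5)  # model built from the last 5 moves only

print(best_response(full_history))
print(best_response(recent_window))
```

With this move sequence, the full-history model still sees mostly rock and answers with paper, which only ties against the opponent's new behaviour, whereas the windowed model sees only paper and answers with scissors. A short window reacts quickly to abrupt switches but would be noisier against an opponent that mixes its moves.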

This review was generated by AI and reviewed by a human editor.