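[Paper Walkthrough] Towards Playing Full MOBA Games with Deep Reinforcement Learning
This paper proposes a MOBA AI framework that combines curriculum self-play, policy distillation, off-policy adaptation, multi-head value estimation, and Monte Carlo tree search to play full MOBA games with a large hero pool (up to 40 heroes), and it competes against top esports opponents in Honor of Kings.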
MOBA games, e.g., Honor of Kings, League of Legends, and Dota 2, pose grand challenges to AI systems such as multi-agent, enormous state-action space, complex action control, etc. Developing AI for playing MOBA games has raised much attention accordingly. However, existing work falls short in handling the raw game complexity caused by the explosion of agent combinations, i.e., lineups, when expanding the hero pool in case that OpenAI's Dota AI limits the play to a pool of only 17 heroes. As a result, full MOBA games without restrictions are far from being mastered by any existing AI system. In this paper, we propose a MOBA AI learning paradigm that methodologically enables playing full MOBA games with deep reinforcement learning. Specifically, we develop a combination of novel and existing learning techniques, including curriculum self-play learning, policy distillation, off-policy adaption, multi-head value estimation, and Monte-Carlo tree-search, in training and playing a large pool of heroes, meanwhile addressing the scalability issue skillfully. Tested on Honor of Kings, a popular MOBA game, we show how to build superhuman AI agents that can defeat top esports players. The superiority of our AI is demonstrated by the first large-scale performance test of MOBA AI agent in the literature.
Research Motivation and Goals
- Address the scalability of learning to play full MOBA games with a large hero pool.
- Develop a unified actor-critic architecture capable of representing multiple heroes.
- Mitigate non-stationarity and combinatorial action space in multi-agent MOBA settings.
- Introduce curriculum-based self-play and policy distillation to stabilize and accelerate learning.
- Enable efficient drafting (hero selection) using MCTS and learned value predictors.
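The scalability concern behind these goals can be made concrete by counting lineups. The sketch below assumes a lineup means picking 5 heroes for one team and then 5 for the opposing team from the remaining heroes, with no hero repeated; `lineup_count` is a hypothetical helper name, and the counting convention is an assumption rather than something stated in this summary.

```python
from math import comb


def lineup_count(pool_size: int, team_size: int = 5) -> int:
    """Count lineups under the assumed convention: choose one team,
    then choose the opposing team from the heroes that are left."""
    first_team = comb(pool_size, team_size)
    second_team = comb(pool_size - team_size, team_size)
    return first_team * second_team


if __name__ == "__main__":
    for pool in (17, 40):
        # 17 heroes -> 4,900,896 lineups; 40 heroes -> 213,610,453,056 lineups
        print(f"{pool} heroes: {lineup_count(pool):,} lineups")
```

Under this convention, moving from a 17-hero pool to a 40-hero pool multiplies the number of lineups by about 43,600, which is why training directly on the full pool is hard.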
Proposed Method
- Use an actor-critic network with hierarchical action heads and masks to handle MOBA's combinatorial actions.
- Apply off-policy Dual-clip PPO to stabilize learning from replayed experiences.
- Incorporate multi-head value estimation by decomposing rewards into five heads (Farming, KDA, Damage, Pushing, Win/Lose).
- Implement curriculum self-play learning (CSPL) in three phases: teacher training on fixed lineups, multi-teacher distillation into a single student, and continued learning of the merged model.
- Perform policy distillation where a student model learns from multiple fixed-lineup teacher models.
- Develop an MCTS-based drafting agent with a value network and a win-rate predictor to select heroes under a large pool.
- Adopt a distributed Actor-Learner infrastructure for scalable training with off-policy data.
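The Dual-clip PPO step in the list above can be sketched per sample as follows. This is a minimal illustration of the general dual-clip idea, not the authors' implementation: the function name is hypothetical, and the defaults `eps=0.2` and `c=3.0` are illustrative assumptions. Standard PPO clips the probability ratio; dual-clip adds a lower bound of `c * advantage` when the advantage is negative, so that a very large ratio from stale off-policy data cannot produce an unbounded penalty.

```python
def dual_clip_ppo_objective(ratio: float, advantage: float,
                            eps: float = 0.2, c: float = 3.0) -> float:
    """Per-sample dual-clip PPO surrogate (to be maximized).

    ratio     -- new-policy probability / old-policy probability
    advantage -- advantage estimate for the sampled action
    eps, c    -- clip range and dual-clip bound (assumed values, c > 1)
    """
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    # Standard PPO clipped surrogate.
    ppo = min(ratio * advantage, clipped_ratio * advantage)
    if advantage < 0:
        # Dual clip: bound the objective from below by c * advantage.
        return max(ppo, c * advantage)
    return ppo


if __name__ == "__main__":
    print(dual_clip_ppo_objective(10.0, -1.0))  # bounded at c * advantage
    print(dual_clip_ppo_objective(10.0, 1.0))   # ordinary PPO upper clip
```

In a training loop the value would be averaged over a batch and negated to form a loss; that part is omitted here.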
Experimental Results
Research Questions
- RQ1: Can a MOBA AI learn to play with a large pool of heroes (up to 40) without performance collapse?
- RQ2: How can curriculum self-play and distillation stabilize and accelerate multi-agent RL in MOBA?
- RQ3: Does a multi-head value architecture improve value estimation in MOBA settings?
- RQ4: Is MCTS-based drafting feasible and effective for large hero pools in MOBA?
- RQ5: What is the empirical performance of the proposed MOBA AI against professional players and humans?
Key Findings
- AI trained on a 40-hero pool defeated professional esports players with a 95.2% win-rate over 42 matches.
- AI achieved a 97.7% win-rate against top human players across 642,047 matches.
- CSPL improves scalability: 40-hero CSPL converges in ~336 hours vs. >480 hours for the baseline.
- Ablations show benefits from multi-head value estimation, off-policy adaptation, and CSPL.
- MCTS-based drafting outperformed random and win-rate-based drafting strategies.
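The multi-head value estimation credited in the ablations splits the reward into five groups (Farming, KDA, Damage, Pushing, Win/Lose). The sketch below shows only the bookkeeping side of that idea, under stated assumptions: rewards arrive as one dictionary per step keyed by head name, each head gets its own discounted return as a regression target, and `gamma=0.99` is an illustrative value. All names are hypothetical, and how the paper weights or combines the heads is not covered here.

```python
HEADS = ("farming", "kda", "damage", "pushing", "win_lose")


def discounted_return(rewards, gamma):
    """Discounted sum of a reward sequence, computed back to front."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def per_head_returns(reward_steps, gamma=0.99):
    """One return target per value head; a missing key counts as zero reward."""
    return {
        head: discounted_return([step.get(head, 0.0) for step in reward_steps], gamma)
        for head in HEADS
    }


if __name__ == "__main__":
    # Made-up toy episode, for illustration only.
    episode = [
        {"farming": 1.0},
        {"damage": 0.5, "kda": 1.0},
        {"pushing": 2.0, "win_lose": 10.0},
    ]
    targets = per_head_returns(episode)
    print(targets)
    # Discounting is linear, so the head targets sum to the return of the
    # combined reward stream (up to floating-point rounding).
    print(sum(targets.values()))
```

Because the per-head targets add up to the return of the summed reward, the decomposition does not change what is being estimated; it gives each head a narrower signal to fit.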
This walkthrough was generated by AI and reviewed by a human editor.