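[Paper Digest] The PokeAgent Challenge: Competitive and Long-Context Learning at Scale
The PokéAgent Challenge offers two complementary tracks, competitive battling and RPG speedrunning, to evaluate decision-making under partial observability and long-horizon planning. Backed by large-scale datasets, baselines, and a NeurIPS 2025 competition, it exposes the gaps between LLMs, RL agents, and human players.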
We present the PokeAgent Challenge, a large-scale benchmark for decision-making research built on Pokemon's multi-agent battle system and expansive role-playing game (RPG) environment. Partial observability, game-theoretic reasoning, and long-horizon planning remain open problems for frontier AI, yet few benchmarks stress all three simultaneously under realistic conditions. PokeAgent targets these limitations at scale through two complementary tracks: our Battling Track, which calls for strategic reasoning and generalization under partial observability in competitive Pokemon battles, and our Speedrunning Track, which requires long-horizon planning and sequential decision-making in the Pokemon RPG. Our Battling Track supplies a dataset of 20M+ battle trajectories alongside a suite of heuristic, RL, and LLM-based baselines capable of high-level competitive play. Our Speedrunning Track provides the first standardized evaluation framework for RPG speedrunning, including an open-source multi-agent orchestration system for modular, reproducible comparisons of harness-based LLM approaches. Our NeurIPS 2025 competition validates both the quality of our resources and the research community's interest in Pokemon, with over 100 teams competing across both tracks and winning solutions detailed in our paper. Participant submissions and our baselines reveal considerable gaps between generalist (LLM), specialist (RL), and elite human performance. Analysis against the BenchPress evaluation matrix shows that Pokemon battling is nearly orthogonal to standard LLM benchmarks, measuring capabilities not captured by existing suites and positioning Pokemon as an unsolved benchmark that can drive RL and LLM research forward. We transition to a living benchmark with a live leaderboard for Battling and self-contained evaluation for Speedrunning at https://pokeagentchallenge.com.
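Motivation and Goals
- Establish a standardized, scalable decision-making benchmark for dynamic, partially observable game environments.
- Provide large public datasets and baselines that allow fair comparison between RL, LLM, and hybrid approaches.
- Evaluate both competitive battling and long-horizon RPG speedrunning to identify the strengths and weaknesses of current AI paradigms.
- Maintain a sustainable living benchmark, with a persistent leaderboard and self-contained evaluation, so that long-term progress can be tracked.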
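Proposed Approach
- A two-track design that pairs competitive battling built on Pokémon Showdown with long-horizon RPG speedrunning in Pokémon Emerald.
- Public release of large datasets: 4M human demonstrations and 18M synthetic battle trajectories, plus 200K+ curated competitive teams.
- Baselines covering heuristic bots, RL agents, and harness-based LLM agents, along with an open-source multi-agent orchestration system for long-horizon RPG play.
- A NeurIPS 2025 competition that validated these resources, with 100+ teams and 100K+ battles, and exposed the gaps between generalist LLMs, specialist RL, and elite human players.
- Living-benchmark infrastructure consisting of a live battling leaderboard and self-contained speedrunning evaluation, with resources hosted in public repositories.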
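To make "partial observability" in the battling track concrete, here is a minimal sketch of an observation function that hides whatever the opponent has not yet revealed. The `PokemonState` class, the `observe_opponent` helper, and the example team are all hypothetical illustrations; they are not the benchmark's API or the Pokémon Showdown protocol.

```python
from dataclasses import dataclass, field


@dataclass
class PokemonState:
    """Full (ground-truth) state of one team member. Hypothetical structure."""
    species: str
    moves: tuple[str, ...]
    revealed_moves: set[str] = field(default_factory=set)
    seen: bool = False  # has this Pokemon appeared on the field yet?


def observe_opponent(team):
    """Return only the information the other player could legitimately know."""
    view = []
    for mon in team:
        if not mon.seen:
            # An unseen team member reveals nothing at all.
            view.append({"species": None, "moves": []})
        else:
            # A seen member exposes its species and only the moves it has used.
            view.append({"species": mon.species,
                         "moves": sorted(mon.revealed_moves)})
    return view


# Illustrative team: one member has battled and used a single move.
team = [
    PokemonState("Garchomp",
                 ("Earthquake", "Swords Dance", "Stone Edge", "Scale Shot"),
                 revealed_moves={"Earthquake"}, seen=True),
    PokemonState("Rotom-Wash",
                 ("Hydro Pump", "Volt Switch", "Will-O-Wisp", "Pain Split")),
]
view = observe_opponent(team)
print(view)
```

An agent that plans from `view` rather than `team` has to reason about the hidden moves and team members, which is the kind of inference the battling track is meant to stress.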
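Experimental Results
Research Questions
- RQ1: How do RL, LLM, and hybrid approaches differ in performance in high-stakes competitive battles under partial observability?
- RQ2: Can long-horizon RPG tasks be standardized to allow fair, reproducible evaluation across paradigms?
- RQ3: What gaps exist between frontier LLMs and specialist RL methods across the battling and speedrunning tracks?
- RQ4: How much potential do LLMs have for high-level planning, and how can RL refine those plans during real-time decision-making in complex environments?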
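Key Findings
- Specialist RL and search methods outperform generalist LLMs in both battling and speedrunning.
- In battling, raw frontier models make little meaningful progress without a harness; methods such as RL and MCTS dominate performance.
- The best speedrunning entry (Heatz) completed the route in 40:13 by distilling a scripted policy through imitation learning and refining it with RL, roughly twice as fast as the second-place entry.
- Harness-based LLM approaches can reach competitive planning performance, but they need extensive tooling and task decomposition; pure LLMs lag in both completion time and reliability.
- Pokémon battling is nearly orthogonal to standard LLM benchmarks, which suggests that strategic reasoning under partial observability is a distinct evaluation dimension.
- The benchmark surfaces LLM failure modes, such as panic behavior and loss of coherence over thousands of sequential decisions, that are not apparent in conventional benchmarks.
- The competition brought together more than 100 teams and over 650 researchers, with more than 100K battles played and broad community participation.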
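One way to read the "nearly orthogonal" finding is that model rankings on Pokémon battling are almost uncorrelated with rankings on standard benchmarks. The sketch below illustrates that idea with a hand-written Pearson correlation. The score lists are invented toy numbers, not results from the paper, and the paper's actual analysis against the BenchPress matrix may use a different method.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Toy scores for four hypothetical models (made up for illustration only).
standard_a = [1, 2, 3, 4]      # a standard benchmark
standard_b = [2, 4, 6, 8]      # a second benchmark that ranks models the same way
battling = [60, 40, 40, 60]    # battling scores that follow neither ranking

r_standard = pearson(standard_a, standard_b)
r_battling = pearson(standard_a, battling)
print(r_standard, r_battling)
```

In this toy setup the two standard benchmarks are perfectly correlated while the battling scores have zero correlation with them, so knowing a model's standard-benchmark score says nothing about its battling score.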
This digest was generated by AI and reviewed by a human editor.