QUICK REVIEW

[Paper Digest] Thinking Fast and Slow with Deep Learning and Tree Search

Thomas Anthony, Zheng Tian | arXiv (Cornell University) | May 23, 2017

Artificial Intelligence in Games | References: 15 | Cited by: 139

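One-Sentence Summary

Expert Iteration (ExIt) separates planning from generalisation: a tree search acts as the expert that trains a neural network apprentice, and the apprentice in turn guides the search to improve future plans. On Hex, the resulting agent defeats MoHex 1.0.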

ABSTRACT

Sequential decision making problems, such as structured prediction, robotic control, and game playing, require a combination of planning policies and generalisation of those plans. In this paper, we present Expert Iteration (ExIt), a novel reinforcement learning algorithm which decomposes the problem into separate planning and generalisation tasks. Planning new policies is performed by tree search, while a deep neural network generalises those plans. Subsequently, tree search is improved by using the neural network policy to guide search, increasing the strength of new plans. In contrast, standard deep Reinforcement Learning algorithms rely on a neural network not only to generalise plans, but to discover them too. We show that ExIt outperforms REINFORCE for training a neural network to play the board game Hex, and our final tree search agent, trained tabula rasa, defeats MoHex 1.0, the most recent Olympiad Champion player to be publicly released.

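Motivation and Objectives

Address sequential decision-making problems by combining planning with function approximation.
Propose Expert Iteration (ExIt), which separates expert planning from apprentice generalisation.
Show that a neural network guided by planning improves both search efficiency and learning efficiency.
Demonstrate ExIt on Hex and compare it with REINFORCE and MoHex to establish its competitiveness.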

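Proposed Method

Define Expert Iteration (ExIt) as an iterative loop: self-play collects states, imitation learning trains the apprentice, and apprentice-guided tree search improves the expert.
Treat the expert as a tree search algorithm and the apprentice as a deep neural network policy (optionally with a value network).
Train the apprentice on the expert's moves with one of two imitation-learning targets: chosen-action targets (CAT) or tree-policy targets (TPT); TPT is cost-sensitive.
Use online dataset aggregation (DAgger-like) to improve data efficiency and reduce recomputation.
Bias the tree search with the apprentice policy through a modified UCT formula that adds a bonus term.
Extend the framework with a value network that estimates leaf values, which are mixed with rollout results.
Present online and distributed ExIt, and compare the batch and online variants with REINFORCE and MoHex.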
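
The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the callables `sample_states`, `expert` and `fit`, and the constants `c_b` and `w_a` are hypothetical, and the exact form of the UCT bonus (apprentice prior divided by visit count plus one) is one plausible reading of "a modified UCT formula with a bonus term".

```python
import math

def cat_loss(policy, chosen_action):
    """Chosen-action target: negative log-likelihood of the expert's selected move."""
    return -math.log(policy[chosen_action])

def tpt_loss(policy, visit_counts):
    """Tree-policy target: cross-entropy against the expert's normalised visit counts."""
    total = sum(visit_counts.values())
    return -sum(n / total * math.log(policy[a])
                for a, n in visit_counts.items() if n > 0)

def uct_with_policy_bonus(total_reward, n_sa, n_s, prior, c_b=1.0, w_a=1.0):
    """Standard UCT score plus an apprentice-policy bonus that decays with visits."""
    if n_sa == 0:
        return math.inf  # assumption: every unvisited action is tried once first
    exploit = total_reward / n_sa
    explore = c_b * math.sqrt(math.log(n_s) / n_sa)
    bonus = w_a * prior / (n_sa + 1)  # assumed form of the bonus term
    return exploit + explore + bonus

def expert_iteration(sample_states, expert, fit, apprentice, n_iterations, aggregate=True):
    """Generic ExIt loop; aggregate=True keeps all past data (DAgger-like)."""
    dataset = []
    for _ in range(n_iterations):
        states = sample_states(apprentice)                       # self-play
        labelled = [(s, expert(s, apprentice)) for s in states]  # apprentice-guided search
        dataset = dataset + labelled if aggregate else labelled
        apprentice = fit(dataset)                                # imitation learning
    return apprentice, dataset
```

Here `policy` is a dict mapping actions to probabilities and `visit_counts` maps actions to root visit counts. When the expert's visits are spread over several moves, `tpt_loss` penalises the apprentice in proportion to how much search effort each move received, which is the sense in which TPT is cost-sensitive, while `cat_loss` only looks at the single move played.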

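Experimental Results

Research Questions

RQ1: Can ExIt learn a stronger policy in Hex faster than a standard policy-gradient method such as REINFORCE?
RQ2: Does separating planning (expert) from generalisation (apprentice) improve learning efficiency and final performance?
RQ3: How do online (dataset aggregation) and batch ExIt differ in data efficiency and stability?
RQ4: How do policy-only and policy-and-value networks in ExIt affect search strength and performance against the strong baseline MoHex?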

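Key Findings

ExIt outperforms REINFORCE at training a neural network to play Hex.
The final ExIt agent, trained tabula rasa, defeats MoHex 1.0 in head-to-head play.
Tree-policy targets (TPT) produce a stronger apprentice than chosen-action targets (CAT) during imitation learning (a reported gain of about 50 ± 13 Elo after training on the initial dataset).
Online ExIt with DAgger-style data aggregation beats batch ExIt in both data efficiency and final strength.
Using the policy network to bias MCTS (neural network MCTS) raises the win rate substantially (for example, with a strong policy network it wins 97% of games against baseline MCTS).
Adding a value network to the apprentice significantly improves expert quality and yields stronger play than policy-only ExIt.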
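
To put the Elo and win-rate figures above on a common scale, the following sketch converts between the two. It assumes the standard logistic Elo model with a 400-point scale, which may differ from the rating procedure used in the paper; the function names are illustrative.

```python
import math

def expected_score(elo_diff):
    """Expected score of the stronger player under the logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def elo_diff_from_score(score):
    """Elo difference implied by an expected score strictly between 0 and 1."""
    return 400.0 * math.log10(score / (1.0 - score))

print(f"{expected_score(50):.3f}")
print(f"{elo_diff_from_score(0.97):.0f}")
```

Under this model, a 50 Elo gap corresponds to an expected score of roughly 57%, while a 97% win rate corresponds to a gap of roughly 600 Elo, so the two findings describe effects of very different magnitude.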


This digest was generated by AI and reviewed by a human editor.