QUICK REVIEW

[Paper Review] Soft Actor-Critic for Discrete Action Settings

Petros Christodoulou|arXiv (Cornell University)|Oct 16, 2019

Reinforcement Learning in Robotics | References: 12 | Cited by: 209

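One-Sentence Summary

This paper derives a version of SAC for discrete action spaces (SAC-Discrete) and shows on Atari games that it is competitive in sample efficiency with the state-of-the-art Rainbow, without any hyperparameter tuning.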

ABSTRACT

Soft Actor-Critic is a state-of-the-art reinforcement learning algorithm for continuous action settings that is not applicable to discrete action settings. Many important settings involve discrete actions, however, and so here we derive an alternative version of the Soft Actor-Critic algorithm that is applicable to discrete action settings. We then show that, even without any hyperparameter tuning, it is competitive with the tuned model-free state-of-the-art on a selection of games from the Atari suite.

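Motivation and Objectives

Motivation: SAC performs well in continuous-action reinforcement learning, but it is not directly applicable to discrete action settings.
Derive a discrete-action variant of SAC by adapting the value function, policy, and temperature updates.
Demonstrate the efficiency of SAC-Discrete on Atari games and compare it with Rainbow, without hyperparameter tuning beyond values taken from prior work.
Provide an open-source implementation of SAC-Discrete.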

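Proposed Method

Adapt the soft Q-function to discrete actions by outputting a Q-value for every action: Q: S -> R^{|A|}.
Replace the policy output with a softmax that directly gives a distribution over the actions in A: pi: S -> [0,1]^{|A|}.
Compute V(s) and the temperature loss as exact expectations: V(s) = pi(s)^{T}[Q(s) - alpha log pi(s)], and J(alpha) = pi(s)^{T}[-alpha(log pi(s) + H)], where H is the target entropy.
Drop the reparameterization trick, since actions are discrete and the expectation can be computed exactly.
Use two soft Q-networks and take their minimum to mitigate overestimation.
Provide Algorithm 1 (SAC-Discrete), detailing the updates for the Q-function, policy, and temperature.
Report the Atari hyperparameters and experimental setup (no tuning beyond the values given in prior work).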
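The quantities in the list above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the paper's released implementation: the function names, the batch layout (one row per state, one column per action), and the example numbers are all assumptions. The policy objective and the TD target are not spelled out in this summary; they follow the usual SAC form and should be read as assumptions as well. Because NumPy has no automatic differentiation, the code only evaluates the losses; actual training would need an autodiff framework.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last (action) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def soft_state_value(q1, q2, log_pi, alpha):
    """V(s) = pi(s)^T [min(Q1, Q2)(s) - alpha * log pi(s)], one value per state."""
    pi = np.exp(log_pi)
    q_min = np.minimum(q1, q2)  # element-wise minimum of the two Q heads
    return (pi * (q_min - alpha * log_pi)).sum(axis=-1)

def temperature_loss(log_pi, alpha, target_entropy):
    """J(alpha) = pi(s)^T [-alpha * (log pi(s) + H)], averaged over the batch."""
    pi = np.exp(log_pi)
    return (pi * (-alpha * (log_pi + target_entropy))).sum(axis=-1).mean()

def policy_loss(q1, q2, log_pi, alpha):
    """Assumed standard form: pi(s)^T [alpha * log pi(s) - min(Q1, Q2)(s)]."""
    pi = np.exp(log_pi)
    q_min = np.minimum(q1, q2)
    return (pi * (alpha * log_pi - q_min)).sum(axis=-1).mean()

def td_target(reward, done, next_value, gamma=0.99):
    """Assumed standard soft Bellman target: r + gamma * (1 - done) * V(s')."""
    return reward + gamma * (1.0 - done) * next_value

# Hypothetical batch: 2 states, 3 discrete actions.
logits = np.array([[1.0, 0.0, -1.0], [0.0, 0.0, 0.0]])
q1 = np.array([[1.0, 2.0, 3.0], [0.5, 0.5, 0.5]])
q2 = np.array([[1.5, 1.0, 2.0], [0.4, 0.6, 0.5]])
alpha = 0.2
log_pi = log_softmax(logits)

v = soft_state_value(q1, q2, log_pi, alpha)
print("V(s):", v)
print("policy loss:", policy_loss(q1, q2, log_pi, alpha))
print("temperature loss:", temperature_loss(log_pi, alpha, target_entropy=1.0))
print("TD target:", td_target(np.array([1.0, 0.0]), np.array([0.0, 1.0]), v))
```

Every expectation here is a plain weighted sum over the action axis, which is why no sampling or reparameterization is needed in the discrete case.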

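Experimental Results

Research Questions

RQ1: Can SAC be adapted effectively to discrete action spaces without sacrificing sample efficiency?
RQ2: How does SAC-Discrete's sample efficiency on Atari compare with a strong, tuned baseline (Rainbow)?
RQ3: What architectural and algorithmic changes are needed to keep learning low-variance and stable in discrete-action SAC?
RQ4: Does SAC-Discrete need hyperparameter tuning to match or outperform existing discrete-action algorithms?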

Key Findings

Game	Random	Rainbow	SAC
Freeway	0.0	0.1	4.4
MsPacman	235.2	364.3	690.9
Enduro	0.0	0.53	0.8
BattleZone	2895.0	3363.5	4386.7
Qbert	166.1	235.6	280.5
Space Invaders	148.0	135.1	160.8
Beam Rider	372.1	365.6	432.1
Assault	233.7	300.3	350.0
James Bond	29.2	61.7	68.3
Seaquest	61.1	206.3	211.6
Asterix	248.8	285.7	272.0
Kangaroo	42.0	38.7	29.3
Alien	184.8	290.6	216.9
Road Runner	0.0	524.1	305.3
Frostbite	74.0	140.1	59.4
Amidar	11.8	20.8	7.9
Crazy Climber	7339.5	12558.3	3668.7
Breakout	0.9	3.3	0.7
UpNDown	488.4	1346.3	250.7
Pong	-20.4	-19.5	-20.98

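SAC-Discrete achieves sample efficiency competitive with Rainbow on 20 Atari games, using five random seeds.
Across the 20 games, SAC-Discrete outperforms Rainbow in 10, with a median performance difference of -1%, ranging from a maximum of +4330% to a minimum of -99%.
SAC-Discrete reaches competitive results without relying on hyperparameter tuning.
The paper provides a public Python implementation (GitHub).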
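The win count quoted above can be checked directly against the score table. The short sketch below does only that; the variable and function names are illustrative, and the rows are copied from the table as (game, random, Rainbow, SAC). The percentage figures are not recomputed, because this summary does not state how the performance difference is normalized.

```python
# Rows copied from the score table: (game, random, rainbow, sac).
SCORES = [
    ("Freeway", 0.0, 0.1, 4.4),
    ("MsPacman", 235.2, 364.3, 690.9),
    ("Enduro", 0.0, 0.53, 0.8),
    ("BattleZone", 2895.0, 3363.5, 4386.7),
    ("Qbert", 166.1, 235.6, 280.5),
    ("Space Invaders", 148.0, 135.1, 160.8),
    ("Beam Rider", 372.1, 365.6, 432.1),
    ("Assault", 233.7, 300.3, 350.0),
    ("James Bond", 29.2, 61.7, 68.3),
    ("Seaquest", 61.1, 206.3, 211.6),
    ("Asterix", 248.8, 285.7, 272.0),
    ("Kangaroo", 42.0, 38.7, 29.3),
    ("Alien", 184.8, 290.6, 216.9),
    ("Road Runner", 0.0, 524.1, 305.3),
    ("Frostbite", 74.0, 140.1, 59.4),
    ("Amidar", 11.8, 20.8, 7.9),
    ("Crazy Climber", 7339.5, 12558.3, 3668.7),
    ("Breakout", 0.9, 3.3, 0.7),
    ("UpNDown", 488.4, 1346.3, 250.7),
    ("Pong", -20.4, -19.5, -20.98),
]

def games_won_by_sac(rows):
    """Return the games where the SAC score is strictly above the Rainbow score."""
    return [game for game, _random, rainbow, sac in rows if sac > rainbow]

wins = games_won_by_sac(SCORES)
print(f"SAC beats Rainbow in {len(wins)} of {len(SCORES)} games")
print(", ".join(wins))
```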


This review was generated by AI and checked by a human editor.