QUICK REVIEW

[Paper Review] Soft Actor-Critic for Discrete Action Settings

Petros Christodoulou|arXiv (Cornell University)|Oct 16, 2019

Reinforcement Learning in Robotics | References: 12 | Citations: 209

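One-line summary

This paper derives a version of SAC for discrete action spaces (SAC-Discrete) and shows that, without any hyperparameter tuning, its sample efficiency is competitive with the state-of-the-art Rainbow on Atari games.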

ABSTRACT

Soft Actor-Critic is a state-of-the-art reinforcement learning algorithm for continuous action settings that is not applicable to discrete action settings. Many important settings involve discrete actions, however, and so here we derive an alternative version of the Soft Actor-Critic algorithm that is applicable to discrete action settings. We then show that, even without any hyperparameter tuning, it is competitive with the tuned model-free state-of-the-art on a selection of games from the Atari suite.

Motivation and objectives

Address the gap that SAC excels in continuous-action RL but is not applicable to discrete action settings.
Derive a discrete-action SAC variant by adjusting value, policy, and temperature updates.
Demonstrate SAC-Discrete efficiency on Atari games and compare to Rainbow under limited tuning.
Provide open-source implementation of SAC-Discrete.

Proposed method

Adapt soft Q-function to discrete actions by outputting Q-values for all actions: Q:S -> R^{|A|}.
Replace policy output with direct action distribution over A using softmax: pi:S -> [0,1]^{|A|}.
Compute V(s) and the temperature loss as direct expectations over the action distribution: V(s) = pi(s)^{T}[Q(s) - alpha log pi(s)] and J(alpha) = pi(s)^{T}[-alpha(log pi(s) + H)], where H is the target entropy.
Remove reparameterisation trick since actions are discrete and expectations are tractable.
Use two soft Q-networks and take their minimum to mitigate overestimation.
Provide Algorithm 1 (SAC-Discrete) detailing updates for Q-functions, policy, and temperature.
Report hyperparameters and experimental setup for Atari (no tuning beyond values from prior work).
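
The expectations listed above are plain dot products over the action dimension, which a few lines of NumPy can illustrate. The sketch below is not the paper's implementation: the function names, the batch layout (one row per state), and the choice of policy logits as input are assumptions made for illustration, and `target_entropy` stands for the H term in the temperature loss. It only evaluates the quantities; a real training loop would also need gradients and optimiser steps.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log of a softmax over the last (action) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def soft_state_value(logits, q1, q2, alpha):
    """V(s) = pi(s)^T [min(Q1(s), Q2(s)) - alpha * log pi(s)], one value per state."""
    log_pi = log_softmax(logits)
    pi = np.exp(log_pi)
    q_min = np.minimum(q1, q2)  # element-wise minimum of the two soft Q-networks
    return (pi * (q_min - alpha * log_pi)).sum(axis=-1)

def temperature_loss(logits, alpha, target_entropy):
    """J(alpha) = pi(s)^T [-alpha * (log pi(s) + H)], averaged over the batch."""
    log_pi = log_softmax(logits)
    pi = np.exp(log_pi)
    per_state = (pi * (-alpha * (log_pi + target_entropy))).sum(axis=-1)
    return per_state.mean()

# Hypothetical batch: 2 states, 4 actions, uniform policy (all logits zero).
logits = np.zeros((2, 4))
q1 = np.array([[1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 0.0, 0.0]])
q2 = q1 + 1.0
print(soft_state_value(logits, q1, q2, alpha=0.2))
print(temperature_loss(logits, alpha=0.2, target_entropy=np.log(4)))
```

With a uniform policy, V(s) reduces to the mean of the smaller Q-values plus alpha * log|A|, and the temperature loss vanishes when the target entropy equals log|A|. No sampling or reparameterisation is involved because the probability of every action is available directly.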

Experimental results

Research questions

RQ1: Can SAC be effectively adapted to discrete action spaces without sacrificing sample efficiency?
RQ2: How does SAC-Discrete perform on Atari compared with a tuned, strong baseline (Rainbow) in terms of sample efficiency?
RQ3: What architectural and algorithmic changes are needed to maintain low-variance, stable learning in discrete action SAC?
RQ4: Does SAC-Discrete require hyperparameter tuning to outperform or match existing discrete-action algorithms?

Key findings

Game	Random	Rainbow	SAC
Freeway	0.0	0.1	4.4
MsPacman	235.2	364.3	690.9
Enduro	0.0	0.53	0.8
BattleZone	2895.0	3363.5	4386.7
Qbert	166.1	235.6	280.5
Space Invaders	148.0	135.1	160.8
Beam Rider	372.1	365.6	432.1
Assault	233.7	300.3	350.0
James Bond	29.2	61.7	68.3
Seaquest	61.1	206.3	211.6
Asterix	248.8	285.7	272.0
Kangaroo	42.0	38.7	29.3
Alien	184.8	290.6	216.9
Road Runner	0.0	524.1	305.3
Frostbite	74.0	140.1	59.4
Amidar	11.8	20.8	7.9
Crazy Climber	7339.5	12558.3	3668.7
Breakout	0.9	3.3	0.7
UpNDown	488.4	1346.3	250.7
Pong	-20.4	-19.5	-20.98

SAC-Discrete achieves competitive sample efficiency relative to Rainbow on 20 Atari games with five seeds.
SAC-Discrete outscores Rainbow in 10 of the 20 games; the median performance difference is -1%, with individual games ranging from -99% to +4330%.
SAC-Discrete does not rely on hyperparameter tuning to reach competitive results.
The paper provides a public Python implementation (GitHub).
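
The win count can be checked directly against the score table. The snippet below is a small sanity check, not part of the paper: the scores are copied from the Rainbow and SAC columns of the table, and the dictionary layout and variable names are chosen here for illustration. It does not attempt to reproduce the percentage differences, since the review does not state how they are normalised.

```python
# Scores copied from the table: game -> (Rainbow, SAC-Discrete).
scores = {
    "Freeway": (0.1, 4.4),
    "MsPacman": (364.3, 690.9),
    "Enduro": (0.53, 0.8),
    "BattleZone": (3363.5, 4386.7),
    "Qbert": (235.6, 280.5),
    "Space Invaders": (135.1, 160.8),
    "Beam Rider": (365.6, 432.1),
    "Assault": (300.3, 350.0),
    "James Bond": (61.7, 68.3),
    "Seaquest": (206.3, 211.6),
    "Asterix": (285.7, 272.0),
    "Kangaroo": (38.7, 29.3),
    "Alien": (290.6, 216.9),
    "Road Runner": (524.1, 305.3),
    "Frostbite": (140.1, 59.4),
    "Amidar": (20.8, 7.9),
    "Crazy Climber": (12558.3, 3668.7),
    "Breakout": (3.3, 0.7),
    "UpNDown": (1346.3, 250.7),
    "Pong": (-19.5, -20.98),
}

# Games in which SAC-Discrete scores strictly higher than Rainbow.
sac_wins = [game for game, (rainbow, sac) in scores.items() if sac > rainbow]
print(f"SAC-Discrete leads in {len(sac_wins)} of {len(scores)} games")
```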

This review was generated by AI and checked by a human editor.