QUICK REVIEW

[Paper Review] Soft Actor-Critic for Discrete Action Settings

Petros Christodoulou | arXiv (Cornell University) | October 16, 2019

Reinforcement Learning in Robotics | References: 12 | Citations: 209

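One-Line Summary

This paper derives a version of SAC for discrete action spaces (SAC-Discrete) and shows that, without hyperparameter tuning, it achieves sample efficiency competitive with Rainbow on Atari games.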

ABSTRACT

Soft Actor-Critic is a state-of-the-art reinforcement learning algorithm for continuous action settings that is not applicable to discrete action settings. Many important settings involve discrete actions, however, and so here we derive an alternative version of the Soft Actor-Critic algorithm that is applicable to discrete action settings. We then show that, even without any hyperparameter tuning, it is competitive with the tuned model-free state-of-the-art on a selection of games from the Atari suite.

Motivation and Objectives

Motivation: SAC excels in continuous-action RL but, in its original form, cannot be applied to discrete action settings.
Derive a discrete-action SAC variant by adjusting value, policy, and temperature updates.
Demonstrate SAC-Discrete efficiency on Atari games and compare to Rainbow under limited tuning.
Provide open-source implementation of SAC-Discrete.

Proposed Method

Adapt soft Q-function to discrete actions by outputting Q-values for all actions: Q:S -> R^{|A|}.
Replace policy output with direct action distribution over A using softmax: pi:S -> [0,1]^{|A|}.
Compute V(s) and the temperature loss as exact expectations over actions: V(s) = pi(s)^T [Q(s) - alpha log pi(s)], J(alpha) = pi(s)^T [-alpha (log pi(s) + H)], where H is the target entropy.
Remove reparameterisation trick since actions are discrete and expectations are tractable.
Use two soft Q-networks and take their minimum to mitigate overestimation.
Provide Algorithm 1 (SAC-Discrete) detailing updates for Q-functions, policy, and temperature.
Report hyperparameters and experimental setup for Atari (no tuning beyond values from prior work).
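
The direct expectations in the list above can be sketched in a few lines of plain Python. This is a minimal illustration for a single state, not the paper's implementation: the function names (`softmax`, `soft_state_value`, `temperature_loss`, `soft_q_target`) are hypothetical, the Q-values are passed in as plain lists rather than produced by networks, and the bootstrap target `r + gamma * V(s')` with `gamma=0.99` is assumed from standard SAC rather than taken from this review.

```python
import math


def softmax(logits):
    """Turn policy logits into action probabilities pi(s)."""
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_state_value(probs, q1, q2, alpha):
    """V(s) = pi(s)^T [min(Q1(s), Q2(s)) - alpha * log pi(s)]."""
    return sum(
        p * (min(a, b) - alpha * math.log(p))
        for p, a, b in zip(probs, q1, q2)
        if p > 0.0  # a zero-probability action contributes nothing
    )


def temperature_loss(probs, alpha, target_entropy):
    """J(alpha) = pi(s)^T [-alpha * (log pi(s) + H)], H = target entropy."""
    return sum(
        p * (-alpha * (math.log(p) + target_entropy))
        for p in probs
        if p > 0.0
    )


def soft_q_target(reward, done, next_value, gamma=0.99):
    """Assumed standard SAC target: r + gamma * V(s'), no bootstrap at episode end."""
    return reward + (0.0 if done else gamma * next_value)


# Uniform policy over 4 actions and two hypothetical Q-value vectors.
probs = softmax([0.0, 0.0, 0.0, 0.0])
value = soft_state_value(probs, [1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 4.0, 3.0], alpha=0.5)
print(value)
```

Because the sums run over every action, no action needs to be sampled to evaluate V(s) or J(alpha), which is why the reparameterisation trick is unnecessary here. In a real agent these quantities would be averaged over a replay batch and differentiated with an autodiff library.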

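Experimental Results

Research Questions

RQ1: Can SAC be effectively adapted to discrete action spaces without sacrificing sample efficiency?
RQ2: How does SAC-Discrete perform on Atari in terms of sample efficiency compared with strong tuned baselines such as Rainbow?
RQ3: What architectural and algorithmic changes are needed to maintain low-variance, stable learning in discrete-action SAC?
RQ4: Does SAC-Discrete require hyperparameter tuning to match or outperform existing discrete-action algorithms?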

Key Results

Game	Random	Rainbow	SAC
Freeway	0.0	0.1	4.4
MsPacman	235.2	364.3	690.9
Enduro	0.0	0.53	0.8
BattleZone	2895.0	3363.5	4386.7
Qbert	166.1	235.6	280.5
Space Invaders	148.0	135.1	160.8
Beam Rider	372.1	365.6	432.1
Assault	233.7	300.3	350.0
James Bond	29.2	61.7	68.3
Seaquest	61.1	206.3	211.6
Asterix	248.8	285.7	272.0
Kangaroo	42.0	38.7	29.3
Alien	184.8	290.6	216.9
Road Runner	0.0	524.1	305.3
Frostbite	74.0	140.1	59.4
Amidar	11.8	20.8	7.9
Crazy Climber	7339.5	12558.3	3668.7
Breakout	0.9	3.3	0.7
UpNDown	488.4	1346.3	250.7
Pong	-20.4	-19.5	-20.98

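SAC-Discrete achieves sample efficiency competitive with Rainbow on 20 Atari games, using five seeds.
Across the 20 games, SAC-Discrete wins 10; the median performance difference is -1%, ranging from +4330% at best to -99% at worst.
SAC-Discrete does not rely on hyperparameter tuning to achieve competitive results.
The paper provides a public Python implementation on GitHub.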
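
The win count and median difference can be checked against the table with a short script. This is a sketch under one assumption: the performance difference is taken to be (SAC - Rainbow) / |Rainbow|, a formula this review does not state. The names `SCORES` and `relative_difference` are illustrative only.

```python
from statistics import median

# (game, Rainbow score, SAC score), copied from the results table.
SCORES = [
    ("Freeway", 0.1, 4.4), ("MsPacman", 364.3, 690.9),
    ("Enduro", 0.53, 0.8), ("BattleZone", 3363.5, 4386.7),
    ("Qbert", 235.6, 280.5), ("Space Invaders", 135.1, 160.8),
    ("Beam Rider", 365.6, 432.1), ("Assault", 300.3, 350.0),
    ("James Bond", 61.7, 68.3), ("Seaquest", 206.3, 211.6),
    ("Asterix", 285.7, 272.0), ("Kangaroo", 38.7, 29.3),
    ("Alien", 290.6, 216.9), ("Road Runner", 524.1, 305.3),
    ("Frostbite", 140.1, 59.4), ("Amidar", 20.8, 7.9),
    ("Crazy Climber", 12558.3, 3668.7), ("Breakout", 3.3, 0.7),
    ("UpNDown", 1346.3, 250.7), ("Pong", -19.5, -20.98),
]


def relative_difference(rainbow, sac):
    """Assumed metric: SAC's gain over Rainbow relative to |Rainbow|."""
    return (sac - rainbow) / abs(rainbow)


wins = sum(1 for _, rainbow, sac in SCORES if sac > rainbow)
diffs = [relative_difference(rainbow, sac) for _, rainbow, sac in SCORES]
median_diff = median(diffs)

print(f"wins: {wins} of {len(SCORES)}")
print(f"median relative difference: {median_diff:+.1%}")
```

Under this assumed formula the script reproduces the 10 wins and a median of roughly -1%. The extremes it yields need not match the reported +4330% and -99%, which presumably come from unrounded scores or a different normalization.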


This review was generated by AI and reviewed by a human editor.