QUICK REVIEW

[Paper Review] Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning

Stefan Elfwing, Eiji Uchibe|arXiv (Cornell University)|Feb 10, 2017

Reinforcement Learning in Robotics | References: 21 | Citations: 72

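One-line summary

Introduces the SiLU and dSiLU activation functions for reinforcement learning and shows that, combined with on-policy TD/Sarsa learning with eligibility traces and softmax action selection, they achieve new state-of-the-art results on SZ-Tetris and 10x10 Tetris and can outperform DQN and its variants on Atari 2600.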

ABSTRACT

In recent years, neural networks have enjoyed a renaissance as function approximators in reinforcement learning. Two decades after Tesauro's TD-Gammon achieved near top-level human performance in backgammon, the deep reinforcement learning algorithm DQN achieved human-level performance in many Atari 2600 games. The purpose of this study is twofold. First, we propose two activation functions for neural network function approximation in reinforcement learning: the sigmoid-weighted linear unit (SiLU) and its derivative function (dSiLU). The activation of the SiLU is computed by the sigmoid function multiplied by its input. Second, we suggest that the more traditional approach of using on-policy learning with eligibility traces, instead of experience replay, and softmax action selection with simple annealing can be competitive with DQN, without the need for a separate target network. We validate our proposed approach by, first, achieving new state-of-the-art results in both stochastic SZ-Tetris and Tetris with a small 10$\times$10 board, using TD($λ$) learning and shallow dSiLU network agents, and, then, by outperforming DQN in the Atari 2600 domain by using a deep Sarsa($λ$) agent with SiLU and dSiLU hidden units.

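Motivation and objectives

Motivate sigmoid-weighted linear units (SiLU) and their derivative function (dSiLU) as activation functions for neural network function approximators in reinforcement learning.
Compare on-policy TD(lambda) and Sarsa(lambda) learning with eligibility traces against deep Q-learning variants.
Demonstrate state-of-the-art performance on SZ-Tetris, 10x10 Tetris, and Atari 2600 with SiLU/dSiLU networks.
Examine the effect of softmax action selection versus epsilon-greedy exploration in high-dimensional domains.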

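Proposed method

Define the SiLU activation as a_k(s) = z_k * sigma(z_k), where z_k is the pre-activation (input) of hidden unit k and sigma is the sigmoid function.
Define the dSiLU activation as a_k(s) = sigma(z_k) * (1 + z_k * (1 - sigma(z_k))).
Use TD(lambda) to learn V^pi and Sarsa(lambda) to learn Q^pi, with the gradient-descent update theta_{t+1} = theta_t + alpha * delta_t * e_t, where delta_t is the TD error and e_t is the eligibility trace.
Compute the gradients for SiLU and dSiLU units using Eqs. (11) and (12) of the paper.
Apply softmax action selection with a Boltzmann distribution, annealing the temperature tau after each episode.
Evaluate SiLU/dSiLU networks on SZ-Tetris (shallow and deep networks), 10x10 Tetris, and Atari 2600.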
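
The pieces listed above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation. The function names are made up for this example. The accumulating trace e_t = gamma * lambda * e_{t-1} + gradient and the annealing schedule tau0 / (1 + tau_k * episode) are assumptions, since this review only states that traces are used and that tau is annealed per episode. The dSiLU derivative is obtained by differentiating the dSiLU formula directly.

```python
import numpy as np


def sigmoid(z):
    """Logistic sigmoid, sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))


def silu(z):
    """SiLU: the input multiplied by its sigmoid."""
    return z * sigmoid(z)


def dsilu(z):
    """dSiLU: the derivative of SiLU with respect to z, used as an activation."""
    s = sigmoid(z)
    return s * (1.0 + z * (1.0 - s))


def dsilu_grad(z):
    """Derivative of dSiLU, found by differentiating the formula in dsilu()."""
    s = sigmoid(z)
    return s * (1.0 - s) * (2.0 + z * (1.0 - s) - z * s)


def boltzmann_probs(values, tau):
    """Softmax (Boltzmann) action probabilities at temperature tau."""
    # Subtracting the maximum keeps exp() from overflowing at small tau.
    prefs = (np.asarray(values, dtype=float) - np.max(values)) / tau
    weights = np.exp(prefs)
    return weights / weights.sum()


def anneal_tau(tau0, tau_k, episode):
    """Assumed per-episode schedule: tau0 / (1 + tau_k * episode)."""
    return tau0 / (1.0 + tau_k * episode)


def trace_update(theta, trace, grad, delta, alpha, gamma, lam):
    """One step of theta_{t+1} = theta_t + alpha * delta_t * e_t.

    Assumes an accumulating trace: e_t = gamma * lam * e_{t-1} + grad,
    where grad is the gradient of the value estimate w.r.t. theta.
    """
    trace = gamma * lam * trace + grad
    theta = theta + alpha * delta * trace
    return theta, trace
```

As a usage example, `boltzmann_probs([1.0, 2.0, 3.0], tau=anneal_tau(1.0, 0.5, episode))` gives action probabilities that concentrate on the highest-valued action as `episode` grows. Note that `dsilu` doubles as the gradient of a SiLU unit, which is why only `dsilu_grad` needs a separate formula.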

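Experimental results

Research questions

RQ1: How do the SiLU and dSiLU activation functions affect learning performance in reinforcement learning compared with conventional activation functions (ReLU, sigmoid)?
RQ2: Can on-policy TD(lambda)/Sarsa(lambda) with eligibility traces and softmax action selection compete with DQN/Double DQN on benchmark tasks?
RQ3: Do deep architectures with SiLU/dSiLU exceed the prior state of the art on SZ-Tetris, 10x10 Tetris, and Atari 2600?
RQ4: With SiLU/dSiLU networks, how does softmax exploration compare with epsilon-greedy exploration in these domains?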

Key findings

Network	Final mean score	Final best score	Notes
Shallow SiLU	214 ± 74	253 ± 83	SZ-Tetris, TD(lambda) with 50 hidden units
Shallow ReLU	191 ± 58	227 ± 76	SZ-Tetris, TD(lambda) with 50 hidden units
Shallow dSiLU	263 ± 80	320 ± 87	SZ-Tetris, TD(lambda) with 50 hidden units (state features)
Shallow Sigmoid	232 ± 75	293 ± 73	SZ-Tetris, TD(lambda) with 50 hidden units
Deep SiLU-SiLU	217 ± 53	219 ± 54	SZ-Tetris, two conv layers + 250 FC units
Deep ReLU-ReLU	215 ± 54	217 ± 52	SZ-Tetris, two conv layers + 250 FC units
Deep SiLU-dSiLU	229 ± 55	235 ± 54	SZ-Tetris, conv + 250 FC with SiLU in conv and dSiLU in FC
10x10 dSiLU	4,900	5,300	10x10 Tetris, 250 hidden nodes, 400k episodes
Atari 12-games (SiLU-dSiLU)	Mean 332% (median 125%)	—	Compared to DQN, Gorila, and Double DQN
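
The Atari row reports scores relative to DQN. A minimal sketch of how such a percentage might be computed is shown here, assuming the common baseline-relative normalization (agent minus random, divided by DQN minus random). This review does not state the formula. The function name, the game labels, and every score below are made-up placeholders, not values from the paper.

```python
from statistics import mean, median


def dqn_normalized(agent_score, dqn_score, random_score):
    """Agent score as a percentage of DQN's gain over a random policy.

    Assumed formula: 100 * (agent - random) / (DQN - random).
    """
    return 100.0 * (agent_score - random_score) / (dqn_score - random_score)


# Made-up (agent, DQN, random) scores for three placeholder games.
example_scores = {
    "game_a": (300.0, 100.0, 0.0),
    "game_b": (150.0, 120.0, 20.0),
    "game_c": (45.0, 50.0, 10.0),
}

normalized = [dqn_normalized(*scores) for scores in example_scores.values()]
summary = (mean(normalized), median(normalized))
```

Under this normalization, 100% means parity with DQN. A mean well above the median, as in the reported 332% versus 125%, is what one would expect when a few games with very large gains dominate the average.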

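On SZ-Tetris, the shallow dSiLU network performed best, with a final mean score of 263 and a best run of 320, ahead of the sigmoid, SiLU, and ReLU networks; the shallow SiLU network outperformed ReLU but not sigmoid.
The deep SiLU-dSiLU network outperformed SiLU-SiLU and ReLU-ReLU on SZ-Tetris with a final mean score of 229, exceeding the previous state of the art.
On 10x10 Tetris, a dSiLU network with 250 hidden units set a new state of the art, with a final mean score of 4,900 and a best run of 5,300.
On Atari 2600, the deep SiLU-dSiLU agent achieved higher mean and median DQN-normalized scores than DQN and Double DQN across 12 games (mean 332%, median 125%).
TD(lambda) and Sarsa(lambda) provide accurate value estimates, without the overestimation bias from the max operator seen in Q-learning-based methods.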
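
The overestimation bias mentioned in the last finding can be illustrated with a small simulation. This sketch shows only the statistical effect and does not reproduce any experiment from the paper. The function name and all parameters are hypothetical. As a simplification, the "sampled" action is drawn uniformly and independently of the noise, rather than from a learned softmax policy.

```python
import numpy as np


def max_bias_demo(n_actions=10, n_trials=10_000, noise_std=1.0, seed=0):
    """Compare a max-based target with a sampled-action target.

    Every action has a true value of 0, and each estimate is corrupted by
    zero-mean Gaussian noise, so any nonzero average is pure bias.
    """
    rng = np.random.default_rng(seed)
    noisy_q = rng.normal(0.0, noise_std, size=(n_trials, n_actions))

    # Q-learning-style target: maximum over the noisy estimates.
    max_target = noisy_q.max(axis=1).mean()

    # Sampled-action target: estimate of an action chosen independently
    # of the noise (uniformly at random in this simplified setting).
    chosen = rng.integers(0, n_actions, size=n_trials)
    sampled_target = noisy_q[np.arange(n_trials), chosen].mean()

    return max_target, sampled_target


max_target, sampled_target = max_bias_demo()
```

With these settings, `max_target` comes out clearly positive even though every true value is zero, while `sampled_target` stays close to zero.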

This review was generated by AI and checked by a human editor.