QUICK REVIEW

[Paper Review] Searching for Activation Functions

Prajit Ramachandran, Barret Zoph|arXiv (Cornell University)|Oct 16, 2017

Domain Adaptation and Few-Shot Learning | References: 39 | Citations: 750

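One-Sentence Summary

This paper uses automated search to discover scalar activation functions, introduces Swish (f(x) = x·sigmoid(βx)), and shows that Swish often outperforms ReLU on deeper models across several tasks.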

ABSTRACT

The choice of activation functions in deep networks has a significant effect on the training dynamics and task performance. Currently, the most successful and widely-used activation function is the Rectified Linear Unit (ReLU). Although various hand-designed alternatives to ReLU have been proposed, none have managed to replace it due to inconsistent gains. In this work, we propose to leverage automatic search techniques to discover new activation functions. Using a combination of exhaustive and reinforcement learning-based search, we discover multiple novel activation functions. We verify the effectiveness of the searches by conducting an empirical evaluation with the best discovered activation function. Our experiments show that the best discovered activation function, $f(x) = x \cdot \text{sigmoid}(\beta x)$, which we name Swish, tends to work better than ReLU on deeper models across a number of challenging datasets. For example, simply replacing ReLUs with Swish units improves top-1 classification accuracy on ImageNet by 0.9% for Mobile NASNet-A and 0.6% for Inception-ResNet-v2. The simplicity of Swish and its similarity to ReLU make it easy for practitioners to replace ReLUs with Swish units in any neural network.

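Motivation and Objectives

Motivate the problem: the choice of activation function affects training dynamics and task performance.
Propose a search-based approach for discovering new scalar activation functions.
Identify and validate the top activation functions found by exhaustive and reinforcement learning-based search.
Demonstrate the empirical benefits of Swish across multiple architectures and datasets.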

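Proposed Method

Design a search space in which an activation function is built by repeating a single core unit composed of unary and binary functions.
Use exhaustive enumeration for small search spaces, and an RNN controller trained with reinforcement learning to propose candidate functions for very large ones.
Train a child network (e.g., ResNet-20 on CIFAR-10) with each candidate and score the candidate by validation accuracy.
Parallelize candidate training with distributed training, and update the search policy based on the reward.
Define Swish as f(x) = x·sigmoid(βx), with β either fixed or learnable, and analyze its properties and derivative.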
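
The core-unit construction and the exhaustive search over a small space can be sketched as follows. This is a minimal illustration, not the authors' implementation: the operator sets `UNARY` and `BINARY` are hypothetical and far smaller than the search space described in the paper, and the helper names `core_unit` and `enumerate_candidates` are made up for this example. Scoring each candidate by training a child network is omitted; the sketch only shows how candidates of the form b(u1(x), u2(x)) are composed and enumerated.

```python
import itertools

import numpy as np


def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-x))


# Hypothetical, heavily reduced operator sets (assumption for illustration).
UNARY = {
    "x": lambda x: x,
    "sigmoid(x)": sigmoid,
    "tanh(x)": np.tanh,
    "max(x, 0)": lambda x: np.maximum(x, 0.0),
}
BINARY = {
    "mul": lambda a, b: a * b,
    "add": lambda a, b: a + b,
    "max": np.maximum,
}


def core_unit(u1, u2, b):
    """Build f(x) = b(u1(x), u2(x)) from two unary names and one binary name."""
    return lambda x: BINARY[b](UNARY[u1](x), UNARY[u2](x))


def enumerate_candidates():
    """Exhaustively list every single-core-unit candidate in the toy space."""
    return {
        (u1, u2, b): core_unit(u1, u2, b)
        for u1, u2, b in itertools.product(UNARY, UNARY, BINARY)
    }


candidates = enumerate_candidates()

# Swish with beta = 1 appears in this toy space as x * sigmoid(x).
swish_1 = candidates[("x", "sigmoid(x)", "mul")]
```

In a full search, each entry of `candidates` would be plugged into a child network and ranked by validation accuracy; for spaces too large to enumerate, a learned controller would propose the `(u1, u2, b)` choices instead of `itertools.product`.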

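Experimental Results

Research Questions

RQ1: Can automated search discover activation functions that outperform hand-designed ones such as ReLU?
RQ2: What characterizes the high-performing activation functions found by the search?
RQ3: Does Swish generalize beyond the search setting to multiple model families and tasks?
RQ4: How does Swish compare with ReLU on large-scale benchmarks such as ImageNet and machine translation?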

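Key Findings

Swish (f(x) = x·sigmoid(βx)) often matches or outperforms ReLU on deep networks.
Both Swish-1 (β fixed to 1) and Swish with a learnable β frequently outperform ReLU on CIFAR-10/100, mobile ImageNet models, and several other ImageNet architectures.
The top activation functions are simple (1–2 core units) and often use the raw preactivation x as an input to the final binary function.
Swish is smooth, non-monotonic, and unbounded above; its gradient properties differ from those of ReLU, and its optimization behavior in practice is favorable.
On ImageNet, replacing ReLU with Swish improves top-1 accuracy by 0.9% for Mobile NASNet-A and 0.6% for Inception-ResNet-v2.
Swish-1 and Swish consistently match or exceed the baseline across multiple model families and tasks, including Transformers for machine translation.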
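
The shape properties listed above can be checked numerically. The sketch below is a plain NumPy illustration rather than code from the paper: it implements f(x) = x·sigmoid(βx) and its derivative f'(x) = σ(βx) + βx·σ(βx)(1 − σ(βx)), which follows from the product rule. The function names, the evaluation grid, and the choice β = 50 for the ReLU comparison are assumptions made for this example.

```python
import numpy as np


def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-x))


def swish(x, beta=1.0):
    """Swish: f(x) = x * sigmoid(beta * x)."""
    return x * sigmoid(beta * x)


def swish_grad(x, beta=1.0):
    """Derivative of Swish with respect to x (product rule)."""
    s = sigmoid(beta * x)
    return s + beta * x * s * (1.0 - s)


def relu(x):
    """ReLU, used here only as a reference point."""
    return np.maximum(x, 0.0)


# Evaluate Swish-1 on a grid (grid range and resolution are arbitrary choices).
xs = np.linspace(-6.0, 6.0, 1201)
ys = swish(xs)

# Non-monotonic: the function dips below zero for negative inputs before
# rising again, so its minimum lies at a negative x and the gradient there
# changes sign.
min_index = int(np.argmin(ys))
x_at_min, y_min = xs[min_index], ys[min_index]

# With a large beta the sigmoid approaches a step function, so Swish
# approaches ReLU; beta = 50 is an arbitrary illustrative value.
gap_to_relu = np.max(np.abs(swish(xs, beta=50.0) - relu(xs)))
```

For Swish-1 the dip is shallow: `y_min` comes out slightly below zero at an `x_at_min` a little beyond −1. The nonzero, sign-changing gradient for negative inputs is one concrete way in which Swish differs from ReLU, whose gradient is exactly zero there.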

This review was generated by AI and checked by a human editor.