QUICK REVIEW

[Paper Review] Searching for Activation Functions

Prajit Ramachandran, Barret Zoph|arXiv (Cornell University)|Oct 16, 2017

Domain Adaptation and Few-Shot Learning|References: 39|Cited by: 750

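One-Sentence Summary

The paper uses automated search to discover scalar activation functions, proposes Swish (f(x) = x·sigmoid(βx)), and shows that Swish often outperforms ReLU on deeper models across a range of tasks.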

ABSTRACT

The choice of activation functions in deep networks has a significant effect on the training dynamics and task performance. Currently, the most successful and widely-used activation function is the Rectified Linear Unit (ReLU). Although various hand-designed alternatives to ReLU have been proposed, none have managed to replace it due to inconsistent gains. In this work, we propose to leverage automatic search techniques to discover new activation functions. Using a combination of exhaustive and reinforcement learning-based search, we discover multiple novel activation functions. We verify the effectiveness of the searches by conducting an empirical evaluation with the best discovered activation function. Our experiments show that the best discovered activation function, $f(x) = x \cdot \text{sigmoid}(\beta x)$, which we name Swish, tends to work better than ReLU on deeper models across a number of challenging datasets. For example, simply replacing ReLUs with Swish units improves top-1 classification accuracy on ImageNet by 0.9\% for Mobile NASNet-A and 0.6\% for Inception-ResNet-v2. The simplicity of Swish and its similarity to ReLU make it easy for practitioners to replace ReLUs with Swish units in any neural network.

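Research Motivation and Goals

Motivate why the choice of activation function matters for training dynamics and task performance.
Propose a search-based method for discovering novel scalar activation functions.
Identify and validate the top activation functions found by exhaustive and reinforcement learning-based search.
Demonstrate the empirical gains of Swish across multiple architectures and datasets.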

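Proposed Method

Design a search space built from unary and binary functions, in which an activation function is composed by repeating a core unit.
Search small spaces exhaustively; for large spaces, use an RNN controller trained with reinforcement learning to propose candidate functions.
Train a child network (e.g., ResNet-20 on CIFAR-10) and evaluate each candidate function by its validation accuracy.
Parallelize the training of candidate activation functions with distributed training, and update the search policy based on the reward.
Define Swish as f(x) = x·sigmoid(βx), where β is either fixed or trainable, and analyze its properties and derivative.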
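
To make the search-space idea concrete, here is a minimal Python sketch of a single core unit of the form b(u1(x), u2(x)), where u1 and u2 are unary functions and b is a binary function. The particular function tables, the names `UNARY`, `BINARY`, `core_unit`, and `enumerate_candidates`, and the dictionary layout are assumptions made for illustration; the paper's actual search space is larger and can stack several core units.

```python
import math
from itertools import product


def sigmoid(x):
    # Numerically stable logistic function for scalar inputs.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


# Illustrative subset of unary functions (assumed, not the full search space).
UNARY = {
    "identity": lambda x: x,
    "negate": lambda x: -x,
    "abs": abs,
    "square": lambda x: x * x,
    "sigmoid": sigmoid,
    "tanh": math.tanh,
    "relu": lambda x: max(x, 0.0),
}

# Illustrative subset of binary functions (assumed).
BINARY = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
    "max": max,
    "min": min,
}


def core_unit(binary_name, unary1_name, unary2_name):
    """Build the scalar function x -> b(u1(x), u2(x))."""
    b = BINARY[binary_name]
    u1 = UNARY[unary1_name]
    u2 = UNARY[unary2_name]
    return lambda x: b(u1(x), u2(x))


def enumerate_candidates():
    """Exhaustively yield every single-core-unit candidate in this toy space."""
    for b, u1, u2 in product(BINARY, UNARY, UNARY):
        yield (b, u1, u2), core_unit(b, u1, u2)


# Swish with beta = 1 is one point in this space: x * sigmoid(x).
swish_1 = core_unit("mul", "identity", "sigmoid")
candidates = dict(enumerate_candidates())
```

In this toy space every candidate can be enumerated directly, which corresponds to the exhaustive-search case. Scoring each candidate would still require training a child network, which the sketch leaves out.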

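Experimental Results

Research Questions

RQ1: Can automated search discover activation functions that outperform the hand-designed ReLU?
RQ2: What characteristics do the high-performing activation functions found by the search share?
RQ3: Does Swish generalize to model families and tasks beyond the search setting?
RQ4: How does Swish compare with ReLU on large-scale datasets such as ImageNet and on NLP translation tasks?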

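Key Findings

Swish (f(x) = x·sigmoid(βx)) often matches or outperforms ReLU in deep networks across datasets and architectures.
Swish with β fixed at 1 (Swish-1) or with trainable β frequently outperforms ReLU on CIFAR-10/100, on mobile ImageNet models, and on several other ImageNet architectures.
The top activation functions tend to be simple (1–2 core units) and often feed the raw preactivation x into the final binary function.
Swish is smooth, non-monotonic, and unbounded above; its gradient behaves differently from ReLU's, and it shows favorable optimization behavior in practice.
On ImageNet, replacing ReLU with Swish improves top-1 accuracy by 0.9% for Mobile NASNet-A and 0.6% for Inception-ResNet-v2.
Swish-1 and Swish consistently match or exceed the baselines across model families and tasks, including machine translation with the Transformer.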
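
The properties listed above can be checked numerically. The following is a minimal scalar sketch, not the authors' implementation: it computes Swish and its derivative f'(x) = σ(βx) + βx·σ(βx)·(1 − σ(βx)), which follows from the product rule. The helper names `swish`, `swish_grad`, `relu`, and `numeric_grad` are assumptions made for illustration.

```python
import math


def sigmoid(x):
    # Numerically stable logistic function for scalar inputs.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def swish(x, beta=1.0):
    """Swish: f(x) = x * sigmoid(beta * x)."""
    return x * sigmoid(beta * x)


def swish_grad(x, beta=1.0):
    """Analytic derivative of Swish with respect to x (product rule)."""
    s = sigmoid(beta * x)
    return s + beta * x * s * (1.0 - s)


def relu(x):
    return max(x, 0.0)


def numeric_grad(f, x, h=1e-5):
    # Central finite difference, used only to sanity-check swish_grad.
    return (f(x + h) - f(x - h)) / (2.0 * h)


if __name__ == "__main__":
    # Non-monotonic: the function dips below zero and then rises back toward zero.
    for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
        print(x, swish(x), swish_grad(x), relu(x))
```

Two limiting cases are worth noting: with β = 0 the function reduces to the linear map x/2, and as β grows it approaches ReLU, so a trainable β lets the unit interpolate between the two. For small negative inputs the gradient of Swish is negative rather than zero, which is one concrete way its gradient differs from ReLU's.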

This review was generated by AI and reviewed by human editors.