QUICK REVIEW

[Paper Review] Exponentially Increasing the Capacity-to-Computation Ratio for Conditional Computation in Deep Learning

Kyunghyun Cho, Yoshua Bengio | arXiv (Cornell University) | Jun 28, 2014

Neural Networks and Applications | References: 26 | Cited by: 27

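One-Sentence Summary

This paper proposes a tree-structured parametrization of neural network weight matrices in which parameters are switched on by bit patterns of hidden unit activations, so that model capacity can grow exponentially without a matching increase in computation. In theory, the approach improves the capacity-to-computation ratio by a factor of $\frac{2^k}{k}$ over a standard network, and it maintains regularization through time-aware weight decay.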

ABSTRACT

Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.

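Motivation and Objectives

Address a fundamental limitation of deep neural networks: capacity (the number of parameters) grows only linearly with computation, which hinders scaling models up.
Allow deep networks to exploit far larger models and datasets than before without a significant increase in inference or training computation.
Combine the statistical efficiency of deep distributed representations with the computational efficiency of decision trees to reach an exponential capacity-to-computation ratio.
Design a differentiable, trainable conditional computation mechanism in which the number of parameters grows exponentially with computation while the added overhead stays very low.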

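Proposed Method

Parametrize each weight matrix with a tree-structured table of vectors indexed by sign-bit patterns of hidden unit activations.
For each unit $j$, maintain a set of weight vectors $T(j, \mathbf{b}_{1\ldots l})$, where $\mathbf{b}_{1\ldots l}$ is a binary prefix of length $l$, giving on the order of $2^k$ vectors in total for $k$-bit prefixes.
Compute the effective weight matrix from the selected vectors, using a gating mechanism driven by the signs of the input activations.
Apply time-aware regularization: a vector that has been inactive for $\Delta t$ steps is pre-multiplied by $(1 - \epsilon\lambda)^{\Delta t}$ to make up for the regularization steps it skipped.
Train with standard backpropagation through the network, treating the gating decisions as non-differentiable while still letting gradients flow to the selected weight vectors.
Explore alternative credit-assignment strategies, including REINFORCE-based training of the gating units and a mechanism inspired by noisy ReLUs that modulates each weight vector's contribution.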
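
The parametrization described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the class name `TreeWeights`, its methods, the random initialization, and the choice to gate on the signs of the first $k$ input activations are all hypothetical. It also assumes that the effective weight vector of a unit is the sum of the vectors stored along the tree path selected by the sign bits, which is consistent with the $O(kq)$ per-unit cost quoted in this review but is not spelled out here.

```python
import random


class TreeWeights:
    """Hypothetical table T(j, b_1..l): one q-dim vector per unit j and bit prefix."""

    def __init__(self, p, q, k, seed=0):
        assert k <= q, "this sketch gates on the first k input activations"
        self.p, self.q, self.k = p, q, k   # p units, q inputs, k gating bits
        self.rng = random.Random(seed)
        self.table = {}                    # (j, prefix) -> vector, created on first use

    def vector(self, j, prefix):
        """Fetch T(j, prefix), allocating it lazily with small random values."""
        key = (j, prefix)
        if key not in self.table:
            self.table[key] = [self.rng.gauss(0.0, 0.1) for _ in range(self.q)]
        return self.table[key]

    def effective_row(self, j, bits):
        """Assumed combination rule: sum the k+1 vectors on the root-to-leaf path."""
        row = [0.0] * self.q
        for l in range(self.k + 1):        # prefixes of length 0..k, so O(kq) work
            for i, v in enumerate(self.vector(j, bits[:l])):
                row[i] += v
        return row

    def forward(self, x):
        """Select a path from the sign bits of x, then do an ordinary dot product."""
        bits = tuple(1 if xi > 0 else 0 for xi in x[: self.k])
        return [
            sum(w * xi for w, xi in zip(self.effective_row(j, bits), x))
            for j in range(self.p)
        ]
```

Because vectors are allocated only when a prefix is first visited, an input touches just $k+1$ vectors per unit, while a fully populated tree would hold $2^{k+1}-1$ vectors per unit. That gap between what is stored and what is touched per example is the source of the capacity-to-computation gain.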

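Experimental Results

Research Questions

RQ1: Can we design a parametrization in which the number of parameters in a deep network grows exponentially relative to computation?
RQ2: When exponentially many parameters are used, how can the model keep its ability to generalize and avoid overfitting?
RQ3: Is there an effective, differentiable way to train the gating mechanism that selects which parameter vectors are used?
RQ4: Can we substantially raise the capacity-to-computation ratio without a sharp increase in computational overhead?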

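Key Findings

The proposed method raises the ratio of degrees of freedom to computation to $\frac{2^k}{k}$, which grows exponentially with $k$, the number of sign bits used for gating.
Computing the effective weight matrix costs $O(kq)$ per unit, a reasonable overhead next to the $O(pq)$ multiply-add operations required for a standard matrix multiplication.
By tracking the time since each weight vector was last updated, the method regularizes efficiently: an inactive weight vector receives a time-compensated decay factor when it is next used.
Although experimental validation is still needed, the method is theoretically sound and is expected to be promising for large-scale datasets such as those in speech and language modeling.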
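
The time-compensated decay can be illustrated with a small sketch. It assumes plain SGD with L2 weight decay, where every step applies $w \leftarrow (1 - \epsilon\lambda)\,w - \epsilon g$; for a vector that is not selected, $g = 0$, so $\Delta t$ skipped steps collapse into a single factor $(1 - \epsilon\lambda)^{\Delta t}$. The class `LazyDecayVector` and its method names are hypothetical and are not taken from the paper.

```python
class LazyDecayVector:
    """A weight vector that applies its skipped weight-decay steps on demand."""

    def __init__(self, values, lr, weight_decay):
        self.values = list(values)
        self.lr = lr                      # epsilon in the decay factor
        self.weight_decay = weight_decay  # lambda in the decay factor
        self.last_update = 0              # last step whose update is reflected

    def catch_up(self, step):
        """Apply the decay of all zero-gradient steps up to and including `step`."""
        dt = step - self.last_update
        if dt > 0:
            factor = (1.0 - self.lr * self.weight_decay) ** dt
            self.values = [v * factor for v in self.values]
            self.last_update = step

    def sgd_step(self, grad, step):
        """Catch up on the inactive steps, then do one ordinary decayed SGD step."""
        self.catch_up(step - 1)
        shrink = 1.0 - self.lr * self.weight_decay
        self.values = [shrink * v - self.lr * g for v, g in zip(self.values, grad)]
        self.last_update = step
```

With this bookkeeping, a vector that is selected at step 5 after four idle steps ends up with the same values (up to floating-point rounding) as one that was decayed eagerly at every step, but only vectors on the active path are ever touched. For scale, with $k = 10$ gating bits the quoted ratio $\frac{2^k}{k}$ is $1024/10 \approx 102$.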


This review was generated by AI and checked by human editors.