QUICK REVIEW

[Paper Review] Learning Fast Algorithms for Linear Transforms Using Butterfly Factorizations

Tri Dao, Albert Gu|arXiv (Cornell University)|Mar 14, 2019

Tensor decomposition and applications|References: 48|Cited by: 31

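One-Sentence Summary

The paper proposes a differentiable, parameterized butterfly factorization that automatically learns fast $O(N\log N)$ algorithms by representing structured linear transforms (such as the DFT, DCT, Hadamard transform, and convolution) as products of sparse, structured block-diagonal matrices. Used as a layer in end-to-end machine learning models, it exceeds the accuracy of an unconstrained fully connected layer on CIFAR-10 by 3.9 percentage points, with 40x fewer parameters and 4x faster inference.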

ABSTRACT

Fast linear transforms are ubiquitous in machine learning, including the discrete Fourier transform, discrete cosine transform, and other structured transformations such as convolutions. All of these transforms can be represented by dense matrix-vector multiplication, yet each has a specialized and highly efficient (subquadratic) algorithm. We ask to what extent hand-crafting these algorithms and implementations is necessary, what structural priors they encode, and how much knowledge is required to automatically learn a fast algorithm for a provided structured transform. Motivated by a characterization of fast matrix-vector multiplication as products of sparse matrices, we introduce a parameterization of divide-and-conquer methods that is capable of representing a large class of transforms. This generic formulation can automatically learn an efficient algorithm for many important transforms; for example, it recovers the $O(N \log N)$ Cooley-Tukey FFT algorithm to machine precision, for dimensions $N$ up to $1024$. Furthermore, our method can be incorporated as a lightweight replacement of generic matrices in machine learning pipelines to learn efficient and compressible transformations. On a standard task of compressing a single hidden-layer network, our method exceeds the classification accuracy of unconstrained matrices on CIFAR-10 by 3.9 points -- the first time a structured approach has done so -- with 4X faster inference speed and 40X fewer parameters.

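Motivation and Objectives

Automatically learn efficient, subquadratic algorithms for structured linear transforms without hand-crafting them.
Learn fast transforms through end-to-end differentiable training, reducing reliance on platform-specific, hand-optimized libraries such as FFTW and cuFFT.
Investigate whether a structured parameterization can outperform unconstrained dense layers on real-world machine learning benchmarks.
Characterize the minimal inductive bias needed to learn a fast algorithm through sparse matrix factorization.
Integrate learnable fast-transform layers into deep learning pipelines to improve both efficiency and accuracy.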

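Proposed Method

Parameterize a linear transform as a product of $O(\log N)$ butterfly matrices, that is, structured, sparse block-diagonal matrices with $O(N)$ total parameters.
Learn the butterfly parameters from input-output pairs of the target transform using a differentiable optimization framework.
Exploit the recursive divide-and-conquer structure of the butterfly factorization to perform matrix-vector multiplication in $O(N\log N)$ time.
Express the transform as a composition of $O(\log N)$ stages, each applying a permutation and a block-diagonal matrix with $2^k$ blocks of size $N/2^k$.
Use butterfly layers in neural networks as a lightweight, compressible replacement for fully connected layers.
Implement the fast multiplication algorithm within 5 lines of Python, with no transform-specific tuning, so that inference is efficient on any hardware.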
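The staged multiplication described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names (`bit_reverse_indices`, `butterfly_multiply`, `dft_butterfly_stages`) are hypothetical, each stage stores one untied 2x2 block per index pair, and the permutation is fixed to bit reversal rather than learned. Filling the blocks with the standard radix-2 twiddle factors reproduces the DFT, which the last lines check against a naive $O(N^2)$ sum.

```python
import cmath

def bit_reverse_indices(n):
    """Indices 0..n-1 with their log2(n)-bit binary representation reversed."""
    bits = n.bit_length() - 1
    return [int(format(i, "b").zfill(bits)[::-1], 2) for i in range(n)]

def butterfly_multiply(stages, x):
    """Apply a product of butterfly factors to x in O(N log N) operations.

    Each stage is (block_size, blocks); blocks holds one (a, b, c, d) tuple per
    index pair (lo, hi = lo + block_size // 2), i.e. N/2 small 2x2 matrices.
    Assumption: blocks are untied, unlike some parameterizations in the paper.
    """
    n = len(x)
    y = list(x)
    for block_size, blocks in stages:
        half = block_size // 2
        out = [0] * n
        idx = 0
        for start in range(0, n, block_size):
            for j in range(half):
                a, b, c, d = blocks[idx]
                idx += 1
                lo, hi = start + j, start + j + half
                out[lo] = a * y[lo] + b * y[hi]
                out[hi] = c * y[lo] + d * y[hi]
        y = out
    return y

def dft_butterfly_stages(n):
    """Butterfly blocks [[1, w], [1, -w]] of the radix-2 FFT (n a power of two)."""
    stages = []
    m = 2
    while m <= n:
        blocks = []
        for _start in range(0, n, m):
            for j in range(m // 2):
                w = cmath.exp(-2j * cmath.pi * j / m)
                blocks.append((1, w, 1, -w))
        stages.append((m, blocks))
        m *= 2
    return stages

def naive_dft(x):
    """Reference O(N^2) DFT used only to check the fast product."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

n = 16
x = [complex(i % 5, -i) for i in range(n)]
stages = dft_butterfly_stages(n)
permuted = [x[p] for p in bit_reverse_indices(n)]
fast = butterfly_multiply(stages, permuted)
max_err = max(abs(u - v) for u, v in zip(fast, naive_dft(x)))
```

In a learned setting the tuples in `blocks` would be trainable parameters rather than fixed twiddle factors; the multiplication routine itself would not need to change.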

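Experimental Results

Research Questions

RQ1 Can a generic, differentiable parameterization automatically learn fast $O(N\log N)$ algorithms for diverse structured transforms such as the DFT and DCT?
RQ2 To what extent can a structured, learnable matrix factorization outperform unconstrained dense layers in accuracy, parameter efficiency, and inference speed?
RQ3 Can the same butterfly parameterization recover known fast algorithms (such as the Cooley-Tukey FFT) to machine precision at a practical size of $N=1024$?
RQ4 How does the performance of learned butterfly transforms compare with highly optimized, hand-tuned kernels such as cuFFT and cuDNN?
RQ5 Can butterfly layers improve standard deep learning architectures such as ResNet18 without a significant increase in parameter count?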

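Key Findings

The method recovers the $O(N\log N)$ Cooley-Tukey FFT algorithm to machine precision for $N$ up to 1024, outperforming sparse and low-rank baselines.
On CIFAR-10, a single-hidden-layer network with the butterfly parameterization reaches 93.89% accuracy, 3.9 percentage points above an unconstrained fully connected layer, with 40x fewer parameters.
On CIFAR-10, a butterfly layer raises ResNet18 accuracy by 0.43 percentage points while adding only 0.07% more model parameters.
Inference with learned butterfly matrices is up to two orders of magnitude faster than dense GEMV, and on CPU it is only 3–5x slower than specialized FFT/DCT kernels.
On GPU, training with butterfly layers is 15% faster than dense GEMM and only 40% slower than the FFT, which indicates competitive training efficiency.
The method generalizes to learning convolutions and other structured transforms, and the fast multiplication algorithm needs no transform-specific optimization.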
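As background for the first finding, the radix-2 Cooley-Tukey recursion can be written as a sparse factorization. This is the standard textbook identity, stated here under the assumption of an unnormalized DFT matrix with entries $(F_N)_{jk} = \omega_N^{jk}$ and $\omega_N = e^{-2\pi i/N}$. The symbols $\Omega_{N/2}$ and $P_N$ are notation introduced for this illustration and are not taken from the paper.

```latex
F_N =
\begin{bmatrix} I_{N/2} & \Omega_{N/2} \\ I_{N/2} & -\Omega_{N/2} \end{bmatrix}
\begin{bmatrix} F_{N/2} & 0 \\ 0 & F_{N/2} \end{bmatrix}
P_N,
\qquad
\Omega_{N/2} = \operatorname{diag}\bigl(1, \omega_N, \dots, \omega_N^{N/2-1}\bigr)
```

Here $P_N$ is the permutation that places the even-indexed entries of the input in the first half and the odd-indexed entries in the second half. Unrolling the recursion yields $\log_2 N$ butterfly factors, each with $2N$ nonzero entries, applied after a bit-reversal permutation. That is $2N\log_2 N$ nonzeros in total, compared with $N^2$ entries for the dense matrix.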

This review was generated by AI and reviewed by a human editor.