QUICK REVIEW

[Paper Review] Is Attention Better Than Matrix Decomposition?

Zhengyang Geng, Meng-Hao Guo|arXiv (Cornell University)|Jan 1, 2021

Domain Adaptation and Few-Shot Learning|References: 73|Cited by: 52

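One-Sentence Summary

The paper shows that Hamburger, a global context module built on matrix decomposition (MD), matches or outperforms self-attention on vision tasks at lower computational and memory cost, and it proposes a one-step gradient for training through the iterative MD loop.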

ABSTRACT

As an essential ingredient of modern deep learning, attention mechanism, especially self-attention, plays a vital role in the global correlation discovery. However, is hand-crafted attention irreplaceable when modeling the global context? Our intriguing finding is that self-attention is not better than the matrix decomposition (MD) model developed 20 years ago regarding the performance and computational cost for encoding the long-distance dependencies. We model the global context issue as a low-rank recovery problem and show that its optimization algorithms can help design global information blocks. This paper then proposes a series of Hamburgers, in which we employ the optimization algorithms for solving MDs to factorize the input representations into sub-matrices and reconstruct a low-rank embedding. Hamburgers with different MDs can perform favorably against the popular global context module self-attention when carefully coping with gradients back-propagated through MDs. Comprehensive experiments are conducted in the vision tasks where it is crucial to learn the global context, including semantic segmentation and image generation, demonstrating significant improvements over self-attention and its variants.

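Motivation and Objectives

Re-examine whether hand-crafted attention is indispensable for modeling global context in vision and natural language tasks.
Formulate global context modeling as a low-rank recovery problem and solve it with matrix decomposition, yielding a white-box module.
Develop Hamburger, a lightweight global context block built from MD solvers (VQ, CD, NMF) with efficient back-propagation.
Demonstrate Hamburger's effectiveness on semantic segmentation and image generation, benchmarking it against self-attention modules.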
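
The low-rank recovery view in the objectives above can be written compactly. This is a sketch under assumed notation: X is the unfolded d×n feature matrix, D a d×r dictionary, C an r×n code matrix, E the residual noise, and L, R_1, R_2 a generic reconstruction loss and regularizers. The choice of L and the regularizers is what distinguishes one MD (VQ, CD, NMF) from another.

```latex
X = \bar{X} + E = DC + E,
\qquad D \in \mathbb{R}^{d \times r},\;
C \in \mathbb{R}^{r \times n},\;
r \ll \min(d, n)

\min_{D,\,C} \; \mathcal{L}(X, DC) + \mathcal{R}_1(D) + \mathcal{R}_2(C)
```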

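Proposed Method

Model global context as low-rank recovery of the unfolded input representation, solved by matrix decomposition to produce a clean low-rank embedding.
Introduce Hamburger, which applies a linear transformation (the lower bread), an MD-based ham block that recovers the low-rank subspace, and a second linear transformation (the upper bread) to produce the output.
Instantiate the ham block with differentiable variants of vector quantization (VQ), concept decomposition (CD), and non-negative matrix factorization (NMF).
Back-propagate through the iterative MD solver with a one-step gradient rather than full back-propagation through time (BPTT), which mitigates gradient instability.
Keep Hamburger at O(n) complexity in the number of positions n by avoiding the large n×n attention matrix, which reduces memory use relative to conventional self-attention.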
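
The structure described above (linear map, MD-based ham, linear map) can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names `matmul`, `nmf_ham`, and `hamburger` are hypothetical, the ham uses the standard multiplicative-update rule for NMF, and the ReLU before the ham, the residual connection, and the absence of any normalization layer are simplifying assumptions.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def nmf_ham(x, rank, steps=6, eps=1e-9, seed=0):
    """Approximate a non-negative d x n matrix x by D @ C (D: d x r, C: r x n)."""
    rng = random.Random(seed)
    d, n = len(x), len(x[0])
    dic = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(d)]
    code = [[rng.random() + 0.1 for _ in range(n)] for _ in range(rank)]
    for _ in range(steps):
        # C <- C * (D^T X) / (D^T D C); the largest product here is r x n
        dt = transpose(dic)
        num, den = matmul(dt, x), matmul(matmul(dt, dic), code)
        code = [[c * p / (q + eps) for c, p, q in zip(cr, nr, qr)]
                for cr, nr, qr in zip(code, num, den)]
        # D <- D * (X C^T) / (D C C^T); the largest product here is d x r
        ct = transpose(code)
        num, den = matmul(x, ct), matmul(dic, matmul(code, ct))
        dic = [[v * p / (q + eps) for v, p, q in zip(vr, nr, qr)]
               for vr, nr, qr in zip(dic, num, den)]
    return matmul(dic, code)

def hamburger(x, w_lower, w_upper, rank):
    """Lower bread -> NMF ham -> upper bread, plus a residual (assumed here)."""
    z = matmul(w_lower, x)
    z = [[max(v, 0.0) for v in row] for row in z]  # ReLU keeps the NMF input non-negative (assumption)
    low_rank = nmf_ham(z, rank)
    out = matmul(w_upper, low_rank)
    return [[a + b for a, b in zip(xr, orow)] for xr, orow in zip(x, out)]

features = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]  # toy 2 x 3 feature matrix of rank 1
identity = [[1.0, 0.0], [0.0, 1.0]]
output = hamburger(features, identity, identity, rank=1)
```

Every intermediate in this sketch is d×r, r×r, or r×n, so for fixed d and r the memory grows linearly with the number of positions n, and no n×n matrix is ever formed.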

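Experimental Results

Research Questions

RQ1: Is hand-crafted attention (self-attention) necessary for modeling global context, or can global context modules based on matrix decomposition be competitive?
RQ2: Can Hamburger match or exceed self-attention on segmentation and generation tasks while reducing computational and memory cost?
RQ3: Which training strategies (such as the one-step gradient) are effective when differentiating through an iterative matrix decomposition inside a neural network?
RQ4: How do different MD choices (VQ, CD, NMF) affect the performance, efficiency, and interpretability of global context modeling?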

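Key Findings

Hamburger, built on matrix decomposition, matches or is competitive with state-of-the-art self-attention-based methods on semantic segmentation and image generation.
On the PASCAL VOC test set, the Hamburger-based HamNet reaches 85.9% mIoU, surpassing several of the listed attention-based models.
On the PASCAL Context validation set, HamNet reaches 55.2% mIoU, outperforming a range of attention modules.
For ImageNet 128x128 image generation, HamGAN variants using NMF with the one-step gradient improve FID markedly over SAGAN (for example, HamGAN-strong reaches FID 14.77 and HamGAN-baby 16.05).
The one-step gradient stabilizes training and back-propagates effectively through the MD loop, avoiding the instability of full BPTT.
Hamburger needs less memory and computation than conventional self-attention: its cost is O(ndr), and it never forms a large n×n attention matrix.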
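
The one-step gradient finding above can be illustrated on a scalar toy problem. The sketch below is an assumption-laden stand-in for an MD solver: it iterates h <- tanh(w*h + x), a contraction chosen only because its derivatives are easy to write by hand, and compares three quantities: the gradient obtained by the chain rule through every step (BPTT), the one-step gradient that treats the previous iterate as a constant, and the implicit-function-theorem gradient at the fixed point. The function name `run_fixed_point` and all constants are hypothetical.

```python
import math

def run_fixed_point(x, w=0.5, steps=50):
    """Iterate h <- tanh(w*h + x) from h = 0 and track dh/dx three ways."""
    h, bptt_grad, local = 0.0, 0.0, 1.0
    for _ in range(steps):
        h = math.tanh(w * h + x)
        local = 1.0 - h * h                        # tanh'(w*h_prev + x) at this step
        bptt_grad = local * (w * bptt_grad + 1.0)  # chain rule through all earlier steps
    one_step_grad = local                          # previous iterate treated as a constant
    implicit_grad = local / (1.0 - w * local)      # (1 - df/dh)^-1 * df/dx at the fixed point
    return h, bptt_grad, one_step_grad, implicit_grad

h_star, bptt, one_step, implicit = run_fixed_point(0.3)
```

With enough iterations the BPTT gradient converges to the implicit gradient, while the one-step gradient keeps only the explicit dependence on x, which corresponds to replacing the inverse factor by 1. In an autodiff framework the same idea is typically realized by running all but the last solver iteration without gradient tracking; that detail is framework-specific and is not shown here.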

This review was generated by AI and reviewed by a human editor.