QUICK REVIEW

[Paper Review] Convolution, attention and structure embedding

Jean‐Marc Andreoli|arXiv (Cornell University)|May 3, 2019

Stochastic Gradient Optimization Techniques | References: 20 | Cited by: 19

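One-sentence summary

This paper proposes a unified mathematical framework in which convolution, attention, and structure embedding are special cases of a single operator built from tensor operations and the mixed product. It shows that attention is equivalent to an adaptive, learnable convolution, and proposes that positional encoding in the Transformer can be replaced by explicit, learnable shift matrices that model sequence order more interpretably.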

ABSTRACT

Deep neural networks are composed of layers of parametrised linear operations intertwined with non linear activations. In basic models, such as the multi-layer perceptron, a linear layer operates on a simple input vector embedding of the instance being processed, and produces an output vector embedding by straight multiplication by a matrix parameter. In more complex models, the input and output are structured and their embeddings are higher order tensors. The parameter of each linear operation must then be controlled so as not to explode with the complexity of the structures involved. This is essentially the role of convolution models, which exist in many flavours dependent on the type of structure they deal with (grids, networks, time series etc.). We present here a unified framework which aims at capturing the essence of these diverse models, allowing a systematic analysis of their properties and their mutual enrichment. We also show that attention models naturally fit in the same framework: attention is convolution in which the structure itself is adaptive, and learnt, instead of being given a priori.

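Research motivation and goals

Unify different deep learning operations (convolution, attention, and structure embedding) under a single mathematical framework.
Formalize structural dependencies in neural networks through weighted graphs and tensorized representations.
Analyze parameter efficiency in attention and convolution through low-rank decomposition.
Investigate whether positional encoding in the Transformer can be replaced by learnable, index-based basis matrices to improve interpretability and performance.
Show that attention is essentially a form of convolution whose structure is learned rather than predefined.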

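Proposed method

Propose a general convolution operator over arbitrary structures, formalized as families of weighted graphs.
Introduce the mixed product $\boldsymbol{a} \circ \boldsymbol{b} = \sum_k \boldsymbol{a}_k \otimes \boldsymbol{b}_k$ to decompose higher-order tensors and impose low-rank constraints.
Use tensor unfolding and matricization to map higher-order operations to matrix form for analysis.
Apply the inversion property: if $\boldsymbol{a}$ forms a basis of the space of tensors of shape $S$, then any tensor $\boldsymbol{\Phi}$ of shape $ST$ can be written uniquely as $\boldsymbol{\Phi} = \boldsymbol{a} \circ \boldsymbol{\Theta}$.
Reinterpret self-attention and cross-attention in the Transformer as bilinear forms with a shared parameterization, reducing the parameter count through decomposition.
Replace positional encoding with learnable shift matrices (the convolution basis of a 1-D grid) to model token order directly, avoiding complex learned embeddings.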
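The mixed product and the inversion property listed above can be illustrated with a minimal NumPy sketch. It assumes a family of tensors is stored as one array whose leading axis indexes $k$. The function names `mixed_product` and `invert_mixed_product` are hypothetical, chosen for illustration, and are not taken from the paper.

```python
import numpy as np

def mixed_product(a, b):
    """Mixed product: sum_k a[k] (outer) b[k].

    a has shape (K, *S), b has shape (K, *T); the result has shape (*S, *T).
    """
    return np.tensordot(a, b, axes=(0, 0))

def invert_mixed_product(a, phi):
    """Recover theta such that mixed_product(a, theta) == phi.

    Assumes a is a basis of the space of shape-S tensors, i.e. K == prod(S)
    and the flattened a[k] are linearly independent.
    """
    K, S = a.shape[0], a.shape[1:]
    n_s = int(np.prod(S))
    A = a.reshape(K, n_s)                # row k is the flattened basis tensor a[k]
    Phi = phi.reshape(n_s, -1)           # matricization: (|S|, |T|)
    Theta = np.linalg.solve(A.T, Phi)    # Phi = A^T @ Theta has a unique solution
    return Theta.reshape((K,) + phi.shape[len(S):])

rng = np.random.default_rng(0)
a = rng.normal(size=(6, 2, 3))      # 6 random tensors of shape S = (2, 3): a basis almost surely
phi = rng.normal(size=(2, 3, 4))    # arbitrary tensor of shape ST, with T = (4,)
theta = invert_mixed_product(a, phi)
assert np.allclose(mixed_product(a, theta), phi)
```

When `a` has fewer than `prod(S)` elements, `mixed_product(a, theta)` can only reach a subspace of the shape-$ST$ tensors, which is how a restricted family constrains the parameter and keeps its size under control.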

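Experimental results

Research questions

RQ1: Can convolution, attention, and structure embedding be formally unified under a single tensor-based framework?
RQ2: How does the mixed product enable low-rank approximation and parameter efficiency in neural network layers?
RQ3: To what extent can attention be viewed as an adaptive convolution in which the structure is learned?
RQ4: What functional role does positional encoding play in the Transformer, and can it be replaced by explicit, learnable basis matrices?
RQ5: Do trained Transformer attention heads naturally learn to mimic shift matrices? If so, does this suggest a more direct alternative?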

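Key findings

The paper establishes that attention is equivalent to a convolution whose structure is adaptive and learned rather than fixed in advance.
The Transformer's scaled dot-product attention is formally equivalent to a bilinear attention mechanism whose parameter matrices $\boldsymbol{\Lambda}_k$ take a specific low-rank factorized form.
The Transformer's final output layer, which combines the attention heads, is mathematically equivalent to a weighted sum of heads with a shared linear projection, constrained by the same decomposition principle.
The framework shows that if $\boldsymbol{a}$ forms a basis of the space of tensors of shape $S$, then any tensor $\boldsymbol{\Phi}$ of shape $ST$ can be uniquely decomposed as $\boldsymbol{a} \circ \boldsymbol{\Theta}$.
Empirical evidence indicates that trained Transformer attention heads often behave like shift matrices, supporting the view that such operations can be modeled directly with learnable basis matrices.
Replacing positional encoding with explicit, learnable shift matrices offers a more interpretable and potentially more efficient way to model sequence order than conventional learned embeddings.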
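Two of the findings above can be checked numerically with a small NumPy sketch. The first part uses the standard identity that scaled dot-product scores equal a bilinear form with $\boldsymbol{\Lambda} = W_q W_k^\top / \sqrt{d}$; it is assumed here, not confirmed by this summary, that this is the factorization the paper refers to. The second part treats a 1-D convolution as a sum over shift matrices. All names (`W_q`, `W_k`, `Lam`, `shift_matrix`, `offsets`) and the zero-padded boundary are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_head = 5, 8, 2
x = rng.normal(size=(n, d_model))            # n token embeddings

# Part 1: scaled dot-product scores as a low-rank bilinear form.
W_q = rng.normal(size=(d_model, d_head))     # query projection of one head
W_k = rng.normal(size=(d_model, d_head))     # key projection of one head
scores_qk = (x @ W_q) @ (x @ W_k).T / np.sqrt(d_head)
Lam = W_q @ W_k.T / np.sqrt(d_head)          # d_model x d_model, rank <= d_head
scores_bilinear = x @ Lam @ x.T
assert np.allclose(scores_qk, scores_bilinear)
rank = np.linalg.matrix_rank(Lam)

# Part 2: a 1-D convolution written as a sum over shift matrices.
def shift_matrix(n, k):
    """n x n matrix with ones on the k-th diagonal: (S_k @ x)[i] == x[i + k]."""
    return np.eye(n, k=k)

offsets = (-1, 0, 1)                         # kernel of width 3
theta = rng.normal(size=(len(offsets), d_model, d_model))
y_shift = sum(shift_matrix(n, k) @ x @ theta[j] for j, k in enumerate(offsets))

# Reference: direct sliding-window computation with zero padding at the edges.
y_direct = np.zeros((n, d_model))
for i in range(n):
    for j, k in enumerate(offsets):
        if 0 <= i + k < n:
            y_direct[i] += x[i + k] @ theta[j]
assert np.allclose(y_shift, y_direct)
```

In this reading, an attention head whose score matrix concentrates on one diagonal plays the role of a single `shift_matrix(n, k)`, which is what makes an explicit shift basis a plausible substitute for positional encoding.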


This review was generated by AI and reviewed by a human editor.