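[Paper Walkthrough] ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases
Introduces gated positional self-attention (GPSA) to softly inject a convolutional inductive bias into Vision Transformers, improving sample efficiency and ImageNet performance over DeiT without pre-training on external data.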
Convolutional architectures have proven extremely successful for vision tasks. Their hard inductive biases enable sample-efficient learning, but come at the cost of a potentially lower performance ceiling. Vision Transformers (ViTs) rely on more flexible self-attention layers, and have recently outperformed CNNs for image classification. However, they require costly pre-training on large external datasets or distillation from pre-trained convolutional networks. In this paper, we ask the following question: is it possible to combine the strengths of these two architectures while avoiding their respective limitations? To this end, we introduce gated positional self-attention (GPSA), a form of positional self-attention which can be equipped with a "soft" convolutional inductive bias. We initialise the GPSA layers to mimic the locality of convolutional layers, then give each attention head the freedom to escape locality by adjusting a gating parameter regulating the attention paid to position versus content information. The resulting convolutional-like ViT architecture, ConViT, outperforms the DeiT on ImageNet, while offering a much improved sample efficiency. We further investigate the role of locality in learning by first quantifying how it is encouraged in vanilla self-attention layers, then analysing how it is escaped in GPSA layers. We conclude by presenting various ablations to better understand the success of the ConViT. Our code and models are released publicly at https://github.com/facebookresearch/convit.
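Motivation and Goals
- Bridge CNNs and ViTs by introducing a soft convolutional inductive bias.
- Develop GPSA layers that can be initialised as convolutions and then gradually rely more on content.
- Show that ConViT improves on DeiT in both accuracy and sample efficiency without extra data.
- Analyse how locality is acquired and escaped during training in GPSA layers compared with vanilla self-attention.
- Provide ablations to understand the roles of initialisation, gating, and GPSA placement.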
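Proposed Method
- Define gated positional self-attention (GPSA), which combines a content term and a positional term through a learnable gate λ_h.
- Initialise GPSA to mimic a convolutional kernel (convolutional initialisation) while keeping the relative positional encodings fixed.
- Introduce a gating mechanism that mixes position-based and content-based attention after the softmax (Eq. 7).
- Build ConViT by replacing a subset of the ViT self-attention layers with GPSA layers in a DeiT-based architecture.
- Analyse locality dynamics with a non-locality metric and inspect the gating parameters per layer (and per head).
- Release open-source code and pre-trained models for reproducibility.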
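The gating step in the list above can be sketched in a few lines of plain Python. This is a minimal illustration, not the released implementation: it assumes the positional score of a head takes the form -alpha * ||delta - center||^2 (so the head starts out looking at one fixed offset, like one tap of a convolutional kernel), and the names `conv_init_pos_scores`, `center`, `alpha`, and `gate_lambda` are hypothetical. Content scores are set to zero purely to keep the demo small.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def grid_positions(side):
    """(row, col) coordinates of the patches in a side x side grid."""
    return [(r, c) for r in range(side) for c in range(side)]


def conv_init_pos_scores(positions, center, alpha):
    """Positional scores -alpha * ||delta - center||^2 for each (query, key) pair.

    delta is the key position minus the query position. `center` is the offset
    the head initially attends to and `alpha` is the locality strength
    (both are illustrative names assumed for this sketch).
    """
    scores = []
    for qr, qc in positions:
        row = []
        for kr, kc in positions:
            dr, dc = kr - qr - center[0], kc - qc - center[1]
            row.append(-alpha * (dr * dr + dc * dc))
        scores.append(row)
    return scores


def gpsa_attention(content_scores, pos_scores, gate_lambda):
    """Mix content and positional attention after each has been softmaxed.

    A_ij = (1 - sigmoid(lambda)) * softmax(content)_ij
           + sigmoid(lambda) * softmax(position)_ij
    """
    g = sigmoid(gate_lambda)
    attn = []
    for c_row, p_row in zip(content_scores, pos_scores):
        c_attn, p_attn = softmax(c_row), softmax(p_row)
        attn.append([(1.0 - g) * c + g * p for c, p in zip(c_attn, p_attn)])
    return attn


positions = grid_positions(3)
n = len(positions)
pos_scores = conv_init_pos_scores(positions, center=(0, 0), alpha=2.0)
content_scores = [[0.0] * n for _ in range(n)]  # uniform content attention for the demo

# A large positive gate keeps the head local; a negative gate lets content dominate.
local = gpsa_attention(content_scores, pos_scores, gate_lambda=4.0)
free = gpsa_attention(content_scores, pos_scores, gate_lambda=-4.0)
```

Because the two attention maps are normalised separately before mixing, every row of the result still sums to one regardless of the gate value, and a single scalar per head is enough to move between convolution-like and content-driven behaviour.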
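Experimental Results
Research Questions
- RQ1: Can a soft, learnable convolutional inductive bias be integrated into a Vision Transformer without locking the model into a fixed CNN-style pattern?
- RQ2: How does locality emerge in vanilla self-attention, and how do GPSA layers escape locality during training?
- RQ3: Does the GPSA-based ConViT offer better sample efficiency and competitive accuracy compared with DeiT in data-limited regimes?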
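RQ2 depends on being able to measure how far a head attends. One way to sketch such a non-locality measure, assuming it is the attention-weighted distance between query and key patches averaged over queries (the paper's exact normalisation may differ), is shown below. The function name `nonlocality` is hypothetical.

```python
import math


def nonlocality(attn, positions):
    """Average attention-weighted distance between query and key patches.

    attn[i][j] is the attention that query patch i pays to key patch j
    (each row sums to 1); positions[i] is the (row, col) of patch i.
    """
    total = 0.0
    for i, row in enumerate(attn):
        for j, weight in enumerate(row):
            total += weight * math.dist(positions[i], positions[j])
    return total / len(attn)


positions = [(r, c) for r in range(3) for c in range(3)]
n = len(positions)

# A head that only looks at its own patch versus one that looks everywhere equally.
identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
uniform = [[1.0 / n] * n for _ in range(n)]

d_local = nonlocality(identity, positions)
d_uniform = nonlocality(uniform, positions)
```

A purely local head scores zero, while uniform attention on this 3 x 3 grid scores roughly 1.45 patch widths, so tracking the value per head over training shows whether a head is staying local or escaping locality.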
Key Findings
| Model | N_h | D_emb | Params | FLOPs | Speed (images/s) | Top-1 (%) | Top-5 (%) |
|---|---|---|---|---|---|---|---|
| DeiT Ti | 3 | 192 | 6M | 1G | 1442 | 72.2 | - |
| ConViT Ti | 4 | 192 | 6M | 1G | 734 | 73.1 | 91.7 |
| DeiT Ti+ | 4 | 256 | 10M | 2G | 1036 | 75.9 | 93.2 |
| ConViT Ti+ | 4 | 256 | 10M | 2G | 625 | 76.7 | 93.6 |
| DeiT S | 6 | 384 | 22M | 4.3G | 587 | 79.8 | - |
| ConViT S | 9 | 432 | 27M | 5.4G | 305 | 81.3 | 95.7 |
| DeiT S+ | 9 | 576 | 48M | 10G | 480 | 79.0 | 94.4 |
| ConViT S+ | 9 | 576 | 48M | 10G | 382 | 82.2 | 95.9 |
| DeiT B | 12 | 768 | 86M | 17G | 187 | 81.8 | - |
| ConViT B | 16 | 768 | 86M | 17G | 141 | 82.4 | 95.9 |
| DeiT B+ | 16 | 1024 | 152M | 30G | 114 | 77.5 | 93.5 |
| ConViT B+ | 16 | 1024 | 152M | 30G | 96 | 82.5 | 95.9 |
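- At matched size and compute, ConViT outperforms DeiT on ImageNet Top-1 in every configuration in the table, and on Top-5 wherever both values are reported.
- ConViT-S+ reaches 82.2% top-1, versus 79.0% for DeiT-S+ at the same parameter count and FLOPs, although its throughput is lower (382 versus 480 in the speed column).
- GPSA layers provide a soft, controllable convolutional bias that improves early training dynamics and sample efficiency.
- The gating parameters show more position-reliant heads in the early layers, which then shift towards content information.
- Ablations indicate that convolutional initialisation and gating both contribute to the gains, especially in low-data regimes.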
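The accuracy and throughput comparisons above can be checked directly against the table. The short script below copies the Top-1 and speed columns row by row and computes the per-size gain and the ConViT/DeiT speed ratio. The size labels (Ti, Ti+, S, S+, B, B+) are read off the table in order, which is an assumption where the table does not spell them out, and the S pair is not exactly size-matched (22M versus 27M parameters).

```python
# (size, DeiT top-1, ConViT top-1, DeiT speed, ConViT speed), copied from the table
rows = [
    ("Ti", 72.2, 73.1, 1442, 734),
    ("Ti+", 75.9, 76.7, 1036, 625),
    ("S", 79.8, 81.3, 587, 305),
    ("S+", 79.0, 82.2, 480, 382),
    ("B", 81.8, 82.4, 187, 141),
    ("B+", 77.5, 82.5, 114, 96),
]


def compare(rows):
    """Return {size: (top-1 gain in points, ConViT/DeiT speed ratio)}."""
    out = {}
    for name, d_acc, c_acc, d_speed, c_speed in rows:
        out[name] = (round(c_acc - d_acc, 1), round(c_speed / d_speed, 2))
    return out


summary = compare(rows)
for name, (gain, ratio) in summary.items():
    print(f"{name:>3}: top-1 gain {gain:+.1f} pts, ConViT/DeiT speed ratio {ratio:.2f}")
```

On these numbers the gain is positive at every size and largest for the "+" variants, while the speed ratio stays below one throughout, so the accuracy improvement comes with a throughput cost.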
This walkthrough was generated by AI and reviewed by human editors.