QUICK REVIEW

[Paper Review] Convolutional Bypasses Are Better Vision Transformer Adapters

Shibo Jie, Zhihong Deng|arXiv (Cornell University)|Jul 14, 2022

Domain Adaptation and Few-Shot Learning | Cited by 62

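One-Sentence Summary

Convpass inserts lightweight, trainable convolutional bypasses into ViT. With very few trainable parameters, it outperforms language-oriented PETL methods on VTAB-1K and few-shot tasks, and it also shows strong domain generalization.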

ABSTRACT

The pretrain-then-finetune paradigm has been widely adopted in computer vision. But as the size of Vision Transformer (ViT) grows exponentially, the full finetuning becomes prohibitive in view of the heavier storage overhead. Motivated by parameter-efficient transfer learning (PETL) on language transformers, recent studies attempt to insert lightweight adaptation modules (e.g., adapter layers or prompt tokens) to pretrained ViT and only finetune these modules while the pretrained weights are frozen. However, these modules were originally proposed to finetune language models and did not take into account the prior knowledge specifically for visual tasks. In this paper, we propose to construct Convolutional Bypasses (Convpass) in ViT as adaptation modules, introducing only a small amount (less than 0.5% of model parameters) of trainable parameters to adapt the large ViT. Different from other PETL methods, Convpass benefits from the hard-coded inductive bias of convolutional layers and thus is more suitable for visual tasks, especially in the low-data regime. Experimental results on VTAB-1K benchmark and few-shot learning datasets show that Convpass outperforms current language-oriented adaptation modules, demonstrating the necessity to tailor vision-oriented adaptation modules for adapting vision models.

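Motivation and Objectives

Highlight the mismatch between language-oriented PETL modules and the visual inductive bias that vision tasks call for when adapting ViT.
Propose Convpass as a vision-oriented PETL module that adds convolutional inductive bias while keeping the pretrained weights intact.
Demonstrate the effectiveness of Convpass on VTAB-1K, few-shot learning, and domain generalization settings.
Show that Convpass can surpass existing PETL methods while training fewer parameters.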

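Proposed Method

Insert Convpass into ViT blocks as a parallel convolutional bottleneck block that reconstructs the 2D spatial structure of the tokens.
Use a three-layer Convpass: a 1x1 channel reduction, a 3x3 spatial convolution, and a 1x1 channel expansion.
Restore the 2D structure by arranging the patch tokens on a 2D grid and treating the [cls] token as an image of its own.
Train only the Convpass modules and the classification head, keeping the pretrained ViT weights frozen.
Analyze Convpass through the unraveled view of ViT, which shows the parallel trainable paths formed by Convpass alongside the MHSA/MLP blocks.
Compare the vision-oriented Convpass with language-oriented PETL modules (VPT, Adapter, AdaptFormer, LoRA, NOAH).
Evaluate on VTAB-1K with a ViT-B/16 pretrained on ImageNet-21K, plus additional CLIP-based domain generalization experiments.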
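
The bypass described in the list above can be sketched roughly as follows. This is a minimal PyTorch illustration, not the authors' implementation. The class names `ConvBypass` and `ParallelBypass`, the hidden width of 8, the GELU activations, the scale factor, and the stand-in frozen sub-layer are all assumptions made for this example. The sketch assumes a square patch grid with the [cls] token in the first position.

```python
import math

import torch
import torch.nn as nn


class ConvBypass(nn.Module):
    """Bottleneck branch: 1x1 reduce -> 3x3 spatial conv -> 1x1 expand.

    Hidden width and activation are assumptions for this sketch.
    """

    def __init__(self, dim: int = 768, hidden: int = 8):
        super().__init__()
        self.down = nn.Conv2d(dim, hidden, kernel_size=1)
        self.spatial = nn.Conv2d(hidden, hidden, kernel_size=3, padding=1)
        self.up = nn.Conv2d(hidden, dim, kernel_size=1)
        self.act = nn.GELU()

    def _conv(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, dim, H, W)
        return self.up(self.act(self.spatial(self.act(self.down(x)))))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, 1 + H*W, dim), with the [cls] token first
        b, n, d = tokens.shape
        side = math.isqrt(n - 1)
        assert side * side == n - 1, "expects a square patch grid"
        cls_tok, patches = tokens[:, :1], tokens[:, 1:]
        # Patch tokens are laid back out on their 2D grid.
        grid = patches.transpose(1, 2).reshape(b, d, side, side)
        patches_out = self._conv(grid).flatten(2).transpose(1, 2)
        # The [cls] token is treated as a separate 1x1 image.
        cls_img = cls_tok.transpose(1, 2).reshape(b, d, 1, 1)
        cls_out = self._conv(cls_img).flatten(2).transpose(1, 2)
        return torch.cat([cls_out, patches_out], dim=1)


class ParallelBypass(nn.Module):
    """Adds a trainable bypass next to a frozen sub-layer (hypothetical wrapper)."""

    def __init__(self, norm: nn.Module, sublayer: nn.Module, dim: int,
                 hidden: int = 8, scale: float = 1.0):
        super().__init__()
        self.norm, self.sublayer = norm, sublayer
        # Pretrained weights stay frozen; only the bypass is trained.
        for p in list(norm.parameters()) + list(sublayer.parameters()):
            p.requires_grad = False
        self.bypass = ConvBypass(dim, hidden)
        self.scale = scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        return x + self.sublayer(h) + self.scale * self.bypass(h)


# Stand-in for a pretrained MHSA or MLP sub-layer (assumption for the demo).
block = ParallelBypass(nn.LayerNorm(768), nn.Linear(768, 768), dim=768)
tokens = torch.randn(2, 1 + 14 * 14, 768)  # 14x14 patches plus [cls]
out = block(tokens)
trainable = sum(p.numel() for p in block.parameters() if p.requires_grad)
print(out.shape, trainable)
```

In a real model the wrapped sub-layer would be a pretrained attention or MLP block, and the classification head would be trained alongside the bypass modules.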

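Experimental Results

Research Questions

RQ1: When finetuning ViT on vision tasks, can vision-oriented adaptation modules outperform language-oriented PETL modules?
RQ2: Does the convolutional inductive bias introduced by Convpass improve data efficiency, particularly in low-data regimes (few-shot learning and the VTAB-1K subsets)?
RQ3: Compared with baseline PETL methods, how does Convpass affect domain generalization, including for vision-language models such as CLIP?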

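Key Findings

Convpass_attn (Convpass inserted alongside MHSA) and Convpass (in parallel with both MHSA and MLP) perform strongly on VTAB-1K, with Convpass obtaining the best average result among PETL methods.
Convpass_attn reaches state-of-the-art results on 12 of the 19 VTAB-1K tasks, and the full Convpass achieves the best average performance, 1.1 percentage points above the previous SOTA (NOAH) on VTAB-1K.
Convpass adds roughly 0.33 million trainable parameters to ViT-B/16 (86M backbone), far fewer than full finetuning, yet reaches higher accuracy.
Convpass delivers strong few-shot gains on five fine-grained datasets, outperforming the baselines in most shot settings and indicating improved data efficiency.
In the CLIP domain generalization experiments, Convpass_CLIP outperforms several CLIP-specific PETL baselines on the source domain and on most target domains, indicating robustness to domain shift.
Compared with backbones that have built-in visual inductive bias (Swin, ConvNeXt), ViT with Convpass can in some cases surpass full finetuning of those backbones, suggesting that Convpass effectively compensates for ViT's lack of visual inductive bias.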
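
The reported parameter budget can be sanity-checked with simple arithmetic. The sketch below assumes a hidden width of 8, biases on every convolution, 12 transformer blocks, and two Convpass modules per block (one beside MHSA, one beside MLP). None of these values are stated in this summary, so they are illustrative assumptions. The classification head is not counted.

```python
def conv_params(c_in: int, c_out: int, k: int, bias: bool = True) -> int:
    """Parameter count of a k x k convolution from c_in to c_out channels."""
    return c_in * c_out * k * k + (c_out if bias else 0)


def convpass_params(dim: int = 768, hidden: int = 8) -> int:
    """1x1 reduce + 3x3 spatial + 1x1 expand (hidden width is an assumption)."""
    return (
        conv_params(dim, hidden, 1)
        + conv_params(hidden, hidden, 3)
        + conv_params(hidden, dim, 1)
    )


per_module = convpass_params()   # one bypass
total = per_module * 12 * 2      # assumed: 12 blocks, 2 bypasses per block
backbone = 86_000_000            # ViT-B/16 size as quoted in the summary

print(per_module)                          # 13648
print(total)                               # 327552
print(f"{100 * total / backbone:.2f}%")    # 0.38%
```

Under these assumptions the total comes to about 0.33M parameters, which is consistent with the figure quoted above and with the abstract's bound of less than 0.5% of the model parameters.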


This review was generated by AI and checked by a human editor.