QUICK REVIEW

[Paper Review] TVT: Transferable Vision Transformer for Unsupervised Domain Adaptation

Jinyu Yang, Jingjing Liu | arXiv (Cornell University) | Aug 12, 2021

Domain Adaptation and Few-Shot Learning | References: 49 | Cited by: 24

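One-Sentence Summary

TVT introduces a Transferability Adaption Module (TAM) and a Discriminative Clustering Module (DCM) so that Vision Transformers adapt effectively across domains, outperforming baselines on digit and object recognition benchmarks.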

ABSTRACT

Unsupervised domain adaptation (UDA) aims to transfer the knowledge learnt from a labeled source domain to an unlabeled target domain. Previous work is mainly built upon convolutional neural networks (CNNs) to learn domain-invariant representations. With the recent exponential increase in applying Vision Transformer (ViT) to vision tasks, the capability of ViT in adapting cross-domain knowledge, however, remains unexplored in the literature. To fill this gap, this paper first comprehensively investigates the transferability of ViT on a variety of domain adaptation tasks. Surprisingly, ViT demonstrates superior transferability over its CNN-based counterparts by a large margin, while the performance can be further improved by incorporating adversarial adaptation. Notwithstanding, directly using CNN-based adaptation strategies fails to take advantage of ViT's intrinsic merits (e.g., attention mechanism and sequential image representation), which play an important role in knowledge transfer. To remedy this, we propose a unified framework, namely Transferable Vision Transformer (TVT), to fully exploit the transferability of ViT for domain adaptation. Specifically, we delicately devise a novel and effective unit, which we term Transferability Adaption Module (TAM). By injecting learned transferabilities into attention blocks, TAM compels ViT to focus on both transferable and discriminative features. Besides, we leverage discriminative clustering to enhance feature diversity and separation, which are undermined during adversarial domain alignment. To verify its versatility, we perform extensive studies of TVT on four benchmarks and the experimental results demonstrate that TVT attains significant improvements compared to existing state-of-the-art UDA methods.

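Research Motivation and Objectives

Investigate the transferability of ViT on cross-domain adaptation tasks relative to CNNs.
Identify the limitations of naively applying adversarial alignment to ViT features.
Design TAM, which injects patch-level transferability into ViT's attention to obtain representations that are both transferable and discriminative.
Introduce DCM to preserve discriminative information while the domains are being aligned.
Demonstrate the effectiveness of TVT through extensive experiments on standard UDA benchmarks.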

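Proposed Method

Use ViT as the backbone and replace the last Transformer layer with the Transferability Adaption Module (TAM).
Within TAM, a patch discriminator D_l scores the patches, and the transferability of each patch is derived as t_ir = H(D_l(f_ir)).
Replace the standard multi-head self-attention with a transferable MSA (T-MSA), which applies the transferability by weighting the patch tokens while preserving discriminative attention.
Apply the Discriminative Clustering Module (DCM) to push target features into well-separated clusters while maintaining global diversity through the mutual information I(p^t; x^t).
Optimize the overall objective: L_clc(x^s, y^s) + α L_dis(x^s, x^t) + β L_pat(x^s, x^t) − γ I(p^t; x^t).
The comparison baseline is a vanilla ViT with global adversarial alignment; TVT adds TAM and DCM on top of it for finer-grained transferability and discriminability.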
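
The TAM and DCM steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation. It assumes that D_l outputs a probability in (0, 1) that a patch comes from the source domain, that H is the binary entropy of that output, that the transferability is applied as an elementwise reweighting of the class token's attention over patches, and that I(p^t; x^t) is estimated as the entropy of the mean prediction minus the mean per-sample entropy. All function names and numeric values are hypothetical.

```python
import math

def binary_entropy(p, eps=1e-12):
    """Entropy (nats) of a Bernoulli variable with success probability p."""
    p = min(max(p, eps), 1.0 - eps)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def transferable_cls_attention(q_cls, k_patches, v_patches, d_patches):
    """Class-token attention over patches, reweighted by patch transferability.

    d_patches holds the assumed discriminator outputs D_l(f_r) in (0, 1).
    """
    dim = len(q_cls)
    # t_r = H(D_l(f_r)): largest when the discriminator cannot tell the domain.
    t = [binary_entropy(d) for d in d_patches]
    scores = [sum(q * k for q, k in zip(q_cls, key)) / math.sqrt(dim)
              for key in k_patches]
    attn = softmax(scores)
    # Assumption: transferability scales each patch's attention weight.
    weights = [a * t_r for a, t_r in zip(attn, t)]
    out = [sum(w * v[j] for w, v in zip(weights, v_patches))
           for j in range(len(v_patches[0]))]
    return out, t

def entropy(dist, eps=1e-12):
    return -sum(p * math.log(p + eps) for p in dist)

def mutual_information(probs):
    """Estimate I(p; x) as H(mean prediction) - mean H(per-sample prediction)."""
    n = len(probs)
    num_classes = len(probs[0])
    mean_pred = [sum(row[c] for row in probs) / n for c in range(num_classes)]
    return entropy(mean_pred) - sum(entropy(row) for row in probs) / n

def total_objective(l_clc, l_dis, l_pat, mi, alpha, beta, gamma):
    """L_clc + alpha * L_dis + beta * L_pat - gamma * I(p^t; x^t)."""
    return l_clc + alpha * l_dis + beta * l_pat - gamma * mi

# Toy usage with made-up numbers.
out, t = transferable_cls_attention(
    q_cls=[1.0, 0.0],
    k_patches=[[1.0, 0.0], [0.0, 1.0]],
    v_patches=[[1.0, 2.0], [3.0, 4.0]],
    d_patches=[0.5, 0.99],  # first patch is domain-ambiguous, second is not
)
mi = mutual_information([[0.9, 0.1], [0.1, 0.9]])
loss = total_objective(0.7, 0.6, 0.6, mi, alpha=1.0, beta=1.0, gamma=0.1)
```

Under these assumptions, a patch whose discriminator output is near 0.5 is hard to assign to either domain and therefore receives the largest weight. The mutual-information term is largest when individual target predictions are confident but spread across classes, which is why it enters the objective with a negative sign.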

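Experimental Results

Research Questions

RQ1: How transferable is ViT across domains compared with CNN-based backbones?
RQ2: Can ViT benefit from adversarial alignment without destroying discriminative information?
RQ3: Can ViT's patch-level transferability and attention be exploited to improve UDA performance?
RQ4: Does adding a discriminative clustering objective during alignment preserve the discriminative structure of the target domain?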

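Key Findings

ViT trained on source data only (Source Only) already outperforms several CNN backbones on multiple UDA benchmarks (e.g., Office-31, Office-Home, VisDA-2017).
Adversarial adaptation improves ViT (the Baseline), and TAM + DCM yield further gains by exploiting ViT's patch-level tokens and attention.
On Digits, TVT achieves the best average accuracy across tasks (Avg = 98.87) and narrows the gap to Target Only.
On Office-31, TVT reaches Avg = 93.85, ahead of both Baseline and Source Only.
On Office-Home, TVT reaches Avg = 83.56, well above the previous best (71.8%).
On VisDA-2017, TVT reaches Avg = 83.92, matching or exceeding strong baselines.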

This review was generated by AI and reviewed by a human editor.