[Paper Review] TVT: Transferable Vision Transformer for Unsupervised Domain Adaptation
TVT introduces a Transferability Adaptation Module (TAM) and a Discriminative Clustering Module (DCM) to adapt Vision Transformers effectively across domains, outperforming the baselines on digits and object recognition benchmarks.
Unsupervised domain adaptation (UDA) aims to transfer the knowledge learnt from a labeled source domain to an unlabeled target domain. Previous work is mainly built upon convolutional neural networks (CNNs) to learn domain-invariant representations. Despite the recent rapid growth in applying Vision Transformer (ViT) to vision tasks, the capability of ViT in adapting cross-domain knowledge remains unexplored in the literature. To fill this gap, this paper first comprehensively investigates the transferability of ViT on a variety of domain adaptation tasks. Surprisingly, ViT demonstrates superior transferability over its CNN-based counterparts by a large margin, while the performance can be further improved by incorporating adversarial adaptation. Nevertheless, directly using CNN-based adaptation strategies fails to take advantage of ViT's intrinsic merits (e.g., attention mechanism and sequential image representation), which play an important role in knowledge transfer. To remedy this, we propose a unified framework, namely Transferable Vision Transformer (TVT), to fully exploit the transferability of ViT for domain adaptation. Specifically, we devise a novel and effective unit, which we term Transferability Adaptation Module (TAM). By injecting learned transferabilities into attention blocks, TAM compels ViT to focus on both transferable and discriminative features. Besides, we leverage discriminative clustering to enhance feature diversity and separation, which are undermined during adversarial domain alignment. To verify its versatility, we perform extensive studies of TVT on four benchmarks, and the experimental results demonstrate that TVT attains significant improvements compared to existing state-of-the-art UDA methods.
Research Motivation and Objectives
- Investigate ViT transferability across domain adaptation tasks compared to CNNs.
- Identify limitations of naïve adversarial alignment on ViT features.
- Design TAM to inject patch-level transferability into ViT attention for transferable and discriminative representations.
- Introduce DCM to preserve discriminative information while aligning domains.
- Demonstrate TVT effectiveness via extensive experiments on standard UDA benchmarks.
Proposed Method
- Use ViT as the backbone, with the last transformer layer replaced by the Transferability Adaptation Module (TAM).
- In TAM, compute patch-level transferability with a patch discriminator D_l and derive per-patch transferability t_ir = H(D_l(f_ir)).
- Replace standard Multi-head Self-Attention with Transferable MSA (T-MSA) by weighting patch tokens with transferability while preserving discriminative attention.
- Apply a Discriminative Clustering Module (DCM) to encourage target features to form well-separated clusters while maintaining global diversity via mutual information I(p^t; x^t).
- Optimize overall objective: L_clc(x^s,y^s) + α L_dis(x^s,x^t) + β L_pat(x^s,x^t) − γ I(p^t; x^t).
- Baseline comparison includes vanilla ViT with global adversarial alignment; TVT adds TAM and DCM for finer-grained transferability and discrimination.
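The quantities in the list above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation and rests on several assumptions: H is read as the entropy of the patch discriminator's binary (source vs. target) output, measured in bits so that it lies in [0, 1]; the mutual information is computed with the standard decomposition I(p; x) = H(mean prediction) − mean H(prediction); and the function names (`patch_transferability`, `mutual_information`, `tvt_objective`) are hypothetical. The individual loss values are passed in as plain numbers rather than computed from a network.

```python
import math


def binary_entropy(p, eps=1e-12):
    """Entropy in bits of a Bernoulli(p) prediction; the value lies in [0, 1]."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def patch_transferability(domain_probs):
    """Assumed reading of t_ir = H(D_l(f_ir)).

    domain_probs: one probability per patch, the (hypothetical) output of the
    patch discriminator D_l that the patch comes from the source domain.
    A confused discriminator (p near 0.5) gives a transferability near 1.
    """
    return [binary_entropy(p) for p in domain_probs]


def entropy(dist, eps=1e-12):
    """Shannon entropy in nats of a categorical distribution."""
    return -sum(q * math.log(q + eps) for q in dist)


def mutual_information(pred_probs):
    """I(p; x) = H(mean prediction) - mean H(prediction) over target samples.

    pred_probs: list of per-sample class-probability lists.
    The first term rewards globally diverse predictions; the second rewards
    confident (low-entropy) predictions on each individual sample.
    """
    n = len(pred_probs)
    num_classes = len(pred_probs[0])
    marginal = [sum(p[c] for p in pred_probs) / n for c in range(num_classes)]
    conditional = sum(entropy(p) for p in pred_probs) / n
    return entropy(marginal) - conditional


def tvt_objective(l_clc, l_dis, l_pat, mi, alpha, beta, gamma):
    """L_clc + alpha * L_dis + beta * L_pat - gamma * I(p^t; x^t)."""
    return l_clc + alpha * l_dis + beta * l_pat - gamma * mi


if __name__ == "__main__":
    # Illustrative inputs only; these are not values from the paper.
    print(patch_transferability([0.5, 0.9, 0.99]))
    print(mutual_information([[0.9, 0.1], [0.1, 0.9]]))
```

Under these assumptions, a patch the discriminator cannot place (probability near 0.5) receives the highest weight, while a patch it classifies confidently receives a weight near 0. The mutual information term is largest when every target sample is predicted confidently and the predictions are spread evenly over the classes, which is why it is subtracted in the overall objective.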
Experimental Results
Research Questions
- RQ1: How transferable is ViT across domain shifts compared to CNN-based backbones?
- RQ2: Can ViT benefit from adversarial alignment without destroying discriminative information?
- RQ3: Can patch-level transferability and attention in ViT be leveraged to improve UDA performance?
- RQ4: Does incorporating a discriminative clustering objective preserve target-domain discriminative structure during alignment?
Key Findings
- ViT with Source Only already outperforms several CNN backbones on multiple UDA benchmarks (e.g., Office-31, Office-Home, VisDA-2017).
- Adversarial adaptation improves ViT (Baseline) but TAM+DCM yields further gains by leveraging ViT’s patch-level tokens and attention.
- On Digits, TVT achieves the best average accuracy across tasks (Avg = 98.87) and narrows the gap to Target Only performance.
- On Office-31, TVT achieves Avg = 93.85, surpassing Baseline and Source Only.
- On Office-Home, TVT achieves Avg = 83.56, significantly higher than prior best (71.8%).
- On VisDA-2017, TVT attains Avg = 83.92, competitive with or exceeding strong baselines.
This review was generated by AI and checked by a human editor.