QUICK REVIEW

[Paper Review] SVD-ViT: Does SVD Make Vision Transformers Attend More to the Foreground?

Haruhiko Murata, Kazuhiro Hotta|arXiv (Cornell University)|Feb 2, 2026
Advanced Neural Network Applications | Cited by 0
One-Sentence Summary

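SVD-ViT introduces singular value decomposition into ViT, emphasizing foreground features through SPC tokens and the optional SSVA and ID-RSVD modules, and improves accuracy on several fine-grained and general-purpose datasets.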

ABSTRACT

Vision Transformers (ViT) have been established as large-scale foundation models. However, because self-attention operates globally, they lack an explicit mechanism to distinguish foreground from background. As a result, ViT may learn unnecessary background features and artifacts, leading to degraded classification performance. To address this issue, we propose SVD-ViT, which leverages singular value decomposition (SVD) to prioritize the learning of foreground features. SVD-ViT consists of three components (the SPC module, SSVA, and ID-RSVD) and suppresses task-irrelevant factors such as background noise and artifacts by extracting and aggregating singular vectors that capture object foreground information. Experimental results demonstrate that our method improves classification accuracy and effectively learns informative foreground representations while reducing the impact of background noise.

Motivation and Objectives

  • Motivate foreground-background separation in ViT due to global self-attention and background noise.
  • Introduce the SPC module that uses leading singular vectors to create an aggregation token for foreground emphasis.
  • Propose SSVA and ID-RSVD as enhancements to selectively integrate discriminative singular directions.
  • Demonstrate improved classification accuracy on five image recognition benchmarks compared to ViT baselines.

Proposed Method

  • Apply randomized SVD (RSVD) to ViT intermediate features to obtain top left singular vectors that capture foreground structure.
  • Generate SPC tokens by projecting features onto the leading singular subspace and append them to patch tokens for subsequent Transformer layers.
  • Use SSVA to selectively mix and aggregate singular vectors into reduced bases conditioned on input signals.
  • Introduce ID-RSVD to make the sketching projection matrix input-dependent and optionally refine it with power iterations.
  • Insert SPC as a plugin between ViT encoder blocks and perform end-to-end fine-tuning on pretrained ViT models.
  • Evaluate on five datasets (CUB-200-2011, FGVC-Aircraft, Stanford Cars, Food-101, CIFAR-100) with full fine-tuning; use n=8 leading components and n' = 4 SPC tokens by default.
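The SPC step listed above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it uses an exact SVD in place of RSVD, random data in place of real ViT features, and a random mixing matrix as a hypothetical stand-in for whatever learned reduction maps the n = 8 leading components to n' = 4 SPC tokens. The 14 x 14 patch grid, the embedding size, and all variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes for the demo: 14x14 patch grid, small embedding dimension.
grid = 14
num_patches, embed_dim = grid * grid, 64
n, n_prime = 8, 4  # defaults quoted in the summary: n = 8 components, n' = 4 SPC tokens

# Stand-in for the patch feature matrix (patches x embedding dim) at one ViT layer.
X = rng.standard_normal((num_patches, embed_dim))

# Exact SVD here; the summary says the method uses randomized SVD instead.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_n = U[:, :n]  # leading left singular vectors, one weight per patch

# Each left singular vector acts as a set of patch weights, so U_n^T X
# aggregates the patch features into n vectors of embedding size.
components = U_n.T @ X  # shape (n, embed_dim)

# Hypothetical reduction from n components to n' tokens (random, not learned).
mix = rng.standard_normal((n_prime, n)) / np.sqrt(n)
spc_tokens = mix @ components  # shape (n_prime, embed_dim)

# Append the SPC tokens to the patch tokens for the following Transformer layers.
tokens_out = np.concatenate([X, spc_tokens], axis=0)

# A left singular vector reshaped to the patch grid, as in the Figure 1 heatmaps.
heatmap = U_n[:, 0].reshape(grid, grid)

print(components.shape, spc_tokens.shape, tokens_out.shape, heatmap.shape)
```

Because the aggregation weights come from the left singular vectors, each aggregated component equals the corresponding singular value times its right singular vector, which is one way to read why the leading directions dominate the resulting tokens.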
Figure 1 : Visualization of the leading left singular vectors obtained by applying SVD to the patch feature matrix (number of patches $\times$ embedding dimension) at each ViT layer. Each left singular vector is reshaped to the patch grid and rendered as a heatmap. From left to right, we show the in

Experimental Results

Research Questions

  • RQ1: Can SVD-derived foreground representations improve ViT's robustness to background noise and artifacts?
  • RQ2: Does inserting an SPC token between ViT blocks enhance foreground-aware aggregation without destabilizing training?
  • RQ3: Do SSVA and ID-RSVD provide consistent gains across diverse datasets and layers?
  • RQ4: How does SVD-ViT perform relative to ViT baselines on fine-grained and broad image classification tasks?

Key Findings

  • SVD-ViT consistently improves over the ViT baseline across the five datasets.
  • On CUB-200-2011, SPC-based methods reach up to 2.52 percentage points higher accuracy than ViT CLS=1.
  • On FGVC-Aircraft, SPC alone yields up to 2.82 percentage points improvement.
  • Layer placement matters: inserting SPC around deeper layers (e.g., Layer 11) yields larger gains; inserting after the final layer can decrease accuracy.
  • SSVA and ID-RSVD provide dataset- and layer-dependent gains with mixed effectiveness across tasks.
  • Qualitative visualizations show leading singular vectors align with foreground structures and suppress background artifacts.
Figure 2 : Overview of RSVD. A low-rank approximation matrix is constructed via randomized sketching and iterative orthogonalization, and applying SVD to the resulting matrix enables extracting only the leading singular vectors.
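The pipeline in the Figure 2 caption (randomized sketching, iterative orthogonalization, then an SVD of the small projected matrix) can be sketched as follows. This is the standard randomized SVD recipe with a fixed Gaussian sketching matrix, written as an illustration rather than as the paper's code. The function name `randomized_svd`, the oversampling amount, the number of power iterations, and the synthetic rank-8 input are all assumptions.

```python
import numpy as np

def randomized_svd(X, k, oversample=4, n_power_iter=2, seed=0):
    """Approximate the top-k singular triplets of X via randomized sketching."""
    rng = np.random.default_rng(seed)
    n_cols = X.shape[1]

    # Sketching: project X onto a small random subspace (Gaussian test matrix).
    omega = rng.standard_normal((n_cols, k + oversample))
    Q, _ = np.linalg.qr(X @ omega)

    # Power iterations with re-orthogonalization sharpen the captured subspace.
    for _ in range(n_power_iter):
        Z, _ = np.linalg.qr(X.T @ Q)
        Q, _ = np.linalg.qr(X @ Z)

    # SVD of the small projected matrix, then lift back to the original space.
    B = Q.T @ X
    U_b, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_b
    return U[:, :k], s[:k], Vt[:k]

# Synthetic stand-in for a patch feature matrix with exact rank 8.
rng = np.random.default_rng(1)
X = rng.standard_normal((196, 8)) @ rng.standard_normal((8, 64))

U_k, s_k, Vt_k = randomized_svd(X, k=8)
s_exact = np.linalg.svd(X, compute_uv=False)

print(np.max(np.abs(s_k - s_exact[:8])))  # gap between sketched and exact singular values
```

In this sketch `omega` is drawn independently of the input. The summary describes ID-RSVD as making that projection matrix input-dependent, with optional refinement by power iterations; that variant is not reproduced here.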


This review was generated by AI and reviewed by a human editor.