QUICK REVIEW

[Paper Review] UniAdapter: Unified Parameter-Efficient Transfer Learning for Cross-modal Modeling

Haoyu Lu, Yuqi Huo | arXiv (Cornell University) | 2023-02-13
Multimodal Machine Learning Applications | Citations: 20
One-Line Summary

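UniAdapter unifies unimodal and multimodal adapters for parameter-efficient cross-modal transfer. It achieves competitive or superior results on vision-language models while tuning only 1.0%–2.0% of the parameters, and across six cross-modal benchmarks it often outperforms full fine-tuning.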

ABSTRACT

Large-scale vision-language pre-trained models have shown promising transferability to various downstream tasks. As the size of these foundation models and the number of downstream tasks grow, the standard full fine-tuning paradigm becomes unsustainable due to heavy computational and storage costs. This paper proposes UniAdapter, which unifies unimodal and multimodal adapters for parameter-efficient cross-modal adaptation on pre-trained vision-language models. Specifically, adapters are distributed to different modalities and their interactions, with the total number of tunable parameters reduced by partial weight sharing. The unified and knowledge-sharing design enables powerful cross-modal representations that can benefit various downstream tasks, requiring only 1.0%-2.0% tunable parameters of the pre-trained model. Extensive experiments on 6 cross-modal downstream benchmarks (including video-text retrieval, image-text retrieval, VideoQA, and VQA) show that in most cases, UniAdapter not only outperforms the state-of-the-arts, but even beats the full fine-tuning strategy. Particularly, on the MSRVTT retrieval task, UniAdapter achieves 49.7% recall@1 with 2.2% model parameters, outperforming the latest competitors by 2.0%. The code and models are available at https://github.com/RERV/UniAdapter.

Research Motivation and Objectives

  • Motivate efficient transfer of large vision-language models to diverse cross-modal tasks without full fine-tuning.
  • Propose UniAdapter to unify unimodal and multimodal adapters with knowledge sharing.
  • Preserve language query integrity and handle video frame noise in cross-modal modeling.
  • Demonstrate strong performance and parameter efficiency on multiple cross-modal benchmarks.

Proposed Method

  • Introduce UniAdapter with a unified down-projection layer shared across modalities and modality-specific up-projection layers.
  • Incorporate Query-residual Adaption to preserve textual query information during cross-attention.
  • Apply parameter-free Frame-aware Attention to weight frame tokens in video tasks without extra parameters.
  • Share down-projection weights across modalities to enable cross-modal knowledge transfer while keeping up-projections modality-specific.
  • Attach UniAdapters to visual, textual, and cross-modal encoders within a frozen BLIP-based vision-language backbone.
  • Evaluate on six cross-modal benchmarks covering video-text retrieval, image-text retrieval, VideoQA, and VQA.
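
The weight-sharing idea in the list above (one down-projection shared by all modalities, one up-projection per modality) can be illustrated with a minimal pure-Python sketch. Everything here is an assumption made for illustration and is not taken from the authors' code: the class name `SharedDownAdapters`, the ReLU activation, the omission of bias terms, the random initialization, and the tiny dimensions. Query-residual Adaption and Frame-aware Attention are not modeled.

```python
import random


def matvec(matrix, vec):
    """Multiply a matrix (stored as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def random_matrix(rows, cols, rng, scale=0.02):
    """Small random weights; the initialization scheme is an assumption."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def count(matrix):
    """Number of scalar weights in a matrix."""
    return sum(len(row) for row in matrix)


class SharedDownAdapters:
    """Hypothetical bottleneck adapters: shared down-projection, per-modality up-projection."""

    def __init__(self, hidden_dim, bottleneck_dim, modalities, seed=0):
        rng = random.Random(seed)
        # One down-projection reused by every modality (the knowledge-sharing part).
        self.down = random_matrix(bottleneck_dim, hidden_dim, rng)
        # One up-projection per modality (the modality-specific part).
        self.up = {name: random_matrix(hidden_dim, bottleneck_dim, rng)
                   for name in modalities}

    def forward(self, x, modality):
        hidden = [max(0.0, v) for v in matvec(self.down, x)]  # ReLU is assumed
        delta = matvec(self.up[modality], hidden)
        return [xi + di for xi, di in zip(x, delta)]  # residual connection

    def num_parameters(self):
        return count(self.down) + sum(count(up) for up in self.up.values())


adapters = SharedDownAdapters(hidden_dim=8, bottleneck_dim=2,
                              modalities=("visual", "textual", "cross_modal"))
features = [0.1] * 8
visual_out = adapters.forward(features, "visual")
textual_out = adapters.forward(features, "textual")

shared_total = adapters.num_parameters()  # 2*8 shared + 3*(8*2) specific = 64
unshared_total = 3 * (2 * 8 + 8 * 2)      # three fully separate adapters = 96
```

In this toy setting, sharing the down-projection cuts the adapter weights from 96 to 64. The saving grows with the number of modalities that reuse the same down-projection.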

Experimental Results

Research Questions

  • RQ1: Can a unified, parameter-efficient adapter framework support diverse cross-modal downstream tasks (retrieval and reasoning) across image and video modalities?
  • RQ2: Does knowledge sharing of the adapter components improve cross-modal transfer while reducing tunable parameters?
  • RQ3: How do query residuals and frame-aware attention impact cross-modal performance in video-language tasks?

Key Results

  • UniAdapter achieves competitive or superior results while tuning only 1.0%–2.0% of the pre-trained model's parameters; the backbone itself stays frozen.
  • Inserting adapters in the multimodal encoder yields stronger gains than unimodal (visual-only or textual-only) adapters.
  • Weight-sharing down-projection across modalities with modality-specific up-projections maintains performance while reducing tunable parameters.
  • Query-residual adaption and parameter-free frame-aware attention further improve cross-modal performance without adding parameters.
  • On MSRVTT retrieval, UniAdapter achieves 49.7% R@1 with 2.2% of model parameters tunable, outperforming the latest competitors by 2.0% and surpassing some full fine-tuning baselines.
  • Across six benchmarks (video-text retrieval, image-text retrieval, VideoQA, VQA), UniAdapter generally outperforms prior parameter-efficient methods and matches or exceeds full fine-tuning in many cases.
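
To make "parameter-free" concrete for the frame-aware attention mentioned above, here is a small hedged sketch of one way to weight video frames with no learned weights. This summary does not give the exact formula, so the scoring rule (a softmax over dot products between a per-frame summary vector and a text-query summary vector) and the names `frame_aware_weighting`, `frame_summaries`, and `query_summary` are assumptions for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def frame_aware_weighting(frame_summaries, frame_tokens, query_summary):
    """Scale each frame's tokens by that frame's similarity to the text query.

    frame_summaries: one summary vector per frame (assumed, e.g. a [CLS]-style feature)
    frame_tokens:    for each frame, a list of token vectors
    query_summary:   one summary vector for the text query (assumed)
    No trainable weights are involved, only dot products and a softmax.
    """
    scores = [sum(f * q for f, q in zip(summary, query_summary))
              for summary in frame_summaries]
    weights = softmax(scores)
    weighted = [[[w * v for v in token] for token in tokens]
                for w, tokens in zip(weights, frame_tokens)]
    return weights, weighted


# Two toy frames: the first aligns with the query, the second does not.
frame_summaries = [[1.0, 0.0], [0.0, 1.0]]
frame_tokens = [[[1.0, 2.0]], [[3.0, 4.0]]]
query_summary = [2.0, 0.0]

weights, weighted = frame_aware_weighting(frame_summaries, frame_tokens, query_summary)
```

In this toy input the first frame receives the larger weight, so its tokens contribute more to whatever cross-modal layer consumes them, while frames unrelated to the query are down-weighted.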

This review was generated by AI and reviewed by a human editor.