[Paper Review] UniAdapter: Unified Parameter-Efficient Transfer Learning for Cross-modal Modeling
UniAdapter unifies unimodal and multimodal adapters for parameter-efficient cross-modal transfer on vision-language models. With only 1.0%–2.0% tunable parameters, it achieves competitive or superior results across six cross-modal benchmarks and often outperforms full fine-tuning.
Large-scale vision-language pre-trained models have shown promising transferability to various downstream tasks. As the size of these foundation models and the number of downstream tasks grow, the standard full fine-tuning paradigm becomes unsustainable due to heavy computational and storage costs. This paper proposes UniAdapter, which unifies unimodal and multimodal adapters for parameter-efficient cross-modal adaptation on pre-trained vision-language models. Specifically, adapters are distributed to different modalities and their interactions, with the total number of tunable parameters reduced by partial weight sharing. The unified and knowledge-sharing design enables powerful cross-modal representations that can benefit various downstream tasks, requiring only 1.0%-2.0% tunable parameters of the pre-trained model. Extensive experiments on 6 cross-modal downstream benchmarks (including video-text retrieval, image-text retrieval, VideoQA, and VQA) show that in most cases, UniAdapter not only outperforms the state-of-the-arts, but even beats the full fine-tuning strategy. Particularly, on the MSRVTT retrieval task, UniAdapter achieves 49.7% recall@1 with 2.2% model parameters, outperforming the latest competitors by 2.0%. The code and models are available at https://github.com/RERV/UniAdapter.
Motivation and Objectives
- Motivate efficient transfer of large vision-language models to diverse cross-modal tasks without full fine-tuning.
- Propose UniAdapter to unify unimodal and multimodal adapters with knowledge sharing.
- Preserve language query integrity and handle video frame noise in cross-modal modeling.
- Demonstrate strong performance and parameter efficiency on multiple cross-modal benchmarks.
Proposed Method
- Introduce UniAdapter with a unified down-projection layer shared across modalities and modality-specific up-projection layers.
- Incorporate Query-residual Adaption to preserve textual query information during cross-attention.
- Apply parameter-free Frame-aware Attention to weight frame tokens in video tasks without extra parameters.
- Share down-projection weights across modalities to enable cross-modal knowledge transfer while keeping up-projections modality-specific.
- Attach UniAdapters to visual, textual, and cross-modal encoders within a frozen BLIP-based vision-language backbone.
- Evaluate on six cross-modal benchmarks covering video-text retrieval, image-text retrieval, VideoQA, and VQA.
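The shared down-projection with modality-specific up-projections described in the list above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name `UniAdapterSketch`, the ReLU activation, the zero initialisation of the up-projections, the omission of bias terms, and the dimensions (768 and 64) are all assumptions made for the example. Query-residual adaption and frame-aware attention are not modelled here.

```python
import numpy as np


class UniAdapterSketch:
    """Bottleneck adapter with one shared down-projection and one
    up-projection per modality (hypothetical sketch, biases omitted)."""

    def __init__(self, hidden_dim, bottleneck_dim, modalities, seed=0):
        rng = np.random.default_rng(seed)
        # Shared by every modality: this is the knowledge-sharing component.
        self.w_down = rng.normal(0.0, 0.02, size=(hidden_dim, bottleneck_dim))
        # One up-projection per modality. Zero initialisation (an assumed,
        # commonly used convention) makes the adapter an identity map at start.
        self.w_up = {m: np.zeros((bottleneck_dim, hidden_dim)) for m in modalities}

    def __call__(self, x, modality):
        # Down-project, apply a non-linearity (ReLU is an assumption),
        # up-project with the modality's own weights, then add the residual.
        z = np.maximum(x @ self.w_down, 0.0)
        return x + z @ self.w_up[modality]


adapter = UniAdapterSketch(
    hidden_dim=768,
    bottleneck_dim=64,
    modalities=("visual", "textual", "cross_modal"),
)
tokens = np.ones((2, 5, 768))          # (batch, sequence, hidden)
visual_out = adapter(tokens, "visual")
```

Because the up-projections start at zero, `visual_out` equals `tokens` until training updates the adapter weights, so the frozen backbone's behaviour is unchanged at initialisation.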
Experimental Results
Research Questions
- RQ1: Can a unified, parameter-efficient adapter framework support diverse cross-modal downstream tasks (retrieval and reasoning) across image and video modalities?
- RQ2: Does knowledge sharing of the adapter components improve cross-modal transfer while reducing tunable parameters?
- RQ3: How do query residuals and frame-aware attention impact cross-modal performance in video-language tasks?
Key Results
- UniAdapter achieves competitive or superior results while tuning only 1.0%–2.0% of the pre-trained model's parameters, with the backbone kept frozen.
- Inserting adapters in the multimodal encoder yields stronger gains than unimodal visual or textual adapters alone.
- Weight-sharing down-projection across modalities with modality-specific up-projections maintains performance while reducing tunable parameters.
- Query-residual adaption and parameter-free frame-aware attention further improve cross-modal performance without adding parameters.
- On MSRVTT retrieval, UniAdapter achieves 49.7% R@1 with 2.2% tunable parameters, outperforming the latest competitors by 2.0% and surpassing some full fine-tuning baselines.
- Across six benchmarks (video-text retrieval, image-text retrieval, VideoQA, VQA), UniAdapter generally outperforms prior parameter-efficient methods and matches or exceeds full fine-tuning in many cases.
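To see why sharing the down-projection reduces tunable parameters, the following back-of-the-envelope count compares one adapter site with and without sharing. The helper `adapter_params` and the dimensions (hidden size 768, bottleneck 64, three modalities) are hypothetical values chosen for illustration, and bias terms are ignored. The numbers are not taken from the paper and do not reproduce its reported 1.0%–2.0% figure.

```python
def adapter_params(hidden_dim, bottleneck_dim, num_modalities, share_down):
    """Count adapter weights at one site (biases ignored, hypothetical helper)."""
    down = hidden_dim * bottleneck_dim
    up = bottleneck_dim * hidden_dim
    # With sharing, a single down-projection serves every modality.
    num_down = 1 if share_down else num_modalities
    return num_down * down + num_modalities * up


separate = adapter_params(768, 64, 3, share_down=False)  # 294912 weights
shared = adapter_params(768, 64, 3, share_down=True)     # 196608 weights
saving = 1 - shared / separate                           # one third fewer weights
```

Under these assumed dimensions, sharing removes two of the six projection matrices, which is where the one-third reduction comes from.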
This review was generated by AI and reviewed by a human editor.