QUICK REVIEW

[Paper Review] UniAdapter: Unified Parameter-Efficient Transfer Learning for Cross-modal Modeling

Haoyu Lu, Yuqi Huo|arXiv (Cornell University)|Feb 13, 2023
Multimodal Machine Learning Applications | Cited by 20
ONE-SENTENCE SUMMARY

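UniAdapter unifies unimodal and multimodal adapters for parameter-efficient cross-modal transfer of vision-language models. With only 1.0%–2.0% of the pre-trained model's parameters tunable, it achieves competitive or superior results and often outperforms full fine-tuning across six cross-modal benchmarks.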

ABSTRACT

Large-scale vision-language pre-trained models have shown promising transferability to various downstream tasks. As the size of these foundation models and the number of downstream tasks grow, the standard full fine-tuning paradigm becomes unsustainable due to heavy computational and storage costs. This paper proposes UniAdapter, which unifies unimodal and multimodal adapters for parameter-efficient cross-modal adaptation on pre-trained vision-language models. Specifically, adapters are distributed to different modalities and their interactions, with the total number of tunable parameters reduced by partial weight sharing. The unified and knowledge-sharing design enables powerful cross-modal representations that can benefit various downstream tasks, requiring only 1.0%-2.0% tunable parameters of the pre-trained model. Extensive experiments on 6 cross-modal downstream benchmarks (including video-text retrieval, image-text retrieval, VideoQA, and VQA) show that in most cases, UniAdapter not only outperforms the state-of-the-arts, but even beats the full fine-tuning strategy. Particularly, on the MSRVTT retrieval task, UniAdapter achieves 49.7% recall@1 with 2.2% model parameters, outperforming the latest competitors by 2.0%. The code and models are available at https://github.com/RERV/UniAdapter.

RESEARCH MOTIVATION AND OBJECTIVES

  • Transfer large vision-language models to diverse cross-modal tasks efficiently, avoiding the computational and storage costs of full fine-tuning.
  • Propose UniAdapter to unify unimodal and multimodal adapters with knowledge sharing.
  • Preserve language query integrity and handle video frame noise in cross-modal modeling.
  • Demonstrate strong performance and parameter efficiency on multiple cross-modal benchmarks.

PROPOSED METHOD

  • Introduce UniAdapter with a unified down-projection layer shared across modalities and modality-specific up-projection layers.
  • Incorporate Query-residual Adaption to preserve textual query information during cross-attention.
  • Apply parameter-free Frame-aware Attention to weight frame tokens in video tasks without extra parameters.
  • Share down-projection weights across modalities to enable cross-modal knowledge transfer while keeping up-projections modality-specific.
  • Attach UniAdapters to visual, textual, and cross-modal encoders within a frozen BLIP-based vision-language backbone.
  • Evaluate on six cross-modal benchmarks covering video-text retrieval, image-text retrieval, VideoQA, and VQA.
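
The partial weight sharing described in the list above can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the class name `SharedDownAdapter`, the ReLU activation, the Gaussian initialisation, the omission of biases, and the tiny dimensions are all assumptions made for readability, and plain Python lists stand in for tensors so the sketch runs without dependencies.

```python
import random


def matvec(weight, vec):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def init_matrix(rows, cols, rng, scale=0.02):
    """Small Gaussian weights; this initialisation is an assumption."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class SharedDownAdapter:
    """Hypothetical bottleneck adapter: one shared down-projection,
    one up-projection per modality."""

    def __init__(self, hidden_dim, bottleneck_dim, modalities, seed=0):
        rng = random.Random(seed)
        # Shared by every modality: hidden_dim -> bottleneck_dim.
        self.down = init_matrix(bottleneck_dim, hidden_dim, rng)
        # One per modality: bottleneck_dim -> hidden_dim.
        self.up = {m: init_matrix(hidden_dim, bottleneck_dim, rng)
                   for m in modalities}

    def forward(self, x, modality):
        # ReLU bottleneck; the activation choice is an assumption.
        z = [max(0.0, v) for v in matvec(self.down, x)]
        delta = matvec(self.up[modality], z)
        # Residual connection: add the adapter's correction to the input.
        return [xi + di for xi, di in zip(x, delta)]


adapter = SharedDownAdapter(hidden_dim=8, bottleneck_dim=2,
                            modalities=("visual", "textual", "cross_modal"))
token = [0.1 * i for i in range(8)]
outputs = {m: adapter.forward(token, m) for m in adapter.up}
```

Only `down` and the entries of `up` would be trained in this sketch; the backbone feature `x` passes through the residual path unchanged, which is what lets the backbone itself stay frozen.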

EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

  • RQ1: Can a unified, parameter-efficient adapter framework support diverse cross-modal downstream tasks (retrieval and reasoning) across image and video modalities?
  • RQ2: Does knowledge sharing of the adapter components improve cross-modal transfer while reducing tunable parameters?
  • RQ3: How do query residuals and frame-aware attention impact cross-modal performance in video-language tasks?
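
To make the two mechanisms named in RQ3 concrete, here is a hedged sketch of one plausible reading. The summary does not give the exact formulas, so the following are assumptions: frame weights come from a softmax over dot-product similarities between a per-frame summary vector and the text query, and the query residual adds an adapted copy of the text query back onto the cross-attention output. All function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def frame_aware_weights(frame_summaries, text_query):
    """One weight per frame from query-frame similarity (no learned parameters)."""
    return softmax([dot(f, text_query) for f in frame_summaries])


def reweight_frame_tokens(frame_tokens, weights):
    """Scale every token of frame i by that frame's weight."""
    return [[[w * v for v in token] for token in tokens]
            for tokens, w in zip(frame_tokens, weights)]


def query_residual(cross_attn_out, text_query, adapt):
    """Add an adapted copy of the text query onto the cross-attention output."""
    return [c + a for c, a in zip(cross_attn_out, adapt(text_query))]


# Toy inputs: three frames, each summarised by a 2-d vector.
text_query = [1.0, 0.0]
frame_summaries = [[2.0, 0.0], [0.0, 2.0], [-1.0, 0.0]]
weights = frame_aware_weights(frame_summaries, text_query)

frame_tokens = [[[1.0, 1.0]], [[1.0, 1.0]], [[1.0, 1.0]]]
reweighted = reweight_frame_tokens(frame_tokens, weights)

# `adapt` stands in for an adapter; doubling is a placeholder.
fused = query_residual([1.0, 2.0], [0.5, 0.5], lambda q: [2 * v for v in q])
```

In this toy input the first frame aligns best with the query and so receives the largest weight, while the frame pointing away from the query is suppressed. Neither weighting function introduces trainable parameters.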

KEY FINDINGS

  • UniAdapter achieves competitive or superior results while tuning only 1.0%–2.0% as many parameters as the pre-trained model contains; the backbone itself stays frozen.
  • Inserting adapters in the multimodal encoder yields stronger gains than unimodal (visual-only or textual-only) adapters.
  • Weight-sharing down-projection across modalities with modality-specific up-projections maintains performance while reducing tunable parameters.
  • Query-residual adaption and parameter-free frame-aware attention further improve cross-modal performance without adding parameters.
  • On MSRVTT retrieval, UniAdapter achieves 49.7% R@1 with 2.2% tunable parameters, outperforming the latest competitors by 2.0%.
  • Across six benchmarks (video-text retrieval, image-text retrieval, VideoQA, VQA), UniAdapter generally outperforms prior parameter-efficient methods and matches or exceeds full fine-tuning in many cases.
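
The parameter saving from sharing the down-projection is simple arithmetic, sketched below. Every size here is a hypothetical placeholder chosen to keep the numbers easy to follow: the hidden width, bottleneck width, layer count, and especially `BACKBONE_PARAMS` are not taken from the paper, and biases are ignored.

```python
def adapter_params(hidden_dim, bottleneck_dim, num_modalities, share_down):
    """Weight count (no biases) for one layer's adapters across all modalities."""
    per_projection = hidden_dim * bottleneck_dim
    up = num_modalities * per_projection
    down_copies = 1 if share_down else num_modalities
    return up + down_copies * per_projection


def tunable_fraction(tunable_params, backbone_params):
    """Tunable parameters relative to the frozen pre-trained model."""
    return tunable_params / backbone_params


# Hypothetical sizes, for illustration only.
HIDDEN, BOTTLENECK, MODALITIES, LAYERS = 768, 128, 3, 12
BACKBONE_PARAMS = 400_000_000  # placeholder, not a measured model size

separate = LAYERS * adapter_params(HIDDEN, BOTTLENECK, MODALITIES, share_down=False)
shared = LAYERS * adapter_params(HIDDEN, BOTTLENECK, MODALITIES, share_down=True)

saving = 1 - shared / separate
fraction = tunable_fraction(shared, BACKBONE_PARAMS)
print(f"separate={separate:,} shared={shared:,} saving={saving:.1%}")
```

With three modalities, sharing one down-projection replaces six projection matrices per layer with four, so a third of the adapter weights disappear regardless of the widths chosen.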


This review was generated by AI and reviewed by a human editor.