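[Paper Summary] UniAdapter: Unified Parameter-Efficient Transfer Learning for Cross-modal Modeling
UniAdapter unifies unimodal and multimodal adapters for parameter-efficient cross-modal transfer of vision-language models. With only 1.0%–2.0% of the parameters tunable, it achieves competitive or superior results and often outperforms full fine-tuning across six cross-modal benchmarks.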
Large-scale vision-language pre-trained models have shown promising transferability to various downstream tasks. As the size of these foundation models and the number of downstream tasks grow, the standard full fine-tuning paradigm becomes unsustainable due to heavy computational and storage costs. This paper proposes UniAdapter, which unifies unimodal and multimodal adapters for parameter-efficient cross-modal adaptation on pre-trained vision-language models. Specifically, adapters are distributed to different modalities and their interactions, with the total number of tunable parameters reduced by partial weight sharing. The unified and knowledge-sharing design enables powerful cross-modal representations that can benefit various downstream tasks, requiring only 1.0%–2.0% tunable parameters of the pre-trained model. Extensive experiments on 6 cross-modal downstream benchmarks (including video-text retrieval, image-text retrieval, VideoQA, and VQA) show that in most cases, UniAdapter not only outperforms the state of the art, but even beats the full fine-tuning strategy. In particular, on the MSRVTT retrieval task, UniAdapter achieves 49.7% recall@1 with 2.2% model parameters, outperforming the latest competitors by 2.0%. The code and models are available at https://github.com/RERV/UniAdapter.
Research Motivation and Objectives
- Enable efficient transfer of large vision-language models to diverse cross-modal tasks without full fine-tuning.
- Propose UniAdapter to unify unimodal and multimodal adapters with knowledge sharing.
- Preserve language query integrity and handle video frame noise in cross-modal modeling.
- Demonstrate strong performance and parameter efficiency on multiple cross-modal benchmarks.
Proposed Method
- Introduce UniAdapter with a unified down-projection layer shared across modalities and modality-specific up-projection layers.
- Incorporate Query-residual Adaption to preserve textual query information during cross-attention.
- Apply parameter-free Frame-aware Attention to reweight frame tokens in video tasks, reducing the influence of noisy frames.
- Use the shared down-projection as the channel for cross-modal knowledge transfer, which also lowers the number of tunable parameters.
- Attach UniAdapters to visual, textual, and cross-modal encoders within a frozen BLIP-based vision-language backbone.
- Evaluate on six cross-modal benchmarks spanning video-text retrieval, image-text retrieval, VideoQA, and VQA.
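The weight-sharing idea in the list above can be illustrated with a small sketch. This is not the authors' implementation: the class name `SharedDownAdapter`, the ReLU activation, the omission of biases and scaling factors, and the toy dimensions are all assumptions chosen for readability. It only shows the structure the summary describes, namely one down-projection shared by every modality and a separate up-projection per modality, and compares the parameter count against three fully independent adapters.

```python
import random


def init_matrix(d_in, d_out, rng, scale=0.02):
    """Small random weight matrix stored as a list of rows (d_in x d_out)."""
    return [[rng.gauss(0.0, scale) for _ in range(d_out)] for _ in range(d_in)]


def linear(x, w):
    """Multiply a vector x (length d_in) by a matrix w (d_in x d_out)."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]


class SharedDownAdapter:
    """Hypothetical bottleneck adapter: shared down-projection, per-modality up-projection.

    Biases, scaling factors and the exact activation are omitted or assumed.
    """

    def __init__(self, d_model, d_bottleneck, modalities, seed=0):
        rng = random.Random(seed)
        self.down = init_matrix(d_model, d_bottleneck, rng)  # shared by all modalities
        self.up = {m: init_matrix(d_bottleneck, d_model, rng) for m in modalities}

    def __call__(self, x, modality):
        hidden = [max(0.0, v) for v in linear(x, self.down)]  # ReLU (assumed)
        delta = linear(hidden, self.up[modality])             # modality-specific branch
        return [xi + di for xi, di in zip(x, delta)]          # residual connection

    def num_params(self):
        size = lambda w: len(w) * len(w[0])
        return size(self.down) + sum(size(w) for w in self.up.values())


modalities = ["visual", "textual", "cross_modal"]
adapter = SharedDownAdapter(d_model=8, d_bottleneck=2, modalities=modalities)
token = [0.5] * 8                      # a toy 8-dimensional token feature
adapted = adapter(token, "visual")

shared_count = adapter.num_params()                  # 8*2 + 3 * (2*8)
separate_count = len(modalities) * (8 * 2 + 2 * 8)   # three independent adapters
print(shared_count, separate_count)
```

With three modalities, sharing the down-projection stores one down matrix instead of three, so the adapter weights shrink by a third in this toy setting. The saving in a real backbone depends on the hidden size and bottleneck width actually used.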
Experimental Results
Research Questions
- RQ1: Can a unified, parameter-efficient adapter framework support diverse cross-modal downstream tasks (retrieval and reasoning) across image and video modalities?
- RQ2: Does knowledge sharing of the adapter components improve cross-modal transfer while reducing tunable parameters?
- RQ3: How do query residuals and frame-aware attention impact cross-modal performance in video-language tasks?
Key Findings
- UniAdapter achieves competitive or superior results while tuning only 1.0%–2.0% of the pre-trained model's parameters; the backbone itself stays frozen.
- Inserting adapters in the multimodal encoder yields stronger gains than unimodal visual or textual adapters alone.
- Weight-sharing down-projection across modalities with modality-specific up-projections maintains performance while reducing tunable parameters.
- Query-residual adaption and parameter-free frame-aware attention further improve cross-modal performance without adding parameters.
- On MSRVTT retrieval, UniAdapter with 2.2% tunable parameters achieves 49.7% R@1, outperforming several competitors and surpassing some full fine-tuning baselines.
- Across six benchmarks (video-text retrieval, image-text retrieval, VideoQA, VQA), UniAdapter generally outperforms prior parameter-efficient methods and matches or exceeds full fine-tuning in many cases.
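To make "parameter-free" concrete, here is a minimal sketch of one way frame weighting can work without any learned weights. The summary does not spell out the scoring rule, so scoring each frame by the dot product between its global feature and the text query feature is an assumption, as are the function names `frame_aware_weights` and `reweight_frames` and the toy features.

```python
import math


def frame_aware_weights(frame_cls, text_cls):
    """Softmax over frames of the frame/text dot product; no learned parameters."""
    scores = [sum(f * t for f, t in zip(frame, text_cls)) for frame in frame_cls]
    top = max(scores)                                # subtract max for stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def reweight_frames(frame_tokens, weights):
    """Scale every token of frame i by that frame's weight."""
    return [
        [[w * v for v in token] for token in tokens]
        for tokens, w in zip(frame_tokens, weights)
    ]


# Toy example: three frames with 2-dimensional global features (assumed values).
frame_cls = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
text_cls = [2.0, 0.0]                                # query aligned with frame 0
weights = frame_aware_weights(frame_cls, text_cls)

frame_tokens = [[[1.0, 1.0]], [[1.0, 1.0]], [[1.0, 1.0]]]  # one token per frame
reweighted = reweight_frames(frame_tokens, weights)
print(weights)
```

The frame most similar to the query receives the largest weight, and the weights sum to one, so less relevant frames are down-weighted rather than discarded. Nothing in these functions is trainable, which is what keeps the tunable parameter count unchanged.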
This summary was generated by AI and reviewed by a human editor.