QUICK REVIEW

[Paper Review] UniAdapter: Unified Parameter-Efficient Transfer Learning for Cross-modal Modeling

Haoyu Lu, Yuqi Huo | arXiv (Cornell University) | Feb 13, 2023

Multimodal Machine Learning Applications | Cited by 20

ONE-SENTENCE SUMMARY

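UniAdapter unifies unimodal and multimodal adapters for parameter-efficient cross-modal transfer of vision-language models: with only 1.0%–2.0% tunable parameters it achieves competitive or superior results, and it often outperforms full fine-tuning across six cross-modal benchmarks.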

ABSTRACT

Large-scale vision-language pre-trained models have shown promising transferability to various downstream tasks. As the size of these foundation models and the number of downstream tasks grow, the standard full fine-tuning paradigm becomes unsustainable due to heavy computational and storage costs. This paper proposes UniAdapter, which unifies unimodal and multimodal adapters for parameter-efficient cross-modal adaptation on pre-trained vision-language models. Specifically, adapters are distributed to different modalities and their interactions, with the total number of tunable parameters reduced by partial weight sharing. The unified and knowledge-sharing design enables powerful cross-modal representations that can benefit various downstream tasks, requiring only 1.0%-2.0% tunable parameters of the pre-trained model. Extensive experiments on 6 cross-modal downstream benchmarks (including video-text retrieval, image-text retrieval, VideoQA, and VQA) show that in most cases, UniAdapter not only outperforms the state-of-the-arts, but even beats the full fine-tuning strategy. Particularly, on the MSRVTT retrieval task, UniAdapter achieves 49.7% recall@1 with 2.2% model parameters, outperforming the latest competitors by 2.0%. The code and models are available at https://github.com/RERV/UniAdapter.

RESEARCH MOTIVATION AND GOALS

Motivate efficient transfer of large vision-language models to diverse cross-modal tasks without full fine-tuning.
Propose UniAdapter to unify unimodal and multimodal adapters with knowledge sharing.
Preserve language query integrity and handle video frame noise in cross-modal modeling.
Demonstrate strong performance and parameter efficiency on multiple cross-modal benchmarks.

PROPOSED METHOD

Introduce UniAdapter with a unified down-projection layer shared across modalities and modality-specific up-projection layers.
Incorporate Query-residual Adaption to preserve textual query information during cross-attention.
Apply parameter-free Frame-aware Attention to weight frame tokens in video tasks without extra parameters.
Share down-projection weights across modalities to enable cross-modal knowledge transfer while keeping up-projections modality-specific.
Attach UniAdapters to visual, textual, and cross-modal encoders within a frozen BLIP-based vision-language backbone.
Evaluate on six cross-modal benchmarks covering video-text retrieval, image-text retrieval, VideoQA, and VQA.
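The weight-sharing design and the parameter-free frame weighting listed above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the class name `SharedDownAdapter`, the function `frame_aware_reweight`, the ReLU bottleneck, the zero initialization, the `scale` factor, the use of cosine similarity with a softmax over frames, and the sizes 768 and 128 are all assumptions chosen for the example. Query-residual Adaption is left out of the sketch.

```python
import numpy as np


class SharedDownAdapter:
    """Bottleneck adapter: one shared down-projection, one up-projection per modality."""

    def __init__(self, d_model, d_bottleneck, modalities, scale=0.1, seed=0):
        rng = np.random.default_rng(seed)
        # Shared by every modality: this is the knowledge-sharing part.
        self.w_down = rng.normal(0.0, 0.02, size=(d_model, d_bottleneck))
        # One up-projection per modality, zero-initialized (an assumption)
        # so that the adapter starts out as an identity mapping.
        self.w_up = {m: np.zeros((d_bottleneck, d_model)) for m in modalities}
        self.scale = scale  # hypothetical scaling factor

    def __call__(self, x, modality):
        hidden = np.maximum(x @ self.w_down, 0.0)  # ReLU bottleneck (assumed)
        return x + self.scale * (hidden @ self.w_up[modality])  # residual path

    def num_params(self):
        return self.w_down.size + sum(w.size for w in self.w_up.values())


def frame_aware_reweight(frame_tokens, frame_cls, text_cls):
    """Weight each frame's tokens by its similarity to the text query.

    frame_tokens: (frames, patches, dim); frame_cls: (frames, dim); text_cls: (dim,)
    No learnable parameters are involved.
    """
    f = frame_cls / np.linalg.norm(frame_cls, axis=-1, keepdims=True)
    t = text_cls / np.linalg.norm(text_cls)
    sims = f @ t                         # cosine similarity per frame
    weights = np.exp(sims - sims.max())
    weights /= weights.sum()             # softmax over frames
    return frame_tokens * weights[:, None, None], weights


modalities = ("visual", "textual", "cross_modal")
adapter = SharedDownAdapter(d_model=768, d_bottleneck=128, modalities=modalities)
rng = np.random.default_rng(1)
tokens = rng.normal(size=(4, 768))
adapted = adapter(tokens, "visual")

# Tunable weights with separate down- and up-projections per modality,
# versus one shared down-projection plus per-modality up-projections.
unshared_params = len(modalities) * 2 * 768 * 128
shared_params = adapter.num_params()

frames = rng.normal(size=(8, 16, 768))
weighted, weights = frame_aware_reweight(frames, frames[:, 0], rng.normal(size=768))
```

With these illustrative sizes, sharing the down-projection cuts the adapter weights from 589,824 to 393,216, which is the kind of saving the partial weight sharing is meant to deliver. The real ratio depends on the bottleneck width and on where the adapters are attached.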

EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

RQ1: Can a unified, parameter-efficient adapter framework support diverse cross-modal downstream tasks (retrieval and reasoning) across image and video modalities?
RQ2: Does knowledge sharing of the adapter components improve cross-modal transfer while reducing tunable parameters?
RQ3: How do query residuals and frame-aware attention impact cross-modal performance in video-language tasks?

KEY FINDINGS

UniAdapter achieves competitive or superior results while tuning only 1.0%–2.0% of the pre-trained model's parameters, with the backbone kept frozen.
Inserting adapters in the multimodal encoder yields stronger gains than unimodal visual or textual adapters alone.
Weight-sharing down-projection across modalities with modality-specific up-projections maintains performance while reducing tunable parameters.
Query-residual adaption and parameter-free frame-aware attention further improve cross-modal performance without adding parameters.
On MSRVTT retrieval, UniAdapter with 2.2% tunable parameters achieves 49.7% R@1, which is 2.0% above the latest competitors and surpasses some full fine-tuning baselines.
Across six benchmarks (video-text retrieval, image-text retrieval, VideoQA, VQA), UniAdapter generally outperforms prior parameter-efficient methods and matches or exceeds full fine-tuning in many cases.
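For readers unfamiliar with the R@1 figure quoted above, the following is a small sketch of how recall@k is commonly computed for text-to-video retrieval. It assumes a square similarity matrix in which query i's correct match is candidate i. The function name `recall_at_k` and the toy scores are made up for illustration and are not taken from the paper's evaluation code.

```python
import numpy as np


def recall_at_k(similarity, k=1):
    """Fraction of queries whose correct candidate is ranked in the top k.

    similarity[i, j] is the score of query i against candidate j;
    the ground-truth match for query i is assumed to be candidate i.
    """
    correct = np.diag(similarity)[:, None]
    # Rank of the correct item = number of candidates scoring strictly higher.
    ranks = (similarity > correct).sum(axis=1)
    return float((ranks < k).mean())


# Toy scores for three text queries against three videos.
sim = np.array([
    [0.9, 0.1, 0.2],   # query 0: correct video ranked first
    [0.3, 0.2, 0.8],   # query 1: correct video ranked last
    [0.1, 0.4, 0.7],   # query 2: correct video ranked first
])

r1 = recall_at_k(sim, k=1)   # 2 of 3 queries retrieve their match first
r3 = recall_at_k(sim, k=3)   # every match is within the top 3
```

A reported 49.7% R@1 therefore means that for roughly half of the test queries the correct item was the top-ranked result.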


This review was generated by AI and reviewed by a human editor.