QUICK REVIEW

[Paper Review] GraphAdapter: Tuning Vision-Language Models With Dual Knowledge Graph

Xin Li, Dongze Lian|arXiv (Cornell University)|Sep 24, 2023

Multimodal Machine Learning Applications|Cited by 21

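One-Sentence Summary

GraphAdapter introduces a dual-modality knowledge graph (textual and visual) to guide the textual adapter when fine-tuning vision-language models. It uses GCNs to fuse intra-modality and cross-modality structure and achieves better few-shot performance on 11 benchmarks.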

ABSTRACT

Adapter-style efficient transfer learning (ETL) has shown excellent performance in the tuning of vision-language models (VLMs) under the low-data regime, where only a few additional parameters are introduced to excavate the task-specific knowledge based on the general and powerful representation of VLMs. However, most adapter-style works face two limitations: (i) modeling task-specific knowledge with a single modality only; and (ii) overlooking the exploitation of the inter-class relationships in downstream tasks, thereby leading to sub-optimal solutions. To mitigate that, we propose an effective adapter-style tuning strategy, dubbed GraphAdapter, which performs the textual adapter by explicitly modeling the dual-modality structure knowledge (i.e., the correlation of different semantics/classes in textual and visual modalities) with a dual knowledge graph. In particular, the dual knowledge graph is established with two sub-graphs, i.e., a textual knowledge sub-graph, and a visual knowledge sub-graph, where the nodes and edges represent the semantics/classes and their correlations in two modalities, respectively. This enables the textual feature of each prompt to leverage the task-specific structure knowledge from both textual and visual modalities, yielding a more effective classifier for downstream tasks. Extensive experimental results on 11 benchmark datasets reveal that our GraphAdapter significantly outperforms previous adapter-based methods. The code will be released at https://github.com/lixinustc/GraphAdapter

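Motivation and Goals

Enable efficient transfer learning for VLMs in the low-data regime without tuning all parameters.
Model task-specific knowledge with both textual and visual structure knowledge.
Use a dual-modality graph, processed by graph convolutional networks (GCNs), to inform the textual adapter.
Demonstrate superior performance over prior adapter-based and prompt-based ETL methods on diverse datasets.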

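Proposed Method

Define a dual knowledge graph, consisting of a textual sub-graph and a visual sub-graph, that stores the semantics/classes and the relationships between them.
Build textual nodes from the mean prompt feature of each class and connect them with edges given by the cosine similarity of the textual features.
Build visual nodes from the mean visual feature of each class and connect them with edges given by the cosine similarity of the visual features.
Warp the text feature z_t through GCN transformations over the textual and visual graphs to obtain an enriched representation.
Fuse intra-modality and cross-modality structure knowledge with a learnable fusion weight beta, then apply a residual adapter with weight alpha.
Train only the GCNs, optimizing classification with a cross-entropy loss.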
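
The steps above can be sketched in a few lines of dependency-free Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`build_subgraph`, `gcn_query`, `graph_adapter`), the softmax row normalisation of the adjacency matrix, the single-layer GCN with a `tanh` activation, and the direction in which `alpha` and `beta` mix the features are all hypothetical choices made for the example. `alpha` and `beta` are passed as plain floats here, and a real implementation would use a tensor library and train the GCN weights with a cross-entropy loss.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_vector(vectors):
    """Average equal-length vectors (e.g. all prompt features of one class)."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def build_subgraph(per_class_features):
    """Nodes: unit-norm class means. Edges: pairwise cosine similarity."""
    nodes = [normalize(mean_vector(feats)) for feats in per_class_features]
    edges = [[dot(a, b) for b in nodes] for a in nodes]
    return nodes, edges

def gcn_query(z_t, nodes, edges, weight):
    """Attach z_t to the sub-graph as node 0, run one simplified
    graph-convolution step tanh(A @ X @ W), and return node 0's output."""
    z = normalize(z_t)
    sims = [dot(z, n) for n in nodes]
    # Extended adjacency: row/column 0 belongs to the query node.
    adj = [[1.0] + sims] + [[s] + row for s, row in zip(sims, edges)]
    adj = [softmax(row) for row in adj]  # ASSUMPTION: softmax row normalisation
    feats = [z] + nodes
    dim = len(z)
    # Neighbourhood aggregation: A @ X
    agg = [[sum(a * f[j] for a, f in zip(row, feats)) for j in range(dim)]
           for row in adj]
    # Projection and non-linearity: tanh((A @ X) @ W); weight is a list of rows
    out = [[math.tanh(dot(r, col)) for col in zip(*weight)] for r in agg]
    return out[0]

def graph_adapter(z_t, text_graph, visual_graph, w_text, w_visual,
                  alpha=0.7, beta=0.5):
    """Fuse the two graph-refined features with beta, then mix the result
    back into the original text feature with alpha (residual connection)."""
    z_tt = gcn_query(z_t, *text_graph, w_text)      # textual sub-graph
    z_vt = gcn_query(z_t, *visual_graph, w_visual)  # visual sub-graph
    fused = [beta * a + (1 - beta) * b for a, b in zip(z_tt, z_vt)]
    z = normalize(z_t)
    # ASSUMPTION: alpha weights the original feature, (1 - alpha) the fused one
    return [alpha * o + (1 - alpha) * f for o, f in zip(z, fused)]

# Toy usage: 2 classes, 2-dimensional features (made-up numbers).
text_graph = build_subgraph([[[1.0, 0.0], [0.8, 0.2]], [[0.0, 1.0]]])
visual_graph = build_subgraph([[[0.9, 0.1]], [[0.2, 0.8]]])
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for trained GCN weights
adapted = graph_adapter([3.0, 4.0], text_graph, visual_graph,
                        identity, identity)
```

With a single GCN layer, the query node's output depends only on its own row of the adjacency matrix, so the class-to-class edges would start to matter for the query only if further layers were stacked. Setting `alpha=1.0` in this sketch returns the normalized input feature unchanged, which is a convenient sanity check for the residual path.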

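Experimental Results

Research Questions

RQ1: Can an explicit dual-modality structure graph improve the extraction of task-specific knowledge from VLMs in few-shot settings?
RQ2: How does integrating the textual graph, the visual graph, and their interaction affect the quality of the textual adapter?
RQ3: What is the relative importance of textual versus visual structure knowledge for downstream classification?

Key Findings

GraphAdapter consistently outperforms prior ETL methods (e.g., prompt-style and adapter-style approaches) on 11 few-shot benchmarks.
In the 16-shot evaluation, GraphAdapter reaches an average of 76.22% (versus 75.65–76.87% for some baselines) and also shows notable gains on fine-grained datasets such as FGVCAircraft.
Ablation studies show that the textual knowledge sub-graph matters more than the visual sub-graph, but modeling both gives the best results.
GraphAdapter generalizes across several CLIP backbones (ResNet-50/101, ViT-B/32, ViT-B/16) and retains its gains in cross-domain tests (ImageNet-V2, -Sketch, -A, -R).
Dual-modality structure knowledge, enabled by GCNs and combined through residual fusion, is the key to the performance gain over prior adapters.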


This review was generated by AI and checked by a human editor.