QUICK REVIEW

[Paper Review] Visual In-Context Learning for Large Vision-Language Models

Yucheng Zhou, Xiang Li | arXiv (Cornell University) | Feb 18, 2024

Multimodal Machine Learning Applications | Cited by 5

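One-Sentence Summary

The paper introduces Visual In-Context Learning (VICL), which combines Visual Demonstration Retrieval, Intent-Oriented Image Summarization, and Intent-Oriented Demonstration Composition to improve the cross-modal reasoning of LVLMs and to enable in-context unlearning.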

ABSTRACT

In Large Visual Language Models (LVLMs), the efficacy of In-Context Learning (ICL) remains limited by challenges in cross-modal interactions and representation disparities. To overcome these challenges, we introduce a novel Visual In-Context Learning (VICL) method comprising Visual Demonstration Retrieval, Intent-Oriented Image Summarization, and Intent-Oriented Demonstration Composition. Our approach retrieves images via "Retrieval & Rerank" paradigm, summarises images with task intent and task-specific visual parsing, and composes language-based demonstrations that reduce token count and alleviate cross-modal interaction problem. Experimental evaluations on five visual reasoning datasets demonstrate the effectiveness of our method. Moreover, our extensive experiments leverage information flow analysis to elucidate the effectiveness of our method, and investigate the impact of length and position of demonstrations for LVLM. The use of in-context unlearning further shows promise in resetting specific model knowledge without retraining.

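Motivation and Objectives

Identify and address the cross-modal interaction problems and representation disparities that limit in-context learning (ICL) in LVLMs.
Propose VICL, which consists of three components: Visual Demonstration Retrieval, Intent-Oriented Image Summarization, and Intent-Oriented Demonstration Composition.
Show that VICL improves LVLM accuracy on five visual reasoning datasets, and analyze information flow as well as the length and ordering of demonstrations.
Demonstrate in-context unlearning without retraining the model.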

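Proposed Method

Visual Demonstration Retrieval uses a pretrained image encoder to retrieve candidate demonstrations, then applies text-based reranking with a VL-Enc model to improve relevance.
Intent-Oriented Image Summarization (IOIS) generates, from image-question-answer triples, visual summaries aligned with the task intent, reducing the cognitive load on the LVLM.
Intent-Oriented Demonstration Composition (IODC) replaces the images in demonstrations with their summaries and concatenates S_i, Q_i, and A_i into a unified demonstration, enriching the context within the token length limit.
Information flow analysis (saliency based on a Taylor expansion) assesses how VICL shifts attention and information across layers and heads.
In-context unlearning experiments test whether the model can discard mislabeled information through demonstrations, without retraining.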
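
The retrieve, rerank, and compose steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. It assumes image embeddings are already computed, uses a toy token-overlap score as a stand-in for the VL-Enc reranker, and takes the query text used for reranking as given. The `Demo` class, the function names, the prompt template, and the `<image>` placeholder are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Demo:
    image_emb: List[float]  # precomputed image embedding (assumed given)
    summary: str            # S_i: intent-oriented image summary
    question: str           # Q_i
    answer: str             # A_i


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query_emb, pool, n):
    """Stage 1: keep the n demos whose image embeddings are closest to the query."""
    return sorted(pool, key=lambda d: cosine(query_emb, d.image_emb), reverse=True)[:n]


def text_overlap(a, b):
    """Toy stand-in for a learned reranker: Jaccard overlap of lowercase tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def rerank(query_text, candidates, k):
    """Stage 2: reorder the retrieved candidates by text similarity and keep k."""
    return sorted(candidates, key=lambda d: text_overlap(query_text, d.summary), reverse=True)[:k]


def compose(demos, query_question):
    """Build a text-only demonstration context: each demo image is replaced by its summary."""
    blocks = [
        f"Summary: {d.summary}\nQuestion: {d.question}\nAnswer: {d.answer}"
        for d in demos
    ]
    # The query keeps its image; "<image>" is a hypothetical placeholder token.
    blocks.append(f"Image: <image>\nQuestion: {query_question}\nAnswer:")
    return "\n\n".join(blocks)


pool = [
    Demo([1.0, 0.0], "a red apple on a table", "What fruit is shown?", "apple"),
    Demo([0.9, 0.1], "a dog running in a park", "What animal is shown?", "dog"),
    Demo([0.0, 1.0], "a blue car on a road", "What vehicle is shown?", "car"),
]
candidates = retrieve([1.0, 0.05], pool, n=2)
selected = rerank("a dog sitting in a park", candidates, k=1)
prompt = compose(selected, "What animal is shown?")
```

Replacing each demonstration image with a short summary keeps every demonstration in the text modality, which is how the method fits more demonstrations into the same token budget.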

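Experimental Results

Research Questions

RQ1: Does VICL outperform standard ICL and zero-shot prompting across multiple LVLMs and visual reasoning datasets?
RQ2: How do visual demonstration retrieval, image summarization, and demonstration composition jointly contribute to the performance gains?
RQ3: How do demonstration length, demonstration order, and the type of visual summary affect LVLMs?
RQ4: Can VICL effectively achieve in-context unlearning without updating the model?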

Key Findings

VICL consistently outperforms zero-shot prompting and ICL across all four LVLMs and five datasets.
IOIS-based summarization and its variants yield the best results, with IOIS itself achieving the highest gains.
Increasing the number of demonstrations generally benefits VICL more than ICL, with diminishing returns for ICL.
Demonstration order, especially head and tail positions, significantly affects accuracy across datasets.
In-context unlearning: VICL achieves the highest unlearning accuracy, showing robustness to mislabeled demonstrations.


This review was generated by AI and reviewed by human editors.