QUICK REVIEW

[Paper Review] Self-Chained Image-Language Model for Video Localization and Question Answering

Shoubin Yu, Jaemin Cho|arXiv (Cornell University)|May 11, 2023

Multimodal Machine Learning Applications | Cited by 25

One-Sentence Summary

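SeViLA uses a single image-language model (BLIP-2) to jointly localize language-aware keyframes in a video and answer questions about it; by chaining a forward localization pass with a reverse self-refinement pass, it reaches state-of-the-art results on several video QA benchmarks.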

ABSTRACT

Recent studies have shown promising results on utilizing large pre-trained image-language models for video question answering. While these image-language models can efficiently bootstrap the representation learning of video-language models, they typically concatenate uniformly sampled video frames as visual inputs without explicit language-aware, temporal modeling. When only a portion of a video input is relevant to the language query, such uniform frame sampling can often lead to missing important visual cues. Although humans often find a video moment to focus on and rewind the moment to answer questions, training a query-aware video moment localizer often requires expensive annotations and high computational costs. To address this issue, we propose Self-Chained Video Localization-Answering (SeViLA), a novel framework that leverages a single image-language model (BLIP-2) to tackle both temporal keyframe localization and QA on videos. SeViLA framework consists of two modules: Localizer and Answerer, where both are parameter-efficiently fine-tuned from BLIP-2. We propose two ways of chaining these modules for cascaded inference and self-refinement. First, in the forward chain, the Localizer finds multiple language-aware keyframes in a video, which the Answerer uses to predict the answer. Second, in the reverse chain, the Answerer generates keyframe pseudo-labels to refine the Localizer, alleviating the need for expensive video moment localization annotations. Our SeViLA framework outperforms several strong baselines on 5 challenging video QA and event prediction benchmarks, and achieves the state-of-the-art in both fine-tuning (NExT-QA, STAR) and zero-shot (NExT-QA, STAR, How2QA, VLEP) settings. We also analyze the impact of Localizer, comparisons of Localizer with other temporal localization models, pre-training/self-refinement of Localizer, and varying the number of keyframes.

Motivation and Objectives

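Advance efficient video-language learning by equipping a pre-trained image-language model with temporal localization.
Introduce a language-aware keyframe Localizer and a question-answering Answerer, both fine-tuned from BLIP-2.
Enable self-refinement through a forward chain (Localizer -> Answerer) and a reverse chain (pseudo-label-based refinement of the Localizer).
Demonstrate strong performance on multiple video QA and event prediction benchmarks in both fine-tuning and zero-shot settings.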

Proposed Method

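BLIP-2 serves as the backbone. The image encoder and the large language model are frozen; only the Q-Formers and the linear layer of each module are fine-tuned.
The Localizer uses a language-aware prompt and the LLM to score each uniformly sampled frame for its relevance to answering the question, then selects the top-K frames as keyframes.
The Answerer concatenates the features of the selected keyframes and feeds them to the LLM to generate a video-level answer.
The forward chain trains the Answerer on keyframes supplied by the Localizer to improve QA performance.
The reverse chain generates frame-level pseudo-labels from the Answerer's outputs, so the Localizer can be refined without explicit frame-level annotations.
The Localizer is pre-trained on moment retrieval data (QVHighlights) to provide a frame-level localization prior.
Together, the two chains (forward inference and reverse refinement) yield better temporal localization and QA accuracy.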
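
The two chains above can be illustrated with a small, framework-free sketch. It is not the authors' implementation: the function names, the toy scores, and the toy answerer are hypothetical, and no BLIP-2 model is called. Two assumptions go beyond this summary: the selected keyframes are kept in temporal order, and a frame receives pseudo-label 1 when the Answerer predicts the correct answer from that frame alone.

```python
from typing import Callable, List, Sequence


def select_keyframes(scores: Sequence[float], k: int) -> List[int]:
    """Forward chain step: pick the k highest-scoring frame indices.

    The indices are returned in temporal order (an assumption of this sketch).
    """
    if k <= 0:
        return []
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def pseudo_label_frames(
    frames: Sequence[str],
    answer_from_frame: Callable[[str], str],
    gold_answer: str,
) -> List[int]:
    """Reverse chain step: label a frame 1 if the answerer is correct
    when given that single frame, else 0 (assumed labeling rule)."""
    return [1 if answer_from_frame(f) == gold_answer else 0 for f in frames]


# Toy data: 8 uniformly sampled frames with made-up relevance scores.
frame_scores = [0.10, 0.85, 0.20, 0.90, 0.05, 0.70, 0.30, 0.15]
keyframes = select_keyframes(frame_scores, k=3)

# Toy answerer: a lookup table standing in for a per-frame model prediction.
toy_predictions = {"f0": "B", "f1": "A", "f2": "C", "f3": "A"}
labels = pseudo_label_frames(
    list(toy_predictions),
    lambda frame: toy_predictions[frame],
    gold_answer="A",
)

print(keyframes)  # [1, 3, 5]
print(labels)     # [0, 1, 0, 1]
```

In a real pipeline the scores would come from the Localizer and the per-frame predictions from the Answerer; the resulting binary labels would then serve as training targets for refining the Localizer.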

Experimental Results

Research Questions

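RQ1: Can a single image-language model be adapted to perform both temporal localization and QA on videos?
RQ2: Does language-aware keyframe selection outperform uniform frame sampling on video QA and event prediction?
RQ3: Can pseudo-labels derived from QA outputs effectively refine the language-aware Localizer without frame-level annotations?
RQ4: How does pre-training the Localizer on video moment retrieval data affect downstream QA performance?
RQ5: How does SeViLA perform across multiple benchmarks in the fine-tuning and zero-shot settings?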

Key Findings

SeViLA outperforms several strong baselines on five video QA and event prediction benchmarks.
With a zero-shot Localizer and Answerer, SeViLA sets a new state of the art in the zero-shot setting on NExT-QA, STAR, How2QA, and VLEP.
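Self-refinement with pseudo-labels consistently improves Localizer performance across tasks (the ablations report average gains).
Temporal localization with language-aware keyframes significantly improves QA accuracy over uniform frame sampling, especially on tasks that depend heavily on temporal reasoning.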
The Localizer also works as a strong standalone moment retrieval model, showing competitive results even though it lacks explicit temporal modeling.

This review was generated by AI and checked by a human editor.