QUICK REVIEW

[Paper Review] Self-Chained Image-Language Model for Video Localization and Question Answering

Shoubin Yu, Jaemin Cho | arXiv (Cornell University) | May 11, 2023

Multimodal Machine Learning Applications | Citations: 25

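One-line summary

SeViLA uses a single image-language model (BLIP-2) to jointly localize language-aware keyframes in a video and answer questions about it. By combining a forward chain (localization followed by answering) with a reverse chain (self-refinement of the Localizer), it achieves state-of-the-art performance on several video QA benchmarks.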

ABSTRACT

Recent studies have shown promising results on utilizing large pre-trained image-language models for video question answering. While these image-language models can efficiently bootstrap the representation learning of video-language models, they typically concatenate uniformly sampled video frames as visual inputs without explicit language-aware, temporal modeling. When only a portion of a video input is relevant to the language query, such uniform frame sampling can often lead to missing important visual cues. Although humans often find a video moment to focus on and rewind the moment to answer questions, training a query-aware video moment localizer often requires expensive annotations and high computational costs. To address this issue, we propose Self-Chained Video Localization-Answering (SeViLA), a novel framework that leverages a single image-language model (BLIP-2) to tackle both temporal keyframe localization and QA on videos. SeViLA framework consists of two modules: Localizer and Answerer, where both are parameter-efficiently fine-tuned from BLIP-2. We propose two ways of chaining these modules for cascaded inference and self-refinement. First, in the forward chain, the Localizer finds multiple language-aware keyframes in a video, which the Answerer uses to predict the answer. Second, in the reverse chain, the Answerer generates keyframe pseudo-labels to refine the Localizer, alleviating the need for expensive video moment localization annotations. Our SeViLA framework outperforms several strong baselines on 5 challenging video QA and event prediction benchmarks, and achieves the state-of-the-art in both fine-tuning (NExT-QA, STAR) and zero-shot (NExT-QA, STAR, How2QA, VLEP) settings. We also analyze the impact of Localizer, comparisons of Localizer with other temporal localization models, pre-training/self-refinement of Localizer, and varying the number of keyframes.

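Research motivation and goals

Leverage pre-trained image-language models together with temporal localization for efficient video-language learning.
Introduce a language-aware keyframe Localizer and a question Answerer, both fine-tuned from BLIP-2.
Enable self-refinement through a forward chain (Localizer -> Answerer) and a reverse chain (pseudo-label-based refinement of the Localizer).
Demonstrate strong performance on multiple video QA and event prediction benchmarks in both fine-tuning and zero-shot settings.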

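Proposed method

BLIP-2 serves as the backbone, with the image encoder and the LLM kept frozen; only the Q-Formers and one linear layer per module are fine-tuned.
The Localizer takes uniformly sampled frames, scores each frame's relevance to the question using a language-conditioned prompt and the LLM, and selects the top-K language-aware keyframes.
The Answerer concatenates the features of the selected keyframes and feeds them to the LLM to generate a video-level answer.
In the forward chain, the Answerer is trained on the keyframes chosen by the Localizer, which improves QA performance.
In the reverse chain, frame-level pseudo-labels are generated from the Answerer's outputs and used to refine the Localizer without explicit frame-level annotations.
The Localizer is pre-trained on moment retrieval data (QVHighlights), which gives it a frame-level localization prior.
Together, the two stages of self-chaining (forward inference and reverse refinement) yield better temporal localization and higher QA accuracy.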
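
The selection and pseudo-labeling steps described above can be illustrated with a small sketch. This is not the authors' implementation: the function names, the toy relevance scores, and the per-frame answers are all hypothetical, and the rule of marking a frame as a keyframe when the Answerer answers correctly from that frame alone is our reading of the reverse chain, not something this review spells out. In the real system the scores and answers would come from BLIP-2 modules, which are omitted here.

```python
from typing import List, Sequence


def uniform_sample_indices(num_frames: int, num_samples: int) -> List[int]:
    """Pick roughly evenly spaced frame indices (the Localizer's input frames)."""
    if num_samples >= num_frames:
        return list(range(num_frames))
    step = num_frames / num_samples
    # Take the middle frame of each of the num_samples equal-length segments.
    return [int(step * i + step / 2) for i in range(num_samples)]


def select_keyframes(scores: Sequence[float], k: int) -> List[int]:
    """Forward chain: keep the k highest-scoring frames, returned in temporal order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def pseudo_label_frames(frame_answers: Sequence[str], gold_answer: str) -> List[int]:
    """Reverse chain (assumed rule): label a frame 1 if the answer predicted
    from that single frame matches the ground-truth answer, else 0."""
    return [1 if answer == gold_answer else 0 for answer in frame_answers]


if __name__ == "__main__":
    # Hypothetical question-relevance scores for six sampled frames.
    toy_scores = [0.1, 0.8, 0.3, 0.9, 0.05, 0.6]
    print(uniform_sample_indices(32, 4))      # [4, 12, 20, 28]
    print(select_keyframes(toy_scores, 3))    # [1, 3, 5]
    # Hypothetical per-frame answers to a multiple-choice question whose answer is "A".
    print(pseudo_label_frames(["A", "B", "A", "C"], "A"))  # [1, 0, 1, 0]
```

Returning the selected indices in temporal order is a design choice of this sketch, made so that the concatenated keyframe features keep their original ordering in the video.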

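Experimental results

Research questions

RQ1: Can a single image-language model be repurposed to perform both temporal localization and QA on videos?
RQ2: Does language-aware keyframe selection improve video QA and event prediction over uniform frame sampling?
RQ3: Can pseudo-labels derived from QA outputs efficiently refine the language-aware Localizer without frame-level annotations?
RQ4: How does pre-training the Localizer on video moment retrieval data affect downstream QA performance?
RQ5: How does SeViLA perform across multiple benchmarks in fine-tuning and zero-shot settings?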

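Key results

SeViLA outperforms several strong baselines on five video QA and event prediction benchmarks.
In the zero-shot setting, the zero-shot Localizer + Answerer sets a new state of the art on NExT-QA, STAR, How2QA, and VLEP.
Self-refinement with pseudo-labels consistently improves the Localizer across tasks (the average gains are reported in the ablations).
Temporal localization with language-aware keyframes substantially improves QA accuracy over uniform frame sampling, especially on tasks that demand more temporal reasoning.
The Localizer also works as a strong standalone moment retrieval model with competitive results, even though its pre-training involves no explicit temporal modeling.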


This review was generated by AI and reviewed by a human editor.