QUICK REVIEW

[Paper Review] Valley: Video Assistant with Large Language model Enhanced abilitY

Ruipu Luo, Ziwang Zhao|arXiv (Cornell University)|2023. 06. 12.

Multimodal Machine Learning Applications | Citations: 30

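One-Line Summary

Valley is a multimodal foundation model that fuses video, image, and language through a simple projection bridge, using a large language model backbone to enable video-grounded instruction following and dialogue. It is trained with a two-stage pipeline of pretraining followed by instruction tuning, which includes a 100k-video instruction dataset.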

ABSTRACT

Large Language Models (LLMs), with remarkable conversational capability, have emerged as AI assistants that can handle both visual and textual modalities. However, their effectiveness in joint video and language understanding has not been extensively explored. In the paper, we introduce Valley, a multi-modal foundation model that is designed to enable enhanced video comprehension and instruction-following capabilities. To this end, we construct two datasets, namely Valley-702k and Valley-instruct-73k, to cover a diverse range of video-text alignment and video-based instruction tasks, such as multi-shot captions, long video descriptions, action recognition, causal inference, etc. Then, we adopt ViT-L/14 as the vision encoder and explore three different temporal modeling modules to learn multifaceted features for enhanced video understanding. In addition, we implement a two-phase training approach for Valley: the first phase focuses solely on training the projection module to facilitate the LLM's capacity to understand visual input, and the second phase jointly trains the projection module and the LLM to improve their instruction following ability. Extensive experiments demonstrate that Valley has the potential to serve as an effective video assistant, simplifying complex video-understanding scenarios. Our code and data are published anonymously at https://github.com/valley-vl/Valley.
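The two-phase schedule described in the abstract can be illustrated with a minimal sketch of which components receive gradient updates in each phase. The `Module` class, the `configure_phase` helper, and the component names are hypothetical stand-ins, not taken from the Valley codebase. Keeping the vision encoder frozen in both phases is an assumption inferred from the abstract, which names only the projection module and the LLM as trained parts.

```python
from dataclasses import dataclass


@dataclass
class Module:
    """Hypothetical stand-in for a model component with a trainable flag."""
    name: str
    trainable: bool = False


def configure_phase(modules, phase):
    """Mark which components are updated in the given training phase.

    Phase 1: only the projection module is trained.
    Phase 2: the projection module and the LLM are trained jointly.
    Assumption: the vision encoder stays frozen in both phases.
    """
    if phase == 1:
        trainable = {"projection"}
    elif phase == 2:
        trainable = {"projection", "llm"}
    else:
        raise ValueError(f"unknown phase: {phase}")
    for module in modules.values():
        module.trainable = module.name in trainable
    # Return the trainable component names in a stable order.
    return sorted(name for name, m in modules.items() if m.trainable)


modules = {n: Module(n) for n in ("vision_encoder", "projection", "llm")}
phase1 = configure_phase(modules, 1)
phase2 = configure_phase(modules, 2)
print(phase1)  # ['projection']
print(phase2)  # ['llm', 'projection']
```

In a real framework the same idea is usually expressed by toggling gradient tracking per parameter group rather than a boolean flag on a wrapper object.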

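Motivation and Objectives

Argues for the need for general video-based multimodal understanding and for the feasibility of a model that is not confined to specific tasks.
Proposes Valley, a video-image-language foundation model connected by a projection layer.
Builds a high-quality, ChatGPT-assisted instruction dataset for learning multi-task video understanding.
Adopts a two-stage training pipeline for vision-language alignment: projection pretraining followed by end-to-end fine-tuning.
Demonstrates state-of-the-art zero-shot performance for Valley on video QA and captioning benchmarks.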

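Proposed Method

Uses ViT-L/14 (CLIP) as the vision encoder to extract frame features.
Introduces temporal modeling modules in three structures (v1, v2, v3) to aggregate temporal information.
Connects vision and language through a simple projection layer before feeding the LLM (Stable-Vicuna).
Constructs a 100k-video instruction dataset with ChatGPT-assisted prompts covering detailed description, conversation, and complex reasoning.
Trains in two stages: (1) pretrain the projection module on image-text and video-text pairs; (2) fine-tune the projection module and the LLM end to end on 234k image/video instruction samples.
Evaluates Valley on multiple video QA and multimodal benchmarks in zero-shot and few-shot settings.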
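The flow from frame features to LLM input listed above can be sketched in plain Python. This is a toy illustration under stated assumptions, not the Valley implementation: mean pooling over frames is used as one simple way to aggregate temporal information and is not claimed to match any of the v1, v2, or v3 modules; the function names, the toy dimensions, and the choice to emit a single video token are all hypothetical.

```python
import random


def mean_pool_frames(frame_feats):
    """Average per-frame feature vectors over time into one video vector."""
    n_frames = len(frame_feats)
    dim = len(frame_feats[0])
    return [sum(f[d] for f in frame_feats) / n_frames for d in range(dim)]


def project(vec, weight, bias):
    """Linear projection; weight has shape (out_dim, in_dim), bias has out_dim."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weight, bias)]


def build_llm_input(frame_feats, text_embeds, weight, bias):
    """Prepend the projected video token to the text token embeddings."""
    video_token = project(mean_pool_frames(frame_feats), weight, bias)
    return [video_token] + text_embeds


# Toy sizes (assumptions); real encoders and LLMs use far larger dimensions.
VIS_DIM, LLM_DIM, N_FRAMES, N_TEXT = 4, 3, 8, 5
rng = random.Random(0)
frames = [[rng.gauss(0, 1) for _ in range(VIS_DIM)] for _ in range(N_FRAMES)]
weight = [[rng.gauss(0, 0.1) for _ in range(VIS_DIM)] for _ in range(LLM_DIM)]
bias = [0.0] * LLM_DIM
text = [[0.0] * LLM_DIM for _ in range(N_TEXT)]

seq = build_llm_input(frames, text, weight, bias)
print(len(seq), len(seq[0]))  # 6 3
```

The point of the sketch is the shape contract: whatever the temporal module produces in the vision feature space, the projection maps it into the LLM embedding space so visual and text tokens can share one input sequence.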

Experimental Results

Research Questions

RQ1: Can a single multi-modal foundation model comprehend video, image, and language and interact via natural language?
RQ2: Does a simple projection bridge suffice to align visual features with an LLM for robust video-grounded instruction following?
RQ3: How does Valley perform on zero-shot and few-shot video QA, captioning, and multimodal reasoning compared to state-of-the-art baselines?
RQ4: What is the impact of different temporal modeling strategies on long vs. short video understanding?

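Key Results

Among the reported methods, Valley achieves state-of-the-art zero-shot performance on the MSVD-QA, MSRVTT-QA, and ActivityNet-QA benchmarks.
Valley-v3 excels on longer videos (MSRVTT-QA and ActivityNet-QA), while Valley-v1 performs best on short videos (MSVD-QA).
On the video-based generation benchmark, Valley-v3 leads in correctness, context understanding, temporal understanding, and consistency.
Valley shows competitive chain-of-thought and few-shot ability on ScienceQA, occasionally outperforming GPT-3.5 in certain settings.
The three proposed temporal modeling variants capture temporal information effectively, with v3 showing an advantage on longer sequences.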


This review was generated by AI and reviewed by a human editor.