QUICK REVIEW

[Paper Review] VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

Zesen Cheng, Sicong Leng | arXiv (Cornell University) | 2024. 06. 11.

Music and Audio Processing · Citations: 10

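One-Line Summary

VideoLLaMA 2 introduces a Spatial-Temporal Convolution (STC) connector and a jointly trained audio branch to improve multimodal video understanding; on MC-VQA, OE-VQA, and video captioning it is competitive among open-source models and approaches some proprietary models.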

ABSTRACT

In this paper, we present the VideoLLaMA 2, a set of Video Large Language Models (Video-LLMs) designed to enhance spatial-temporal modeling and audio understanding in video and audio-oriented tasks. Building upon its predecessor, VideoLLaMA 2 incorporates a tailor-made Spatial-Temporal Convolution (STC) connector, which effectively captures the intricate spatial and temporal dynamics of video data. Additionally, we integrate an Audio Branch into the model through joint training, thereby enriching the multimodal understanding capabilities of the model by seamlessly incorporating audio cues. Comprehensive evaluations on multiple-choice video question answering (MC-VQA), open-ended video question answering (OE-VQA), and video captioning (VC) tasks demonstrate that VideoLLaMA 2 consistently achieves competitive results among open-source models and even gets close to some proprietary models on several benchmarks. Furthermore, VideoLLaMA 2 exhibits reasonable improvements in audio-only and audio-video question-answering (AQA & OE-AVQA) benchmarks over existing models. These advancements underline VideoLLaMA 2's superior performance in multimodal comprehension, setting a new standard for intelligent video analysis systems. All models are public to facilitate further research.

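Research Motivation and Goals

Improve video-language understanding by better capturing the spatial-temporal dynamics of video data.
Improve audio-visual integration through a jointly trained audio branch.
Keep the vision and audio branches partially decoupled while enabling cross-modal reasoning in the LLM, so that training stays modular.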

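Proposed Method

Adopt a dual-branch architecture whose Vision-Language Branch uses an image-level CLIP backbone (ViT-L/14) and a dedicated STC connector for spatial-temporal representation learning.
Implement an Audio-Language Branch that uses BEATs as the audio encoder and an MLP to align audio features with the LLM dimension.
Introduce the Spatial-Temporal Convolution (STC) connector, built from two RegStage blocks and a 3D downsampler, which preserves token order while reducing the token count.
Keep the visual encoder frozen and fine-tune the STC connector and the language model during video-language pre-training and multi-task fine-tuning.
Train in multiple stages: pre-training on image/video-text data, video-language multi-task fine-tuning, audio-language pre-training, and audio-video joint training.
Evaluate zero-shot performance on MC-VQA, OE-VQA, VC, and AQA/OE-AVQA benchmarks and compare against open-source and proprietary baselines.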
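
To make the token reduction performed by a 3D downsampler concrete, the sketch below counts visual tokens before and after downsampling. It is a rough illustration, not the authors' implementation: the function name stc_token_count, the stride of (2, 2, 2), the 8 sampled frames, and the 24x24 patch grid (what a ViT-L/14 encoder would produce for a 336x336 input) are all assumptions that this review does not state.

```python
from math import ceil


def stc_token_count(frames, grid_h, grid_w, stride=(2, 2, 2)):
    """Count tokens left after a 3D downsampler with (time, height, width) strides.

    Hypothetical helper for illustration; the stride values are assumptions.
    """
    stride_t, stride_h, stride_w = stride
    # Ceil division assumes the downsampler pads partial windows
    # so that they still emit one token each.
    out_t = ceil(frames / stride_t)
    out_h = ceil(grid_h / stride_h)
    out_w = ceil(grid_w / stride_w)
    return out_t * out_h * out_w


# Assumed setup: 8 frames, each encoded as a 24x24 grid of patch tokens.
tokens_before = 8 * 24 * 24
tokens_after = stc_token_count(8, 24, 24)
print(tokens_before, tokens_after, tokens_before // tokens_after)
```

Under these assumptions, halving each of the three axes cuts the number of tokens passed to the language model by a factor of eight, while the surviving tokens keep their relative order in time and space.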

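Experimental Results

Research Questions

RQ1 How can a dedicated Spatial-Temporal Convolution connector improve the fusion of spatial-temporal information in video-language models?
RQ2 What improvements does adding a jointly trained audio branch bring to VideoLLaMA 2's multimodal understanding and cross-modal reasoning?
RQ3 What are VideoLLaMA 2's relative gains over open-source and proprietary Video-LLMs on MC-VQA, OE-VQA, VC, and audio-visual tasks?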

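Key Results

With 7B and 8x7B backbones, VideoLLaMA 2 is competitive with open-source models on MC-VQA and outperforms some proprietary models on specific benchmarks.
On the EgoSchema, Perception-Test, and MV-Bench MC-VQA tasks, VideoLLaMA 2-7B improves on previous open-source SOTA models (such as LLaVA-NeXT-Video) and beats GPT4-V on MV-Bench.
On video captioning (MSVC), VideoLLaMA 2 achieves higher correctness and detailedness than all other open-source models, although GPT4-V remains stronger on some metrics.
On OE-VQA, VideoLLaMA 2 generally outperforms several open-source baselines and is on par with LLaVA-NeXT-Video on the MSVD and Video-ChatGPT benchmarks.
VideoLLaMA 2 performs strongly on audio-language and audio-visual understanding benchmarks, supported by the audio-video joint training stage.
Scaling the LLM backbone from 7B to Mixtral-8x7B yields a noticeable gain in MC-VQA performance.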


This review was generated by AI and reviewed by a human editor.