QUICK REVIEW

[Paper Review] VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

Zesen Cheng, Sicong Leng|arXiv (Cornell University)|Jun 11, 2024

Music and Audio Processing | Citations: 10

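One-Line Summary

VideoLLaMA 2 introduces a Spatial-Temporal Convolution (STC) connector and a jointly trained Audio Branch to improve multimodal video understanding. It achieves results competitive with open-source models on MC-VQA, OE-VQA, and video captioning, and approaches some proprietary models.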

ABSTRACT

In this paper, we present the VideoLLaMA 2, a set of Video Large Language Models (Video-LLMs) designed to enhance spatial-temporal modeling and audio understanding in video and audio-oriented tasks. Building upon its predecessor, VideoLLaMA 2 incorporates a tailor-made Spatial-Temporal Convolution (STC) connector, which effectively captures the intricate spatial and temporal dynamics of video data. Additionally, we integrate an Audio Branch into the model through joint training, thereby enriching the multimodal understanding capabilities of the model by seamlessly incorporating audio cues. Comprehensive evaluations on multiple-choice video question answering (MC-VQA), open-ended video question answering (OE-VQA), and video captioning (VC) tasks demonstrate that VideoLLaMA 2 consistently achieves competitive results among open-source models and even gets close to some proprietary models on several benchmarks. Furthermore, VideoLLaMA 2 exhibits reasonable improvements in audio-only and audio-video question-answering (AQA & OE-AVQA) benchmarks over existing models. These advancements underline VideoLLaMA 2's superior performance in multimodal comprehension, setting a new standard for intelligent video analysis systems. All models are public to facilitate further research.

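Motivation and Objectives

Strengthen video-language understanding by better capturing the spatial-temporal dynamics of video data.
Improve audio-visual integration through a jointly trained audio branch.
Preserve modular training by keeping the vision and audio branches partially decoupled, while still enabling cross-modal reasoning in the LLM.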

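Proposed Method

Adopts a dual-branch architecture whose Vision-Language Branch uses an image-level CLIP backbone (ViT-L/14) and a dedicated STC connector for spatial-temporal representation learning.
Implements an Audio-Language Branch that uses BEATs as the audio encoder and an MLP that aligns audio features to the LLM dimension.
The Spatial-Temporal Convolution (STC) connector consists of two RegStage blocks and a 3D downsampler; it preserves token order while reducing the token count.
The visual encoder is kept frozen, while the STC connector and the language model are fine-tuned during video-language pre-training and multi-task fine-tuning.
Training is multi-stage: pre-training on image-video-text data, video-language multi-task fine-tuning, audio-language pre-training, and joint audio-video training.
Zero-shot performance is evaluated on MC-VQA, OE-VQA, VC, and AQA/OE-AVQA benchmarks and compared against open-source and proprietary baselines.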
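
To make the token-reduction idea behind the STC connector concrete, the following is a minimal sketch, not the authors' implementation. It assumes a 24 x 24 patch grid per frame (what a ViT-L/14 would produce at a 336-pixel input, which this review does not state), 8 sampled frames, and a hypothetical stride of 2 along time, height, and width. The learned 3D downsampler is replaced here by plain average pooling, and the RegStage blocks are omitted; all names (Downsample3D, tokens_before, tokens_after, avg_pool_3d) are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Downsample3D:
    """Hypothetical strides (time, height, width); not taken from the paper."""
    t: int = 2
    h: int = 2
    w: int = 2


def tokens_before(frames: int, grid_h: int, grid_w: int) -> int:
    """Visual tokens if every patch of every frame is passed to the LLM."""
    return frames * grid_h * grid_w


def tokens_after(frames: int, grid_h: int, grid_w: int, ds: Downsample3D) -> int:
    """Tokens left after non-overlapping, unpadded 3D downsampling."""
    return (frames // ds.t) * (grid_h // ds.h) * (grid_w // ds.w)


def avg_pool_3d(video, ds: Downsample3D):
    """Average non-overlapping (t, h, w) blocks of video[t][y][x].

    Stand-in for a learned 3D downsampler: blocks are visited in
    time-major, then row, then column order, so the relative order
    of the output tokens matches the order of the input tokens.
    """
    T, H, W = len(video), len(video[0]), len(video[0][0])
    out = []
    for t0 in range(0, T - ds.t + 1, ds.t):
        plane = []
        for y0 in range(0, H - ds.h + 1, ds.h):
            row = []
            for x0 in range(0, W - ds.w + 1, ds.w):
                block = [
                    video[t][y][x]
                    for t in range(t0, t0 + ds.t)
                    for y in range(y0, y0 + ds.h)
                    for x in range(x0, x0 + ds.w)
                ]
                row.append(sum(block) / len(block))
            plane.append(row)
        out.append(plane)
    return out


if __name__ == "__main__":
    ds = Downsample3D()
    print(tokens_before(8, 24, 24), tokens_after(8, 24, 24, ds))
```

Under these assumed settings, 8 frames of 24 x 24 patches give 4608 tokens before the connector and 576 after it, an 8x reduction. The real connector learns its downsampling weights, so this sketch only illustrates the shape arithmetic and the order-preserving traversal.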

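Experimental Results

Research Questions

RQ1: How can a dedicated Spatial-Temporal Convolution connector improve the fusion of spatial and temporal information in video-language models?
RQ2: Does adding a jointly trained audio branch improve VideoLLaMA 2's multimodal understanding and cross-modal reasoning?
RQ3: How large are VideoLLaMA 2's relative gains over open-source and proprietary Video-LLMs on MC-VQA, OE-VQA, VC, and audio-visual tasks?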

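Key Findings

VideoLLaMA 2 (7B and 8x7B backbones) is competitive with open-source models on MC-VQA scores and outperforms some proprietary models on certain benchmarks.
On the EgoSchema, Perception-Test, and MV-Bench MC-VQA tasks, VideoLLaMA 2-7B outperforms the previous open-source SOTA (e.g., LLaVA-NeXT-Video) and beats GPT4-V on MV-Bench.
In video captioning (MSVC), VideoLLaMA 2 scores higher on correctness and detailedness than other open-source models, although GPT4-V remains stronger on some metrics.
On OE-VQA, VideoLLaMA 2 generally outperforms several open-source baselines and is competitive with LLaVA-NeXT-Video on benchmarks such as MSVD and Video-ChatGPT.
Audio understanding benchmarks show strong performance on audio-language and audio-visual tasks, supported by the joint audio-visual training phase.
Scaling the LLM backbone from 7B to Mixtral-8x7B yields a notable improvement in MC-VQA performance.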

This review was generated by AI and checked by a human editor.