QUICK REVIEW

[Paper Review] SkyReels-V4: Multi-modal Video-Audio Generation, Inpainting and Editing model

Chen, Guibin, Lin, Dixuan|arXiv (Cornell University)|Feb 25, 2026

Generative Adversarial Networks and Image Synthesis|Citations: 0

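One-line summary

SkyReels-V4 is a dual-stream Multi-Modal Diffusion Transformer that jointly generates video and audio from multi-modal prompts, and it unifies inpainting and editing in a single framework that supports cinema-level resolution and long durations.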

ABSTRACT

SkyReels-V4 is a unified multi-modal video foundation model for joint video-audio generation, inpainting, and editing. The model adopts a dual-stream Multimodal Diffusion Transformer (MMDiT) architecture, where one branch synthesizes video and the other generates temporally aligned audio, while sharing a powerful text encoder based on a Multimodal Large Language Model (MLLM). SkyReels-V4 accepts rich multi-modal instructions, including text, images, video clips, masks, and audio references. By combining the MLLM's multi-modal instruction-following capability with in-context learning in the video-branch MMDiT, the model can inject fine-grained visual guidance under complex conditioning, while the audio-branch MMDiT simultaneously leverages audio references to guide sound generation. On the video side, we adopt a channel-concatenation formulation that unifies a wide range of inpainting-style tasks, such as image-to-video, video extension, and video editing, under a single interface, and naturally extends to vision-referenced inpainting and editing via multi-modal prompts. SkyReels-V4 supports up to 1080p resolution, 32 FPS, and 15-second duration, enabling high-fidelity, multi-shot, cinema-level video generation with synchronized audio. To make such high-resolution, long-duration generation computationally feasible, we introduce an efficiency strategy: joint generation of low-resolution full sequences and high-resolution keyframes, followed by dedicated super-resolution and frame interpolation models. To our knowledge, SkyReels-V4 is the first video foundation model that simultaneously supports multi-modal input, joint video-audio generation, and a unified treatment of generation, inpainting, and editing, while maintaining strong efficiency and quality at cinematic resolutions and durations.

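Motivation and objectives

Advance unified foundation models that jointly generate video and audio conditioned on text, images, video clips, masks, and audio references.
Enable comprehensive inpainting and editing within a single architecture, driven by multi-modal inputs.
Reach cinema-level video generation (1080p, 32 FPS, 15 seconds) at a feasible compute cost through joint generation of a low-resolution full sequence and high-resolution keyframes, followed by super-resolution.
Integrate multi-modal instruction following through a shared multimodal language model backbone that harmonizes visual and audio conditioning.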

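Proposed method

Proposes a dual-stream MMDiT in which one branch models video and the other models audio; both share a frozen MLLM text encoder for multi-modal instruction following.
The video branch uses a channel-concatenation inpainting formulation that expresses image-to-video, video extension, editing, and vision-referenced inpainting as special cases of conditional generation.
Bidirectional audio-video cross-attention and cross-modal RoPE scaling align the temporal dynamics of the two modalities.
To provide in-context learning over visual and audio references, reference images are fed into the video self-attention, and conditioning tokens use an offset 3D RoPE.
Trained with a flow-matching objective to jointly generate video and audio conditioned on multi-modal inputs, including text, images, video clips, masks, and audio references.
Introduces a Refiner module that post-processes the low-resolution base generation together with high-resolution keyframes (super-resolution and frame interpolation) to produce high-quality 1080p output, with Video Sparse Attention (VSA) for efficiency.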
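
The channel-concatenation idea can be illustrated with a small NumPy sketch. This is not the paper's implementation: the function names `task_mask` and `model_input`, the latent shapes, and the convention that the mask lives at latent resolution (1 = given, 0 = to be generated) are all assumptions made for illustration. The point is only that each task differs in its mask, while the model input layout stays the same.

```python
import numpy as np

def task_mask(task, T, H, W, known_frames=1, region=None):
    """Keep-mask for one task: 1 = content is given, 0 = must be generated.

    Hypothetical helper; task names and arguments are illustrative only.
    """
    keep = np.zeros((T, H, W), dtype=np.float32)
    if task == "image_to_video":
        keep[0] = 1.0                       # only the first frame is given
    elif task == "video_extension":
        keep[:known_frames] = 1.0           # a prefix of frames is given
    elif task == "video_editing":
        t0, t1, y0, y1, x0, x1 = region
        keep[:] = 1.0                       # everything is given ...
        keep[t0:t1, y0:y1, x0:x1] = 0.0     # ... except the edited region
    elif task != "text_to_video":           # text-to-video: nothing is given
        raise ValueError(f"unknown task: {task}")
    return keep

def model_input(noisy_latent, reference_latent, keep):
    """Concatenate [noisy latent | masked reference | mask] along channels.

    noisy_latent, reference_latent: arrays of shape (C, T, H, W).
    keep: array of shape (T, H, W).
    """
    mask = keep[None]                            # (1, T, H, W)
    masked_reference = reference_latent * mask   # hide regions to generate
    return np.concatenate([noisy_latent, masked_reference, mask], axis=0)

# Example: an edit that regenerates a 2x2 latent patch in every frame.
C, T, H, W = 4, 6, 8, 8
rng = np.random.default_rng(0)
noisy = rng.standard_normal((C, T, H, W))
reference = rng.standard_normal((C, T, H, W))
edit_mask = task_mask("video_editing", T, H, W, region=(0, T, 2, 4, 2, 4))
x = model_input(noisy, reference, edit_mask)     # shape (2*C + 1, T, H, W)
```

Under these assumptions, switching between image-to-video, video extension, and editing changes only the mask passed to `model_input`, which is what lets a single interface cover all of them.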

Figure 1: Overview of the proposed method.
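
For readers unfamiliar with the flow-matching objective named in the method summary, the following is a minimal sketch of one common variant (a straight-line path from noise to data with a velocity target). The review does not state which variant, timestep sampling, or loss weighting the paper uses, so the function names, the uniform timestep, and the `audio_weight` parameter are assumptions.

```python
import numpy as np

def flow_matching_pair(x_data, rng, t=None):
    """Build one training pair on the straight path from noise to data.

    Returns the interpolated sample x_t, the timestep t, and the target
    velocity. Assumes t = 0 is pure noise and t = 1 is clean data.
    """
    noise = rng.standard_normal(x_data.shape)
    if t is None:
        t = rng.uniform()                      # assumed: uniform timestep
    x_t = (1.0 - t) * noise + t * x_data       # point on the straight path
    target = x_data - noise                    # constant velocity of that path
    return x_t, t, target

def flow_matching_loss(pred_velocity, target):
    """Mean squared error between predicted and target velocity."""
    return float(np.mean((pred_velocity - target) ** 2))

def joint_loss(pred_video, tgt_video, pred_audio, tgt_audio, audio_weight=1.0):
    """Sum of the per-modality losses; the weighting is a hypothetical knob."""
    return (flow_matching_loss(pred_video, tgt_video)
            + audio_weight * flow_matching_loss(pred_audio, tgt_audio))
```

In a joint video-audio setup, each branch would predict the velocity for its own latent, and the two losses would be combined as in `joint_loss`. How the actual model balances the two terms is not described in this review.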

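Experimental results

Research questions

RQ1: How can a single architecture jointly generate video and synchronized audio conditioned on multi-modal prompts?
RQ2: Can video inpainting, editing, and generation be unified under a channel-concatenation conditioning formulation?
RQ3: Which efficiency strategy makes 1080p, 32 FPS, 15-second multi-shot video generation with synchronized audio feasible?
RQ4: Does a shared MLLM backbone improve instruction following and cross-modal consistency across text, image, video, and audio inputs?
RQ5: How well does the model perform under multi-modal conditioning on vision-referenced generation and editing tasks?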

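Key findings

SkyReels-V4 achieves state-of-the-art results on the Artificial Analysis Arena benchmark.
Human evaluation on SkyReels-VABench shows significant improvements over proprietary systems in instruction following, motion quality, and complex multi-shot storytelling.
Robustly handles reference-to-video, motion-to-video, and video editing tasks under multi-modal conditioning.
The unified channel-concatenation inpainting formulation delivers image-to-video, video extension, editing, and vision-referenced inpainting within a single architecture.
Combining joint generation of a low-resolution full sequence and high-resolution keyframes with post-processing (super-resolution and frame interpolation) achieves cinema-quality generation within a realistic compute budget.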
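
A rough back-of-the-envelope calculation shows why generating a low-resolution full sequence plus a few high-resolution keyframes can be cheaper than generating every frame at full resolution. Only the target (1080p, 32 FPS, 15 seconds) comes from the review. The 1920x1080 frame size, the low-resolution setting (960x540 at 16 FPS), and the keyframe count (15) are hypothetical values chosen for illustration, and raw pixel counts are only a proxy for real compute.

```python
def pixel_volume(width, height, fps, seconds):
    """Pixel positions (per channel) in a clip of the given size and length."""
    return width * height * round(fps * seconds)

# Target output stated in the review: 1080p, 32 FPS, 15 s.
# Assumes a 16:9 frame of 1920x1080.
full = pixel_volume(1920, 1080, 32, 15)

def two_stage_volume(low_w, low_h, low_fps, seconds, key_w, key_h, num_keyframes):
    """Low-resolution full sequence plus a handful of high-resolution keyframes."""
    low = pixel_volume(low_w, low_h, low_fps, seconds)
    keyframes = key_w * key_h * num_keyframes
    return low + keyframes

# Hypothetical base-model settings, NOT taken from the paper.
staged = two_stage_volume(960, 540, 16, 15, 1920, 1080, 15)

ratio = full / staged
print(f"full: {full:,}  staged: {staged:,}  ratio: {ratio:.1f}x")
```

With these illustrative numbers the base model handles 6.4 times fewer pixel positions than direct full-resolution generation. Attention cost grows faster than linearly with sequence length, so the saving in practice could be larger, although the super-resolution and frame interpolation stages add their own cost.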

Figure 2: The pipeline of the video super-resolution and frame interpolation method. F denotes the output latent of our base model. KF denotes the keyframe latent of our base model.


This review was generated by AI and checked by a human editor.