QUICK REVIEW

[Paper Review] HunyuanVideo: A Systematic Framework For Large Video Generative Models

Weijie Kong, Qi Tian|arXiv (Cornell University)|Dec 3, 2024

Generative Adversarial Networks and Image Synthesis | Cited by 6

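One-Line Summary

HunyuanVideo is an open-source 13B-parameter video foundation model built on a systematic framework spanning data curation, architecture, scaling, and infrastructure. It rivals leading closed-source models in visual quality, motion, and text-video alignment, and in professional human evaluation it outperforms Runway Gen-3, Luma 1.6, and three top-performing Chinese video generative models.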

ABSTRACT

Recent advancements in video generation have significantly impacted daily life for both individuals and industries. However, the leading video generation models remain closed-source, resulting in a notable performance gap between industry capabilities and those available to the public. In this report, we introduce HunyuanVideo, an innovative open-source video foundation model that demonstrates performance in video generation comparable to, or even surpassing, that of leading closed-source models. HunyuanVideo encompasses a comprehensive framework that integrates several key elements, including data curation, advanced architectural design, progressive model scaling and training, and an efficient infrastructure tailored for large-scale model training and inference. As a result, we successfully trained a video generative model with over 13 billion parameters, making it the largest among all open-source models. We conducted extensive experiments and implemented a series of targeted designs to ensure high visual quality, motion dynamics, text-video alignment, and advanced filming techniques. According to evaluations by professionals, HunyuanVideo outperforms previous state-of-the-art models, including Runway Gen-3, Luma 1.6, and three top-performing Chinese video generative models. By releasing the code for the foundation model and its applications, we aim to bridge the gap between closed-source and open-source communities. This initiative will empower individuals within the community to experiment with their ideas, fostering a more dynamic and vibrant video generation ecosystem. The code is publicly available at https://github.com/Tencent/HunyuanVideo.

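Motivation and Objectives

Develop a scalable, high-quality video generation model to close the gap between open-source and closed-source video foundation models.
Design an end-to-end framework for large-scale video generation that covers data curation, model architecture, training, and deployment.
Investigate decoding and conditioning strategies (text-video alignment, motion dynamics) to achieve high visual quality and temporally consistent long videos.
Provide an open-source foundation model and tools that enable community-driven innovation in video generation.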

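Proposed Method

Develop a comprehensive open-source framework for large-scale video generation that covers data processing, model architecture, training, and inference.
Use a 3D VAE (Causal 3D VAE) to compress video and image data into the latent space used for diffusion-based generation.
Adopt a unified Transformer-based diffusion backbone with full spatio-temporal attention and Rotary Position Embeddings (RoPE) extended to 3D.
Integrate a text encoder based on a Multimodal Large Language Model (MLLM) for guidance, with CLIP features supplying global prompt information.
Apply Flow Matching as the training objective and train in two progressive stages: image pre-training, followed by joint image-video training.
Implement data curation filters, structured captions, and camera-movement annotations to improve prompt following and controllability.
Employ time-step shifting and guidance distillation to accelerate inference and improve sample quality.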
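The Flow Matching objective and the time-step shifting mentioned in the list above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the common linear-path formulation (noise at t = 0, data at t = 1, a constant velocity target), uniform sampling of t, and a widely used shift function of the form s*t / (1 + (s - 1)*t). The names `predict_velocity`, `shift_timestep`, and the shift value 7.0 are hypothetical and chosen only for illustration. Which end of [0, 1] counts as the noise end is a convention that differs between codebases, so the shift function is shown purely as a reparameterization.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def velocity_target(x0, x1):
    """Constant velocity of the straight path: d/dt interpolate(x0, x1, t)."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predict_velocity, x1, rng):
    """Mean squared error between predicted and target velocity for one sample.

    predict_velocity is a hypothetical stand-in for the network:
    it takes (x_t, t) and returns a list with the same length as x_t.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # Gaussian noise sample
    t = rng.random()                         # assumed: uniform t in [0, 1)
    x_t = interpolate(x0, x1, t)
    target = velocity_target(x0, x1)
    pred = predict_velocity(x_t, t)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(x1)


def shift_timestep(t, s):
    """Reparameterize t in [0, 1]; s = 1 is the identity, s > 1 pushes t toward 1."""
    return s * t / (1.0 + (s - 1.0) * t)


if __name__ == "__main__":
    rng = random.Random(0)
    data = [0.5, -1.0, 2.0]                        # toy "latent" vector
    zero_model = lambda x_t, t: [0.0] * len(x_t)   # untrained placeholder model
    print("loss:", flow_matching_loss(zero_model, data, rng))
    print("shifted:", [round(shift_timestep(k / 4, 7.0), 3) for k in range(5)])
```

With a shift factor above 1, evenly spaced time steps are redistributed so that more of them fall near one end of the interval. That is the usual motivation for shifting when sampling with few steps.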

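Experimental Results

Research Questions

RQ1: Can an open-source video foundation model match or surpass the performance of leading closed-source models?
RQ2: Which choices of data curation, training curriculum, and architecture enable large-scale, high-quality, temporally consistent video generation?
RQ3: How can text-video alignment and camera/motion control be integrated effectively into a unified diffusion framework?
RQ4: Which scaling laws and progressive training strategies optimize compute, data, and model size for large video models?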

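Key Findings

The project trained a 13B-parameter open-source video model, reported as the largest open-source video model at the time of the report.
In a human evaluation in which 60 evaluators rated more than 1,500 prompts, HunyuanVideo outperformed Gen-3, Luma 1.6, and top-performing Chinese models, with a particular advantage in motion dynamics.
An optimized data, resource, and model scaling strategy can reduce the required compute by up to 5x.
Progressive fine-tuning and curriculum learning yield high visual quality, rich motion dynamics, and strong text-video alignment.
A unified diffusion backbone with full spatio-temporal attention and RoPE-based 3D position embeddings supports both image and video generation effectively within a single framework.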
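To make the idea of RoPE-based 3D position embeddings concrete, here is a minimal sketch under stated assumptions, not the model's actual code. It assumes the usual construction in which the channels of an attention head are split into three groups, one each for the time, height, and width coordinates, and each group is rotated with standard 1D RoPE angles. The split `dims`, the base `theta`, and all function names are hypothetical.

```python
import math


def axis_angles(pos, dim, theta=10000.0):
    """RoPE angles for one axis: one angle per channel pair (dim must be even)."""
    return [pos / theta ** (2 * i / dim) for i in range(dim // 2)]


def rope_3d_angles(t, h, w, dims):
    """Concatenate per-axis angles; dims = (d_t, d_h, d_w) sums to the head dim."""
    d_t, d_h, d_w = dims
    return axis_angles(t, d_t) + axis_angles(h, d_h) + axis_angles(w, d_w)


def apply_rope(vec, angles):
    """Rotate each consecutive channel pair (vec[2i], vec[2i+1]) by angles[i]."""
    out = []
    for i, a in enumerate(angles):
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(a) - y * math.sin(a))
        out.append(x * math.sin(a) + y * math.cos(a))
    return out


def rope_3d(vec, t, h, w, dims):
    """Apply 3D RoPE to one query or key vector at grid position (t, h, w)."""
    return apply_rope(vec, rope_3d_angles(t, h, w, dims))


def dot(a, b):
    """Plain dot product, standing in for an attention logit."""
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    dims = (2, 4, 2)  # hypothetical split of an 8-channel head
    q = [0.3, -1.2, 0.7, 0.1, -0.5, 0.9, 1.1, -0.4]
    k = [1.0, 0.2, -0.6, 0.8, 0.4, -0.3, 0.5, 0.6]
    # Both pairs of positions differ by the same offset (1, 2, 3).
    a = dot(rope_3d(q, 1, 2, 3, dims), rope_3d(k, 2, 4, 6, dims))
    b = dot(rope_3d(q, 5, 5, 5, dims), rope_3d(k, 6, 7, 8, dims))
    print(math.isclose(a, b, abs_tol=1e-9))
```

The check at the end illustrates the property that makes this scheme attractive: the query-key dot product depends only on the relative offset along each of the three axes. Under this scheme an image can be handled as a single-frame video with t = 0, which fits the single-framework claim above.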


This review was generated by AI and checked by a human editor.