QUICK REVIEW

[Paper Review] CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Zhuoyi Yang, Jiayan Teng|arXiv (Cornell University)|Aug 12, 2024

Natural Language Processing Techniques | Citations: 15

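IN A NUTSHELL

CogVideoX is a large-scale diffusion transformer model that generates 10-second videos at 16 fps and 768×1360 resolution, guided by text prompts. It uses a 3D VAE, an expert transformer with adaptive LayerNorm, and progressive multi-resolution training.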

ABSTRACT

We present CogVideoX, a large-scale text-to-video generation model based on a diffusion transformer, which can generate 10-second continuous videos aligned with the text prompt, with a frame rate of 16 fps and a resolution of 768 × 1360 pixels. Previous video generation models often had limited movement and short durations, and found it difficult to generate videos with coherent narratives based on text. We propose several designs to address these issues. First, we propose a 3D Variational Autoencoder (VAE) to compress videos along both spatial and temporal dimensions, to improve both compression rate and video fidelity. Second, to improve the text-video alignment, we propose an expert transformer with the expert adaptive LayerNorm to facilitate the deep fusion between the two modalities. Third, by employing progressive training and a multi-resolution frame pack technique, CogVideoX is adept at producing coherent, long-duration videos of different shapes characterized by significant motions. In addition, we develop an effective text-video data processing pipeline that includes various data preprocessing strategies and a video captioning method, greatly contributing to the generation quality and semantic alignment. Results show that CogVideoX demonstrates state-of-the-art performance across both multiple machine metrics and human evaluations. The model weights of the 3D Causal VAE, the video caption model, and CogVideoX are publicly available at https://github.com/THUDM/CogVideo.
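
To put the abstract's figures in perspective, the following sketch works out how many frames and pixel positions a 10-second clip holds, and how much smaller the grid becomes after spatial and temporal compression. The 10 s, 16 fps, and 768 × 1360 values come from the abstract. The compression factors (4× in time, 8× per spatial side) and the helper names `clip_shape` and `latent_shape` are illustrative assumptions, not figures stated in this review; channel counts are ignored.

```python
import math

def clip_shape(seconds, fps, height, width):
    """Return (frames, height, width) of a raw clip."""
    return (seconds * fps, height, width)

def latent_shape(frames, height, width, t_factor=4, s_factor=8):
    """Grid size after compression.

    t_factor and s_factor are hypothetical compression factors chosen
    for illustration; they are not taken from the review text.
    """
    return (
        math.ceil(frames / t_factor),
        math.ceil(height / s_factor),
        math.ceil(width / s_factor),
    )

# Figures stated in the abstract: 10 s, 16 fps, 768 x 1360.
frames, h, w = clip_shape(seconds=10, fps=16, height=768, width=1360)
lt, lh, lw = latent_shape(frames, h, w)

raw_positions = frames * h * w       # 160 * 768 * 1360 = 167,116,800
latent_positions = lt * lh * lw      # 40 * 96 * 170 = 652,800

print(frames)                             # 160
print((lt, lh, lw))                       # (40, 96, 170)
print(raw_positions // latent_positions)  # 256, i.e. 4 * 8 * 8
```

Under these assumed factors the transformer would see a grid 256 times smaller than the raw clip, which is the practical reason a compressing VAE matters for long videos.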

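MOTIVATION AND OBJECTIVES

Address the challenge of generating long, temporally consistent videos from text.
Improve text-video alignment and the modeling of dynamic motion.
Achieve scalable high-resolution training and high-fidelity video reconstruction.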

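PROPOSED METHOD

Introduce a 3D causal VAE that compresses video along the spatial and temporal dimensions, giving high fidelity with a shorter sequence length.
Incorporate an expert transformer with expert adaptive LayerNorm to fuse the text and video modalities.
Apply 3D full attention and 3D-RoPE to capture large motions and spatiotemporal relationships.
Use multi-resolution frame pack and progressive training to handle different resolutions and durations.
Develop a dense video captioning pipeline to produce high-quality text-video pairs for training.
Implement Explicit Uniform Sampling to stabilize the training loss and accelerate convergence.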
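
The review does not describe how Explicit Uniform Sampling works. One reading, assumed here, is that the diffusion timestep range 1..T is split into as many contiguous intervals as there are data-parallel ranks, and each rank samples only from its own interval, so every training step covers the whole range evenly. The sketch below illustrates that idea; the function name `explicit_uniform_timestep` and its parameters are hypothetical and not taken from the CogVideoX code base.

```python
import random

def explicit_uniform_timestep(rank, world_size, num_timesteps, rng=random):
    """Sample a timestep in [1, num_timesteps] from the interval owned by `rank`.

    Assumption: [1, T] is split into `world_size` contiguous integer
    intervals, one per data-parallel rank.
    """
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    # Integer boundaries of this rank's interval (both ends inclusive).
    lo = 1 + (rank * num_timesteps) // world_size
    hi = ((rank + 1) * num_timesteps) // world_size
    # max() guards against an empty interval when world_size > num_timesteps.
    return rng.randint(lo, max(lo, hi))

# Four ranks, 1000 timesteps: each rank draws from its own quarter.
rng = random.Random(0)
steps = [explicit_uniform_timestep(r, 4, 1000, rng) for r in range(4)]
print(steps)
```

Compared with every rank sampling independently from the full range, this stratified draw keeps the per-step mix of timesteps fixed across ranks, which is a plausible route to the smoother loss curve the review mentions.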

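EXPERIMENTAL RESULTS

Research Questions

RQ1: How can long, temporally consistent video generation conditioned on text prompts be achieved?
RQ2: Which architectural adaptations (3D VAE, expert transformer, 3D attention) best align the video and text modalities?
RQ3: Can progressive training, multi-resolution frame pack, and a robust data/text pipeline improve video fidelity and semantic alignment?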

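Key Findings

CogVideoX-5B and CogVideoX-2B achieved state-of-the-art performance on both automatic metrics and human evaluations across multiple dynamic video benchmarks.
The 3D VAE, built on causal convolutions, reduces flicker, improves reconstruction quality, and makes longer sequences feasible.
Expert adaptive LayerNorm and 3D full attention improve text-video alignment and temporal consistency.
Progressive training and multi-resolution frame pack yield stable training and high-quality, long videos of up to 10 seconds at 768×1360 resolution and 16 fps.
The video captioning and filtering pipeline substantially improves semantic understanding and data quality for training.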
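
The expert adaptive LayerNorm named in the findings can be pictured as ordinary layer normalization followed by a scale and shift that differ per modality. The sketch below is a simplified illustration in plain Python, assuming text tokens come first in the sequence and that each modality's scale and shift are passed in directly; in a real model they would presumably be predicted from the diffusion timestep embedding. The names `expert_adaln`, `text_mod`, and `video_mod` are hypothetical.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def modulate(x, scale, shift):
    """Apply y = LayerNorm(x) * (1 + scale) + shift, element-wise."""
    return [n * (1.0 + s) + b for n, s, b in zip(layer_norm(x), scale, shift)]

def expert_adaln(tokens, num_text_tokens, text_mod, video_mod):
    """Modulate text and video tokens with separate (scale, shift) pairs.

    tokens: list of token vectors, text tokens first (an assumption).
    text_mod, video_mod: (scale, shift) tuples, one per modality.
    """
    out = []
    for i, tok in enumerate(tokens):
        scale, shift = text_mod if i < num_text_tokens else video_mod
        out.append(modulate(tok, scale, shift))
    return out

# One text token followed by two video tokens, all with the same content.
tokens = [[1.0, 2.0, 3.0, 4.0] for _ in range(3)]
text_mod = ([0.0] * 4, [0.0] * 4)    # identity modulation for text
video_mod = ([1.0] * 4, [0.5] * 4)   # doubled scale plus a shift for video
out = expert_adaln(tokens, num_text_tokens=1, text_mod=text_mod, video_mod=video_mod)
```

Identical inputs come out differently depending on modality, which is the point: the two feature spaces get their own normalization statistics while sharing the rest of the transformer.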

This review was generated by AI and checked by a human editor.