QUICK REVIEW

[Paper Digest] HunyuanVideo: A Systematic Framework For Large Video Generative Models

Weijie Kong, Qi Tian | arXiv (Cornell University) | Dec 3, 2024

Generative Adversarial Networks and Image Synthesis | Cited by 6

ONE-SENTENCE SUMMARY

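HunyuanVideo is an open-source 13B-parameter video foundation model built on a systematic framework (data curation, architecture, scaling, and infrastructure). Its visual quality, motion, and text-video alignment are comparable to those of leading closed-source models. In human evaluations it outperforms previous state-of-the-art models, including Runway Gen-3, Luma 1.6, and three top-performing Chinese video generative models.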

ABSTRACT

Recent advancements in video generation have significantly impacted daily life for both individuals and industries. However, the leading video generation models remain closed-source, resulting in a notable performance gap between industry capabilities and those available to the public. In this report, we introduce HunyuanVideo, an innovative open-source video foundation model that demonstrates performance in video generation comparable to, or even surpassing, that of leading closed-source models. HunyuanVideo encompasses a comprehensive framework that integrates several key elements, including data curation, advanced architectural design, progressive model scaling and training, and an efficient infrastructure tailored for large-scale model training and inference. As a result, we successfully trained a video generative model with over 13 billion parameters, making it the largest among all open-source models. We conducted extensive experiments and implemented a series of targeted designs to ensure high visual quality, motion dynamics, text-video alignment, and advanced filming techniques. According to evaluations by professionals, HunyuanVideo outperforms previous state-of-the-art models, including Runway Gen-3, Luma 1.6, and three top-performing Chinese video generative models. By releasing the code for the foundation model and its applications, we aim to bridge the gap between closed-source and open-source communities. This initiative will empower individuals within the community to experiment with their ideas, fostering a more dynamic and vibrant video generation ecosystem. The code is publicly available at https://github.com/Tencent/HunyuanVideo.

RESEARCH MOTIVATION AND GOALS

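Bridge the gap between open-source and closed-source video foundation models by developing a scalable, high-quality video generator.
Design an end-to-end training and deployment framework for large-scale video generation that covers data curation, model architecture, and infrastructure.
Study decoding and conditioning strategies (text-video alignment, motion dynamics) to achieve high visual quality and coherent long videos.
Provide an open-source foundation model and tooling to encourage community-driven innovation in video generation.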

PROPOSED METHOD

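Develops a comprehensive open-source framework for large-scale video generation (data processing, model architecture, training, and inference).
Uses a Causal 3D VAE to compress video and image data into a latent space for diffusion-based generation.
Adopts a unified Transformer-based diffusion backbone with full spatio-temporal attention and extends Rotary Position Embedding (RoPE) to 3D.
Integrates a text encoder based on a Multimodal Large Language Model (MLLM) for guidance, combined with CLIP features that supply a global prompt signal.
Uses Flow Matching as the training objective, with progressive training in two stages: image pre-training followed by joint image-video training.
Implements data curation and filtering, structured captioning, and camera-motion annotation to improve prompt adherence and controllability.
Applies time-step shifting and guidance distillation to accelerate inference and improve sample quality.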
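
The Flow Matching objective and time-step shifting named in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not HunyuanVideo's code: it assumes a straight-line path between Gaussian noise and the data latent, a uniformly sampled t, and the shift function s·σ / (1 + (s − 1)·σ) applied to a noise level σ in [0, 1] (σ = 1 meaning pure noise). All function names and the shift factor used in the note below are hypothetical.

```python
import random

def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Velocity of the straight path: d x_t / d t = x1 - x0 (constant in t)."""
    return [b - a for a, b in zip(x0, x1)]

def flow_matching_loss(predict_velocity, x1, rng):
    """Mean squared error between predicted and target velocity, one sample.

    predict_velocity(xt, t) stands in for the diffusion backbone (hypothetical).
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # Gaussian noise sample
    t = rng.random()                        # uniform t is an assumption
    xt = interpolate(x0, x1, t)
    v_pred = predict_velocity(xt, t)
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(x1)

def shift_noise_level(sigma, s):
    """Remap a noise level sigma in [0, 1]; s = 1 leaves it unchanged.

    For s > 1 every interior sigma moves toward 1, so a fixed number of
    sampling steps spends more of its schedule at high noise.
    """
    return s * sigma / (1.0 + (s - 1.0) * sigma)
```

With an illustrative shift factor of 7, `shift_noise_level(0.5, 7.0)` evaluates to 0.875, while the endpoints 0 and 1 stay fixed. That is the sense in which shifting reallocates a small step budget rather than changing the start or end of sampling.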

EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

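RQ1: Can an open-source video foundation model match or even surpass the performance of leading closed-source models?
RQ2: Which data curation, training curriculum, and architecture choices best enable high-quality, temporally coherent video generation at scale?
RQ3: How can text-video alignment and camera/motion control be integrated effectively into a unified diffusion framework?
RQ4: Which scaling laws and progressive training strategies optimize compute, data, and model size for large video models?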

KEY FINDINGS

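The project trained a 13B-parameter open-source video model, making it the largest publicly reported open-source video model.
In a human evaluation covering more than 1,500 prompts and 60 evaluators, HunyuanVideo outperforms Gen-3, Luma 1.6, and the top Chinese models, most notably in motion dynamics.
With optimal scaling strategies for data, resources, and model size, compute and resource requirements can be cut by a factor of 5.
The model achieves high visual quality, motion dynamics, and strong text-video alignment through progressive fine-tuning and curriculum learning.
A unified diffusion backbone with full spatio-temporal attention and RoPE-based 3D position embeddings makes efficient image and video generation possible within a single framework.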
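
One common way to build a RoPE-based 3D position embedding is to split a token's channels into three groups and rotate each group by the token's time, height, and width coordinate respectively. The sketch below illustrates that idea only. The channel split `dims`, the frequency base, and all function names are hypothetical and are not taken from the HunyuanVideo code.

```python
import math

def rope_angles(position, dim, base=10000.0):
    """Rotation angles for one coordinate over dim // 2 channel pairs."""
    return [position / (base ** (2 * i / dim)) for i in range(dim // 2)]

def rotate_pairs(vec, angles):
    """Rotate consecutive channel pairs (vec[2i], vec[2i+1]) by angles[i]."""
    out = []
    for i, theta in enumerate(angles):
        a, b = vec[2 * i], vec[2 * i + 1]
        out.append(a * math.cos(theta) - b * math.sin(theta))
        out.append(a * math.sin(theta) + b * math.cos(theta))
    return out

def rope_3d(vec, t, h, w, dims=(4, 6, 6)):
    """Rotary embedding with channels split across (time, height, width).

    dims gives the number of channels assigned to each axis; each entry
    must be even and the entries must sum to len(vec).
    """
    assert sum(dims) == len(vec) and all(d % 2 == 0 for d in dims)
    out, start = [], 0
    for pos, d in zip((t, h, w), dims):
        out.extend(rotate_pairs(vec[start:start + d], rope_angles(pos, d)))
        start += d
    return out
```

Because each pair is only rotated, the vector norm is preserved, and the dot product between a rotated query and a rotated key depends only on the offset between their (t, h, w) coordinates. That relative-position property is what lets one attention layer treat an image as a single-frame video.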

This digest was generated by AI and reviewed by a human editor.