[Paper Review] Wan: Open and Advanced Large-Scale Video Generative Models

WanTeam, Ang Wang | ArXiv.org | Mar 26, 2025
Generative Adversarial Networks and Image Synthesis | Cited by 6
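One-Sentence Summary

Wan is an open suite of large-scale video foundation models (1.3B and 14B parameters) that delivers strong video generation across multiple tasks, supported by data and model scaling, efficient operation on consumer-grade GPUs, and a fully open-source release.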

ABSTRACT

This report presents Wan, a comprehensive and open suite of video foundation models designed to push the boundaries of video generation. Built upon the mainstream diffusion transformer paradigm, Wan achieves significant advancements in generative capabilities through a series of innovations, including our novel VAE, scalable pre-training strategies, large-scale data curation, and automated evaluation metrics. These contributions collectively enhance the model's performance and versatility. Specifically, Wan is characterized by four key features: Leading Performance: The 14B model of Wan, trained on a vast dataset comprising billions of images and videos, demonstrates the scaling laws of video generation with respect to both data and model size. It consistently outperforms the existing open-source models as well as state-of-the-art commercial solutions across multiple internal and external benchmarks, demonstrating a clear and significant performance superiority. Comprehensiveness: Wan offers two capable models, i.e., 1.3B and 14B parameters, for efficiency and effectiveness respectively. It also covers multiple downstream applications, including image-to-video, instruction-guided video editing, and personal video generation, encompassing up to eight tasks. Consumer-Grade Efficiency: The 1.3B model demonstrates exceptional resource efficiency, requiring only 8.19 GB VRAM, making it compatible with a wide range of consumer-grade GPUs. Openness: We open-source the entire series of Wan, including source code and all models, with the goal of fostering the growth of the video generation community. This openness seeks to significantly expand the creative possibilities of video production in the industry and provide academia with high-quality video foundation models. All the code and models are available at https://github.com/Wan-Video/Wan2.1.

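Motivation and Objectives

  • Demonstrate open, scalable video generation built on a diffusion transformer backbone.
  • Present a full suite of models (1.3B and 14B) that covers diverse video tasks.
  • Emphasize data curation, a novel VAE, scalable pre-training, and automated evaluation as drivers of progress in video generation.
  • Provide a configuration that runs on consumer-grade GPUs to broaden accessibility.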

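Proposed Method

  • Build on the diffusion transformer architecture for video generation.
  • Introduce a novel VAE component to improve video modeling.
  • Develop scalable pre-training strategies on a dataset of billions of images and videos.
  • Curate data at large scale and implement automated evaluation metrics.
  • Open-source the complete codebase and all models for community use.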

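Experimental Results

Research Questions

  • RQ1: Can open, large-scale video foundation models outperform both open-source and commercial video generation systems on standard benchmarks?
  • RQ2: How do data scale and model scale affect the quality and efficiency of video generation?
  • RQ3: Does the 1.3B model, which targets consumer-grade GPUs, remain highly capable while staying efficient?
  • RQ4: How do the novel VAE and automated evaluation affect video generation performance?
  • RQ5: To what extent does openness foster the growth of the video generation community?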

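Key Findings

  • The 14B Wan model, trained on billions of images and videos, outperforms existing open-source solutions and some commercial solutions on internal and external benchmarks.
  • Wan offers two models, 1.3B for efficiency and 14B for effectiveness, and covers multiple downstream tasks.
  • The 1.3B model is notably memory-efficient, requiring only 8.19 GB of VRAM, which makes it compatible with a wide range of consumer-grade GPUs.
  • The suite is fully open source, including the code and all models, to foster the growth of the video generation community.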

This review was generated by AI and reviewed by human editors.