QUICK REVIEW

[Paper Review] Wan: Open and Advanced Large-Scale Video Generative Models

WanTeam, Ang Wang|ArXiv.org|2025. 03. 26.

Generative Adversarial Networks and Image Synthesis | Citations: 6

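One-Line Summary

Wan is an open suite of large-scale video foundation models (1.3B and 14B parameters) that demonstrates strong video generation across a range of tasks through data and model scaling, efficient operation on consumer-grade GPUs, and a fully open-source release.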

ABSTRACT

This report presents Wan, a comprehensive and open suite of video foundation models designed to push the boundaries of video generation. Built upon the mainstream diffusion transformer paradigm, Wan achieves significant advancements in generative capabilities through a series of innovations, including our novel VAE, scalable pre-training strategies, large-scale data curation, and automated evaluation metrics. These contributions collectively enhance the model's performance and versatility. Specifically, Wan is characterized by four key features: Leading Performance: The 14B model of Wan, trained on a vast dataset comprising billions of images and videos, demonstrates the scaling laws of video generation with respect to both data and model size. It consistently outperforms the existing open-source models as well as state-of-the-art commercial solutions across multiple internal and external benchmarks, demonstrating a clear and significant performance superiority. Comprehensiveness: Wan offers two capable models, i.e., 1.3B and 14B parameters, for efficiency and effectiveness respectively. It also covers multiple downstream applications, including image-to-video, instruction-guided video editing, and personal video generation, encompassing up to eight tasks. Consumer-Grade Efficiency: The 1.3B model demonstrates exceptional resource efficiency, requiring only 8.19 GB VRAM, making it compatible with a wide range of consumer-grade GPUs. Openness: We open-source the entire series of Wan, including source code and all models, with the goal of fostering the growth of the video generation community. This openness seeks to significantly expand the creative possibilities of video production in the industry and provide academia with high-quality video foundation models. All the code and models are available at https://github.com/Wan-Video/Wan2.1.

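Motivation and Objectives

Demonstrate open and scalable video generation on a diffusion transformer backbone.
Introduce a comprehensive set of models (1.3B and 14B) covering a variety of video tasks.
Highlight data curation, a novel VAE, scalable pre-training, and automated evaluation as the means of advancing video generation.
Provide consumer-GPU-friendly configurations to improve accessibility.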

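Proposed Method

Builds on a diffusion transformer architecture for video generation.
Introduces a novel VAE component to strengthen video modeling.
Develops scalable pre-training strategies over billions of images and videos.
Curates large-scale data and implements automated evaluation metrics.
Open-sources the full codebase and all models for community use.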

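Experimental Results

Research Questions

RQ1: Can an open, large-scale video foundation model outperform open-source and commercial video generation systems on standard benchmarks?
RQ2: How do data scale and model scale affect video generation quality and efficiency?
RQ3: Can the consumer-GPU-friendly 1.3B model deliver strong capabilities while remaining efficient?
RQ4: What effect do the novel VAE and automated evaluation have on video generation performance?
RQ5: To what extent does openness accelerate progress in the video generation community?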

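Key Results

The 14B Wan model, trained on billions of images and videos, outperforms existing open-source models and some commercial solutions on internal and external benchmarks.
Wan offers two models, 1.3B and 14B, targeting efficiency and effectiveness respectively across multiple downstream tasks.
The 1.3B model is notably VRAM-efficient on consumer-grade GPUs, requiring only 8.19 GB of VRAM.
The suite is fully open-sourced, including the code and all models, to foster the growth of the video generation community.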
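
To put the 8.19 GB figure in context, the following is a rough back-of-the-envelope sketch of how much memory the weights alone would occupy for the two model sizes. The parameter counts and the 8.19 GB value come from the abstract; the numeric precisions, the names `weight_memory_gb` and `weight_memory_table`, and the decimal-gigabyte convention are assumptions made for illustration. The abstract does not say which precision or which components the reported figure covers, so this is not a reconstruction of the authors' measurement.

```python
# Illustrative arithmetic only: memory needed to hold model weights.
# The parameter counts (1.3B, 14B) and the 8.19 GB figure are from the abstract;
# the precisions below are assumptions, not the authors' stated configuration.

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}
MODEL_PARAMS = {"Wan 1.3B": 1.3e9, "Wan 14B": 14e9}
REPORTED_VRAM_GB = 8.19  # reported in the abstract for the 1.3B model


def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Return the size of the weights alone in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def weight_memory_table() -> dict:
    """Map (model, precision) to the estimated weight memory in GB."""
    return {
        (model, precision): weight_memory_gb(params, width)
        for model, params in MODEL_PARAMS.items()
        for precision, width in BYTES_PER_PARAM.items()
    }


if __name__ == "__main__":
    for (model, precision), gb in weight_memory_table().items():
        print(f"{model:9s} {precision:5s} {gb:6.1f} GB")
```

Under these assumptions, 16-bit weights for the 1.3B model would take about 2.6 GB, well below the reported 8.19 GB, while the 14B model at the same precision would need about 28 GB for weights alone. The abstract does not break down how the remainder of the 8.19 GB is spent, so the estimate should be read only as a lower bound on weight storage.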


This review was generated by AI and reviewed by a human editor.