QUICK REVIEW

[Paper Review] Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation

Wenjing Wang, Huan Yang | arXiv (Cornell University) | May 18, 2023

Topic: Generative Adversarial Networks and Image Synthesis | Citations: 19

One-Line Summary

VideoFactory introduces Swapped Spatiotemporal Cross-Attention (Swap-CA) into a 3D-window diffusion framework to model space and time jointly and, trained on HD-VG-130M, generates high-resolution, watermark-free 16:9 videos.

ABSTRACT

With the explosive popularity of AI-generated content (AIGC), video generation has recently received a lot of attention. Generating videos guided by text instructions poses significant challenges, such as modeling the complex relationship between space and time, and the lack of large-scale text-video paired data. Existing text-video datasets suffer from limitations in both content quality and scale, or they are not open-source, rendering them inaccessible for study and use. For model design, previous approaches extend pretrained text-to-image generation models by adding temporal 1D convolution/attention modules for video generation. However, these approaches overlook the importance of jointly modeling space and time, inevitably leading to temporal distortions and misalignment between texts and videos. In this paper, we propose a novel approach that strengthens the interaction between spatial and temporal perceptions. In particular, we utilize a swapped cross-attention mechanism in 3D windows that alternates the "query" role between spatial and temporal blocks, enabling mutual reinforcement for each other. Moreover, to fully unlock model capabilities for high-quality video generation and promote the development of the field, we curate a large-scale and open-source video dataset called HD-VG-130M. This dataset comprises 130 million text-video pairs from the open-domain, ensuring high-definition, widescreen and watermark-free characters. A smaller-scale yet more meticulously cleaned subset further enhances the data quality, aiding models in achieving superior performance. Experimental quantitative and qualitative results demonstrate the superiority of our approach in terms of per-frame quality, temporal correlation, and text-video alignment, with clear margins.

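Motivation and Goals

Motivates high-quality open-domain video generation that goes beyond frame-wise image backbones.
Explores joint spatiotemporal modeling to reduce temporal distortion and improve text-video alignment.
Develops a scalable pipeline for high-resolution, watermark-free video generation.
Builds a large-scale training corpus (HD-VG-130M) to support open-domain video synthesis.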

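Proposed Method

Proposes Swap-CA within 3D windows to enable interaction between spatial and temporal features.
Uses a latent diffusion framework with a spatiotemporal U-Net for 3D noise prediction.
Applies Swap-CA at block boundaries and uses 3D window attention to balance performance and efficiency.
Builds a large-scale dataset of 130M text-video pairs (HD-VG-130M) from open-domain sources, with training captions generated by BLIP-2.
Applies 2× spatial upscaling and a Real-ESRGAN-based super-resolution module to reach an output resolution of 1376×768.
Trains jointly on HD-VG-130M and WebVid-10M to improve generalization.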
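
The swapped cross-attention idea from the list above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`window_partition_3d`, `window_merge_3d`, `cross_attention`, `swap_ca`), the absence of learned query/key/value projections, the residual connection, and the window size are all assumptions made for illustration. The sketch only shows the two ingredients the review names: attention restricted to non-overlapping 3D windows, and a "query" role that alternates between the spatial and temporal features.

```python
import numpy as np


def window_partition_3d(x, window):
    """Split (F, H, W, C) features into non-overlapping 3D windows.

    Returns an array of shape (num_windows, tokens_per_window, C).
    """
    f, h, w, c = x.shape
    wf, wh, ww = window
    assert f % wf == 0 and h % wh == 0 and w % ww == 0
    x = x.reshape(f // wf, wf, h // wh, wh, w // ww, ww, c)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)  # window indices first
    return x.reshape(-1, wf * wh * ww, c)


def window_merge_3d(windows, shape, window):
    """Inverse of window_partition_3d: restore the (F, H, W, C) layout."""
    f, h, w, c = shape
    wf, wh, ww = window
    x = windows.reshape(f // wf, h // wh, w // ww, wf, wh, ww, c)
    x = x.transpose(0, 3, 1, 4, 2, 5, 6)
    return x.reshape(f, h, w, c)


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(query, context):
    """Scaled dot-product cross-attention, batched over windows.

    Simplification (assumption): no learned projections; the context
    serves as both keys and values.
    """
    scale = 1.0 / np.sqrt(query.shape[-1])
    scores = query @ context.transpose(0, 2, 1) * scale
    return softmax(scores) @ context


def swap_ca(spatial, temporal, window, spatial_is_query):
    """Cross-attend spatial and temporal features inside each 3D window.

    The flag selects which feature map plays the query role; callers
    alternate it from block to block to "swap" the roles.
    """
    s = window_partition_3d(spatial, window)
    t = window_partition_3d(temporal, window)
    query, context = (s, t) if spatial_is_query else (t, s)
    out = query + cross_attention(query, context)  # residual (assumption)
    return window_merge_3d(out, spatial.shape, window)


rng = np.random.default_rng(0)
spatial = rng.standard_normal((4, 8, 8, 16))   # (frames, height, width, channels)
temporal = rng.standard_normal((4, 8, 8, 16))
window = (2, 4, 4)                             # hypothetical window size

# Alternate the query role across two consecutive blocks.
out_a = swap_ca(spatial, temporal, window, spatial_is_query=True)
out_b = swap_ca(spatial, temporal, window, spatial_is_query=False)
print(out_a.shape, out_b.shape)
```

With the hypothetical shapes above, each window holds 2 × 4 × 4 = 32 tokens, so every attention matrix is 32 × 32 instead of the 256 × 256 matrix that attention over all 4 × 8 × 8 tokens would need. This is the kind of cost reduction that motivates windowed attention.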

Figure 1 : The paradigm of Swapped Spatiotemporal Cross-Attention (Swap-CA) in comparison with existing video attention schemes. Instead of only conducting self-attention in (a)-(c), we perform cross-attention between spatial and temporal modules in a U-Net, which encourages more spatiotemporal mutu

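Experimental Results

Research Questions

RQ1: How does joint spatiotemporal interaction improve the quality and semantic alignment of text-to-video generation?
RQ2: Does swapped cross-attention between the spatial and temporal modalities reduce temporal distortion and improve text-video consistency?
RQ3: How does large-scale, high-resolution open-domain video data affect video generation performance?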

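Key Results

Swap-CA enables mutual reinforcement between spatial and temporal features, improving text-video alignment (CLIPSIM) and video quality (FVD) in the ablations covered by the excerpt.
3D window attention substantially reduces memory and time costs while maintaining or improving performance.
HD-VG-130M (the 130M-pair open-domain dataset) substantially improves generation quality, improving FVD by 45.74 on the WebVid-10M validation set.
VideoFactory produces watermark-free widescreen video at a high resolution of 1376×768.
In zero-shot evaluation, VideoFactory achieves competitive or better scores than several baselines on MSR-VTT (CLIPSIM 0.3005) and UCF101 (FVD 410.0).
In human evaluation, VideoFactory is rated higher than a number of recent methods in video quality and text-video relevance.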
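
For readers unfamiliar with the CLIPSIM metric cited above, the following sketch shows how such a score is commonly computed: the average cosine similarity between a text embedding and the embedding of each video frame. It assumes the embeddings have already been produced by a CLIP-style encoder and uses tiny hand-written vectors in their place. The helper names are hypothetical, and the paper's exact evaluation protocol (frame sampling, model variant) is not given in this review.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def clipsim(text_embedding, frame_embeddings):
    """Mean text-to-frame cosine similarity over all frames of one video."""
    sims = [cosine_similarity(text_embedding, f) for f in frame_embeddings]
    return sum(sims) / len(sims)


# Toy embeddings standing in for real encoder outputs (assumption).
text = [1.0, 0.0]
frames = [[1.0, 0.0], [0.0, 1.0]]
print(clipsim(text, frames))  # 0.5: one frame matches the text, one is orthogonal
```

A higher value means the frames are, on average, closer to the prompt in the embedding space. FVD works in the opposite direction: it is a distance between feature distributions of real and generated videos, so lower is better.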

Figure 2 : Statistics of video categories, clip durations, and caption word lengths in HD-VG-130M. HD-VG-130M covers a wide range of video categories.


This review was generated by AI and checked by a human editor.