[Paper Review] Adversarial Video Generation on Complex Datasets
DVD-GAN presents a scalable dual-discriminator GAN for high-fidelity video generation on Kinetics-600, achieving state-of-the-art results in video synthesis and prediction.
Generative models of natural images have progressed towards high fidelity samples by the strong leveraging of scale. We attempt to carry this success to the field of video modeling by showing that large Generative Adversarial Networks trained on the complex Kinetics-600 dataset are able to produce video samples of substantially higher complexity and fidelity than previous work. Our proposed model, Dual Video Discriminator GAN (DVD-GAN), scales to longer and higher resolution videos by leveraging a computationally efficient decomposition of its discriminator. We evaluate on the related tasks of video synthesis and video prediction, and achieve new state-of-the-art Fréchet Inception Distance for prediction for Kinetics-600, as well as state-of-the-art Inception Score for synthesis on the UCF-101 dataset, alongside establishing a strong baseline for synthesis on Kinetics-600.
Research Motivation and Goals
- Aim to extend high-fidelity image generation success to the video domain using large-scale datasets.
- Develop a scalable GAN architecture capable of producing long, high-resolution videos.
- Establish strong baselines for class-conditional video synthesis on Kinetics-600.
- Evaluate on video synthesis and video prediction to benchmark temporal dynamics and quality.
Proposed Method
- Build on BigGAN to create a Dual Video Discriminator GAN (DVD-GAN) for videos.
- Introduce two discriminators: a Spatial Discriminator (D_S) and a Temporal Discriminator (D_T).
- Spatially downsample the input to D_T with a function phi, reducing computational load while preserving feedback on temporal dynamics.
- Sample k frames per video for D_S, which judges per-frame content; the per-frame scores are summed to give the final D_S output.
- Train the discriminators with a hinge loss. D_S and D_T together supply the learning signal, and neither one processes the full video at full resolution.
- Train on TPU pods with large-scale distributed training to handle videos of up to 256×256 resolution and up to 48 frames.
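The decomposition described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`sample_frames`, `phi`, `sum_frame_scores`, `hinge_d_loss`) are hypothetical, phi is assumed to be 2×2 average pooling, k = 8 is an assumed value, and the per-frame scorer is a toy stand-in for a real discriminator network.

```python
import numpy as np


def sample_frames(video, k, rng):
    """Pick k distinct frames at random as input for the spatial discriminator D_S."""
    num_frames = video.shape[0]
    idx = np.sort(rng.choice(num_frames, size=k, replace=False))
    return video[idx]  # shape (k, H, W, C), full spatial resolution


def phi(video, factor=2):
    """Spatially downsample every frame by average pooling (assumed form of phi)."""
    t, h, w, c = video.shape
    assert h % factor == 0 and w % factor == 0, "H and W must be divisible by factor"
    blocks = video.reshape(t, h // factor, factor, w // factor, factor, c)
    return blocks.mean(axis=(2, 4))  # shape (T, H/factor, W/factor, C)


def sum_frame_scores(frames, score_fn):
    """D_S output: the sum of the per-frame scores over the sampled frames."""
    return sum(score_fn(frame) for frame in frames)


def hinge_d_loss(real_scores, fake_scores):
    """Discriminator hinge loss: push real scores above +1 and fake scores below -1."""
    real_term = np.mean(np.maximum(0.0, 1.0 - real_scores))
    fake_term = np.mean(np.maximum(0.0, 1.0 + fake_scores))
    return real_term + fake_term


rng = np.random.default_rng(0)
video = rng.standard_normal((48, 128, 128, 3)).astype(np.float32)  # (T, H, W, C)

ds_input = sample_frames(video, k=8, rng=rng)  # few frames, full resolution
dt_input = phi(video)                          # all frames, reduced resolution

print(ds_input.shape)  # (8, 128, 128, 3)
print(dt_input.shape)  # (48, 64, 64, 3)

# Toy scorer (hypothetical): mean intensity stands in for a per-frame network output.
toy_score = sum_frame_scores(ds_input, lambda frame: float(frame.mean()))
```

In this sketch D_S never sees motion and D_T never sees full-resolution detail, so each network covers what the other misses while neither handles the complete T×H×W volume.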
Experimental Results
Research Questions
- RQ1: Can a scalable GAN architecture generate high-fidelity, long-range videos on a diverse dataset like Kinetics-600?
- RQ2: Does decomposing discrimination into spatial and temporal components preserve the feedback necessary for realism at high resolutions?
- RQ3: What are the effects of downsampling and frame sampling (k) on synthesis quality and diversity?
- RQ4: How does DVD-GAN perform on class-conditional video synthesis and future video prediction compared to prior methods?
Key Results
- DVD-GAN achieves state-of-the-art Inception Score on UCF-101 for video synthesis.
- On Kinetics-600, DVD-GAN attains high-fidelity samples at 64×64, 128×128, and 256×256 with up to 48 frames, demonstrating scalable performance.
- For synthesis on Kinetics-600, the paper reports FID and IS across multiple resolutions and video lengths, establishing a strong baseline for future work.
- For prediction, DVD-GAN-FP achieves notably lower Fréchet Video Distance than prior adversarial models on Kinetics-600 and BAIR datasets.
- The dual-discriminator setup significantly reduces computational burden while maintaining strong feedback signals for realism across space and time.
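The computational saving from the dual-discriminator setup can be made concrete with simple pixel counting. The sketch below compares a single discriminator that reads every pixel of the video against the D_S plus D_T pair. The helper name `pixels_processed` is hypothetical, and k = 8 with a downsampling factor of 2 are assumed values that this review does not state.

```python
def pixels_processed(t, h, w, k, factor):
    """Return (full, dual): pixels per video seen by one full-video
    discriminator versus by D_S and D_T combined."""
    full = t * h * w                               # every frame at full resolution
    spatial = k * h * w                            # D_S: k frames at full resolution
    temporal = t * (h // factor) * (w // factor)   # D_T: all frames, downsampled
    return full, spatial + temporal


full, dual = pixels_processed(t=48, h=128, w=128, k=8, factor=2)
print(full, dual)                 # 786432 327680
print(round(1 - dual / full, 3))  # 0.583
```

Under these assumptions, a 48-frame 128×128 video costs the two discriminators together roughly 58% fewer pixels than a single discriminator over the full video.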
This review was generated by AI and reviewed by a human editor.