QUICK REVIEW

[Paper Review] Towards High Resolution Video Generation with Progressive Growing of Sliced Wasserstein GANs

U. Dinesh Acharya, Zhiwu Huang | arXiv (Cornell University) | 2018-10-04

Generative Adversarial Networks and Image Synthesis | References: 2 | Citations: 45

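One-Line Summary

This work extends progressive growing of GANs to high-resolution video generation by adding spatiotemporal layers step by step, and uses a Sliced Wasserstein GAN (SWGAN) loss to stabilize training on high-dimensional video data. The approach is demonstrated on a new facial-dynamics video dataset at 256x256x32 resolution.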

ABSTRACT

The extension of image generation to video generation turns out to be a very difficult task, since the temporal dimension of videos introduces an extra challenge during the generation process. Besides, due to the limitation of memory and training stability, the generation becomes increasingly challenging with the increase of the resolution/duration of videos. In this work, we exploit the idea of progressive growing of Generative Adversarial Networks (GANs) for higher resolution video generation. In particular, we begin to produce video samples of low-resolution and short-duration, and then progressively increase both resolution and duration alone (or jointly) by adding new spatiotemporal convolutional layers to the current networks. Starting from the learning on a very raw-level spatial appearance and temporal movement of the video distribution, the proposed progressive method learns spatiotemporal information incrementally to generate higher resolution videos. Furthermore, we introduce a sliced version of Wasserstein GAN (SWGAN) loss to improve the distribution learning on the video data of high-dimension and mixed-spatiotemporal distribution. SWGAN loss replaces the distance between joint distributions by that of one-dimensional marginal distributions, making the loss easier to compute. We evaluate the proposed model on our collected face video dataset of 10,900 videos to generate photorealistic face videos of 256x256x32 resolution. In addition, our model also reaches a record inception score of 14.57 in unsupervised action recognition dataset UCF-101.

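Motivation and Objectives

Address the training instability and memory limits that arise in high-resolution video generation.
Propose a progressive growing framework that increases video resolution and duration step by step.
Introduce the Sliced Wasserstein GAN (SWGAN) loss to stabilize distribution learning on high-dimensional video data.
Build a large-scale facial-dynamics video dataset (TrailerFaces, about 10.9k clips) for training and evaluation.
Demonstrate improvements over existing video GANs in both appearance and dynamics, with competitive Inception Score and FID results.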

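Proposed Method

Extends Progressive Growing of GANs to the spatiotemporal domain for video generation.
Uses 3D convolutions and transition (fade-in) phases so that new layers gradually raise resolution and duration.
Applies minibatch standard deviation and pixel normalization to stabilize training.
Adopts the SWGAN loss, which approximates the Wasserstein distance through one-dimensional projections, to stabilize learning of high-dimensional distributions.
Builds and uses the TrailerFaces dataset of facial-dynamics video clips for training and evaluation (the clip count is given as 10,910 in this review and as 10,900 in the abstract).
Evaluates on UCF-101 and in-the-wild datasets with Inception Score (IS) and Frechet Inception Distance (FID).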
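
The one-dimensional projection idea behind the SWGAN loss can be illustrated with a small NumPy sketch. This is only an assumption-laden illustration, not the authors' loss: it uses random unit directions and a plain Monte Carlo estimate on flattened samples, and the function name `sliced_wasserstein_distance` and the `num_projections` parameter are hypothetical. The paper's actual loss may choose or learn its projections differently.

```python
import numpy as np


def sliced_wasserstein_distance(real, fake, num_projections=128, rng=None):
    """Monte Carlo estimate of the squared sliced 2-Wasserstein distance.

    real, fake: arrays of shape (n, d) holding the same number of samples,
    e.g. flattened videos or feature vectors. Illustrative sketch only.
    """
    rng = np.random.default_rng() if rng is None else rng
    real = np.asarray(real, dtype=np.float64)
    fake = np.asarray(fake, dtype=np.float64)
    if real.ndim != 2 or real.shape != fake.shape:
        raise ValueError("real and fake must both have shape (n, d)")

    d = real.shape[1]
    # Assumption: projection directions are random unit vectors.
    directions = rng.normal(size=(d, num_projections))
    directions /= np.linalg.norm(directions, axis=0, keepdims=True)

    # Project every sample onto every direction: shape (n, num_projections).
    real_proj = real @ directions
    fake_proj = fake @ directions

    # In one dimension, optimal transport between two equal-size empirical
    # distributions simply pairs the sorted samples.
    real_sorted = np.sort(real_proj, axis=0)
    fake_sorted = np.sort(fake_proj, axis=0)

    # Average squared gap over samples and over projection directions.
    return float(np.mean((real_sorted - fake_sorted) ** 2))


if __name__ == "__main__":
    gen = np.random.default_rng(0)
    real_batch = gen.normal(size=(64, 16))
    fake_batch = gen.normal(loc=0.5, size=(64, 16))
    print(sliced_wasserstein_distance(real_batch, fake_batch, rng=gen))
```

Sorting is what keeps each one-dimensional distance cheap: no transport plan has to be solved, which is the sense in which the abstract calls the loss "easier to compute".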

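Experimental Results

Research Questions

RQ1: Can progressive growing of GANs be extended effectively to generate higher-resolution and longer video sequences?
RQ2: Does the Sliced Wasserstein GAN loss improve the stability and quality of high-dimensional video generation?
RQ3: Which datasets and evaluation metrics best show improvements in both the appearance and the dynamics of video GANs?
RQ4: How does the proposed method perform against existing video GANs (VideoGAN, Temporal GAN, etc.) on standard and in-the-wild datasets?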

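Key Results

The method generates videos at up to 256x256x32 resolution, larger than the 64x64x32 previously reported.
The progressive growing strategy with spatiotemporal layers improves both appearance and dynamics compared with existing methods.
The SWGAN loss promotes stable training on high-dimensional video distributions and integrates with the progressive framework.
In the unsupervised setting, the method reaches a record Inception Score of 14.57 on the UCF-101 action recognition dataset.
It obtains better FID scores than state-of-the-art methods on two challenging in-the-wild datasets.
A new TrailerFaces dataset of 10,910 facial-dynamics video clips (10,900 according to the abstract) is introduced to support research on high-resolution video generation.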
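
How a transition phase can raise resolution and duration without shocking the network can be sketched as follows. This is a minimal illustration in the spirit of the fade-in used in progressive growing of GANs, not the authors' implementation: the helper names `upsample_nearest` and `fade_in`, the `(T, H, W, C)` layout, nearest-neighbor upsampling along time, and the scale factors are all assumptions made for the example.

```python
import numpy as np


def upsample_nearest(video, time_factor=1, space_factor=2):
    """Nearest-neighbor upsampling of a video shaped (T, H, W, C)."""
    video = np.repeat(video, time_factor, axis=0)   # more frames
    video = np.repeat(video, space_factor, axis=1)  # taller frames
    video = np.repeat(video, space_factor, axis=2)  # wider frames
    return video


def fade_in(low_res_video, new_layer_output, alpha, time_factor=1, space_factor=2):
    """Blend the upsampled previous-stage output with the new layer's output.

    alpha ramps from 0 (only the old, upsampled path) to 1 (only the new
    layer) over the transition phase.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    upsampled = upsample_nearest(low_res_video, time_factor, space_factor)
    if upsampled.shape != new_layer_output.shape:
        raise ValueError("upsampled video and new layer output shapes differ")
    return (1.0 - alpha) * upsampled + alpha * new_layer_output


if __name__ == "__main__":
    # Hypothetical stage change: 8 frames at 32x32 to 16 frames at 64x64.
    old_stage = np.zeros((8, 32, 32, 3))
    new_stage = np.ones((16, 64, 64, 3))
    blended = fade_in(old_stage, new_stage, alpha=0.25, time_factor=2)
    print(blended.shape, blended.mean())
```

Setting `time_factor=1` grows only the spatial resolution and `space_factor=1` grows only the duration, which mirrors the abstract's option of increasing the two "alone (or jointly)".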


This review was generated by AI and reviewed by a human editor.