QUICK REVIEW

[Paper Review] VideoGPT: Video Generation using VQ-VAE and Transformers

Wilson Yan, Yunzhi Zhang|arXiv (Cornell University)|2021. 04. 20.

Generative Adversarial Networks and Image Synthesis|References: 74|Citations: 144

One-Line Summary

VideoGPT uses a VQ-VAE to compress video into discrete latents and a GPT-style autoregressive transformer to model those latents, yielding a simple, scalable pipeline for competitive video generation.

ABSTRACT

We present VideoGPT: a conceptually simple architecture for scaling likelihood based generative modeling to natural videos. VideoGPT uses VQ-VAE that learns downsampled discrete latent representations of a raw video by employing 3D convolutions and axial self-attention. A simple GPT-like architecture is then used to autoregressively model the discrete latents using spatio-temporal position encodings. Despite the simplicity in formulation and ease of training, our architecture is able to generate samples competitive with state-of-the-art GAN models for video generation on the BAIR Robot dataset, and generate high fidelity natural videos from UCF-101 and Tumbler GIF Dataset (TGIF). We hope our proposed architecture serves as a reproducible reference for a minimalistic implementation of transformer based video generation models. Samples and code are available at https://wilson1yan.github.io/videogpt/index.html

Motivation and Objectives

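Investigate whether likelihood-based autoregressive models can scale to natural video generation.
Explore the use of a discrete latent space (VQ-VAE) to reduce spatio-temporal complexity.
Evaluate how axial attention and latent-space design affect the realism and fidelity of generated videos.
Demonstrate both conditional and unconditional video generation.
Provide ablation experiments to guide reproducible, minimal transformer-based video generation.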

Proposed Method

Train a VQ-VAE with 3D convolutions and axial attention to learn downsampled discrete latents of videos.
Model the latent sequence autoregressively with a GPT-like transformer using spatio-temporal position encodings.
Use learned cross-attention or conditional norms for action/class conditioning of the prior.
Decode the latent samples back to full-resolution video via the VQ-VAE decoder.
Train with maximum likelihood and apply dropout for regularization in the prior.
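
The steps above can be sketched end to end on toy data. This is a minimal illustration in plain Python, not the authors' implementation. `toy_prior` is a hypothetical stand-in for the transformer prior, the 4-entry codebook and the 2×2×2 latent grid are made-up values, and the final codebook lookup only stands in for the input to the VQ-VAE decoder (the real decoder is a learned network).

```python
import random


def quantize(vectors, codebook):
    """Map each encoder output vector to the index of its nearest codebook entry."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(v, codebook[k]))
        for v in vectors
    ]


def flatten_grid(grid):
    """Flatten a T x H x W grid of code indices into one raster-scan sequence."""
    return [code for frame in grid for row in frame for code in row]


def unflatten(seq, shape):
    """Inverse of flatten_grid: rebuild the T x H x W grid from a flat sequence."""
    t, h, w = shape
    assert len(seq) == t * h * w
    return [
        [[seq[(i * h + j) * w + k] for k in range(w)] for j in range(h)]
        for i in range(t)
    ]


def toy_prior(prefix, num_codes=4):
    """Hypothetical stand-in for the transformer prior.

    Returns a distribution over the next code given all previous codes.
    It simply favors repeating the last code; a real prior is learned.
    """
    if not prefix:
        return [1.0 / num_codes] * num_codes
    probs = [0.1] * num_codes
    probs[prefix[-1]] = 0.7
    return probs


def sample_sequence(next_token_probs, length, rng):
    """Autoregressive (ancestral) sampling: one token at a time, left to right."""
    seq = []
    for _ in range(length):
        probs = next_token_probs(seq)
        seq.append(rng.choices(range(len(probs)), weights=probs, k=1)[0])
    return seq


# Toy codebook of four 2-D embeddings (assumed values, for illustration only).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# 1) Quantize (made-up) encoder outputs to discrete codes.
encoder_out = [[0.1, -0.2], [0.9, 0.1], [0.2, 0.8], [1.1, 0.9]]
codes = quantize(encoder_out, codebook)  # -> [0, 1, 2, 3]

# 2) Sample a latent sequence from the prior and reshape it to T x H x W.
latent_shape = (2, 2, 2)
sampled = sample_sequence(toy_prior, 2 * 2 * 2, random.Random(0))
latent_grid = unflatten(sampled, latent_shape)

# 3) Look up embeddings; a real VQ-VAE decoder would upsample these to pixels.
decoder_input = [[[codebook[c] for c in row] for row in frame] for frame in latent_grid]
```

In the actual model the encoder and decoder are trained networks and the prior is a transformer. Only the data flow (quantize, flatten, sample token by token, reshape, decode) is meant to carry over.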

Experimental Results

Research Questions

RQ1: Can VideoGPT generate high-fidelity videos on real datasets like BAIR, UCF-101, and TGIF?
RQ2: How do architectural choices (axial attention, latent size, codebook count, transformer depth) affect quality?
RQ3: Is the approach competitive with state-of-the-art GANs for video generation?
RQ4: What is the effect of conditioning mechanisms on conditional video generation?

Key Results

VideoGPT achieves an FVD of 103.3 on BAIR (lower is better), on par with TrIVD-GAN-FP (103.3) and behind Video Transformer (94±2), showing quality competitive with GAN-based methods.
Unconditional VideoGPT samples on UCF-101 achieve IS 24.69±0.30, competitive with several baselines and below DVD-GAN’s 32.97±1.7.
VQ-VAE ablations on BAIR show that axial attention improves reconstruction NMSE from 0.0041 to 0.0033 and the VQ-VAE's FVD from 15.3 to 14.9.
Larger prior network capacity (up to 8-16 transformer layers) improves FVD and sample quality on BAIR.
Optimal latent configuration around 8×32×32 (space-time downsampling) yields best sample quality while maintaining reconstruction fidelity.
Using a single VQ-VAE codebook often yields best sample quality compared to multiple codebooks in their experiments.
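
The latent-size result is easier to read with the arithmetic spelled out. The sketch below assumes 16-frame 64×64 RGB input clips (an assumption, not stated in this review) and uses the 8×32×32 latent grid quoted above. The helper names are hypothetical. It shows how many tokens the prior has to model and how the pairwise cost of full self-attention grows with that count.

```python
from math import prod


def sequence_length(latent_shape):
    """Number of discrete tokens the prior must model for one clip."""
    return prod(latent_shape)


def downsampling_factors(video_shape, latent_shape):
    """Per-axis (time, height, width) downsampling implied by the two shapes."""
    return tuple(v // l for v, l in zip(video_shape, latent_shape))


def attention_pairs(latent_shape):
    """Token pairs in full self-attention, which scales quadratically with length."""
    n = sequence_length(latent_shape)
    return n * n


video_shape = (16, 64, 64)   # frames x height x width (assumed clip size)
latent_shape = (8, 32, 32)   # latent grid quoted in the results above

tokens = sequence_length(latent_shape)                        # 8192 tokens
raw_values = prod(video_shape) * 3                            # 196608 RGB values
factors = downsampling_factors(video_shape, latent_shape)     # (2, 2, 2)
pairs = attention_pairs(latent_shape)                         # 67108864 pairs
```

Under these assumptions a larger latent grid keeps more detail for reconstruction, but every doubling of the token count roughly quadruples the attention cost of the prior, which is one way to read the trade-off the ablation explores.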


This review was generated by AI and reviewed by a human editor.