QUICK REVIEW

[Paper Review] FitVid: Overfitting in Pixel-Level Video Prediction

Mohammad Babaeizadeh, Mohammad Saffar|arXiv (Cornell University)|2021. 06. 24.

Advanced Image Processing Techniques | References: 110 | Citations: 29

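One-Line Summary

FitVid shows that a convolutional variational video prediction model can severely overfit existing benchmarks with a parameter count similar to that of prior models, and that data augmentation mitigates this overfitting while reaching state-of-the-art performance across several datasets and metrics.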

ABSTRACT

An agent that is capable of predicting what happens next can perform a variety of tasks through planning with no additional training. Furthermore, such an agent can internally represent the complex dynamics of the real-world and therefore can acquire a representation useful for a variety of visual perception tasks. This makes predicting the future frames of a video, conditioned on the observed past and potentially future actions, an interesting task which remains exceptionally challenging despite many recent advances. Existing video prediction models have shown promising results on simple narrow benchmarks but they generate low quality predictions on real-life datasets with more complicated dynamics or broader domain. There is a growing body of evidence that underfitting on the training data is one of the primary causes for the low quality predictions. In this paper, we argue that the inefficient use of parameters in the current video models is the main reason for underfitting. Therefore, we introduce a new architecture, named FitVid, which is capable of severe overfitting on the common benchmarks while having similar parameter count as the current state-of-the-art models. We analyze the consequences of overfitting, illustrating how it can produce unexpected outcomes such as generating high quality output by repeating the training data, and how it can be mitigated using existing image augmentation techniques. As a result, FitVid outperforms the current state-of-the-art models across four different video prediction benchmarks on four different metrics.

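Motivation and Objectives

Motivate the need for better parameter efficiency in pixel-level video prediction as a way to address underfitting in current models.
Introduce FitVid, an architecture capable of substantial overfitting even with a parameter count similar to that of state-of-the-art models.
Investigate the role of data augmentation in preventing overfitting and promoting generalization.
Show that, with augmentation, the model reaches state-of-the-art performance on several real-world video prediction benchmarks.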

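Proposed Method

Proposes a non-hierarchical convolutional variational model with a fixed Gaussian prior for stochastic video prediction.
Uses an encoder-decoder architecture built from residual blocks, batch normalization, Swish activations, and Squeeze-and-Excite modules.
Models the dynamics with a two-layer LSTM that predicts frame-to-frame transitions, and uses a separate LSTM-based encoder for the latent variables, which yields a Gaussian posterior through amortized inference.
Trains by maximizing the evidence lower bound with the Adam optimizer, without curriculum learning or a learned prior.
Applies RandAugment and RandCrop data augmentation to mitigate overfitting and improve generalization.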
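The training objective described above can be illustrated with a small sketch. It assumes a unit-variance Gaussian likelihood, so the reconstruction term reduces to a summed squared error (up to constants), and it uses the closed-form KL divergence between a diagonal Gaussian posterior and the fixed N(0, I) prior. The function names and the `beta` weight are hypothetical, introduced only for illustration; this is not the authors' implementation.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


def negative_elbo(pred_frames, true_frames, mus, log_vars, beta=1.0):
    """Loss for one video: squared error plus beta * KL, summed over time steps.

    pred_frames, true_frames: lists of T frames, each a flat list of pixel values.
    mus, log_vars: lists of T posterior parameter vectors (one per time step).
    beta: hypothetical KL weight (an assumption, not taken from the review).
    """
    recon = sum(
        sum((p - t) ** 2 for p, t in zip(pred, true))
        for pred, true in zip(pred_frames, true_frames)
    )
    kl = sum(kl_to_standard_normal(m, lv) for m, lv in zip(mus, log_vars))
    return recon + beta * kl
```

Minimizing this quantity corresponds to maximizing the evidence lower bound under the stated assumptions. Because the prior is fixed, the KL term needs no extra network, which is one reason a fixed prior keeps the model simpler than a learned-prior variant.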

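Experimental Results

Research Questions

RQ1: Can a parameter-efficient video prediction model achieve high-quality future-frame prediction on real-world datasets without relying on an excessively large architecture or a complex training schedule?
RQ2: Does introducing strong data augmentation expose the tendency to overfit on existing benchmarks and improve generalization?
RQ3: To what extent can augmentation narrow the gap between training accuracy and held-out video quality across different datasets?

Key Results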

Dataset	GHVAE FVD↓	GHVAE PSNR↑	GHVAE SSIM↑	GHVAE LPIPS↓	SVG FVD↓	SVG PSNR↑	SVG SSIM↑	SVG LPIPS↓	FitVid FVD↓	FitVid PSNR↑	FitVid SSIM↑	FitVid LPIPS↓
RoboNet	95.2	24.7	89.1	0.036	123.2	23.9	87.8	0.060	62.5	28.2	89.3	0.024
KITTI	552.9	15.8	51.2	0.286	1217.3	15.0	41.9	0.327	884.5	17.1	49.1	0.217
Human3.6M	355.2	26.7	94.6	0.018	-	-	-	-	154.7	36.2	97.9	0.012
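For reading the table: higher is better for PSNR and SSIM, and lower is better for FVD and LPIPS. As an illustration of the simplest of these metrics, here is a minimal PSNR sketch. It assumes pixel values scaled to [0, 1] and frames given as flat lists; the function name and the `max_value` parameter are illustrative and do not come from the paper's evaluation code.

```python
import math


def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized flat pixel lists."""
    if len(pred) != len(target) or not target:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Because PSNR is a logarithm of the inverse mean squared error, a gain of about 10 dB corresponds to a tenfold reduction in mean squared error.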

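According to the paper, FitVid outperforms prior state-of-the-art models on four challenging video prediction benchmarks across four metrics.
Without augmentation, FitVid clearly overfits on Human3.6M and KITTI, and it also overfits on RoboNet when given a larger parameter count.
RandAugment and RandCrop effectively mitigate the overfitting and improve generalization to held-out videos.
Compared with SVG and GHVAE, FitVid with data augmentation is better on RoboNet and Human3.6M; on KITTI it leads GHVAE on PSNR and LPIPS but trails it on FVD and SSIM in the table above.
On BAIR, FitVid outperforms most existing non-variational methods and is competitive with Video Transformer once parameter count is taken into account.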
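The cropping augmentation mentioned above can be sketched as follows. The sketch assumes that one crop window is sampled per video and shared by all of its frames, so that motion between frames is preserved; the review does not spell out this detail, so treat it as an assumption. The function name and the nested-list frame layout are hypothetical.

```python
import random


def random_crop_video(frames, crop_h, crop_w, rng=None):
    """Crop every frame of one video with the same randomly chosen window.

    frames: list of T frames, each a list of H rows, each row a list of W pixels.
    rng: optional random.Random instance for reproducibility.
    """
    rng = rng or random.Random()
    height, width = len(frames[0]), len(frames[0][0])
    if crop_h > height or crop_w > width:
        raise ValueError("crop size exceeds frame size")
    # Sample the window once per video (randint bounds are inclusive).
    top = rng.randint(0, height - crop_h)
    left = rng.randint(0, width - crop_w)
    return [
        [row[left:left + crop_w] for row in frame[top:top + crop_h]]
        for frame in frames
    ]
```

Sampling a separate window for each frame would add artificial jitter between frames, which a model trained to predict dynamics would then have to explain; sharing the window avoids that.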


This review was generated by AI and reviewed by a human editor.