QUICK REVIEW

[Paper Review] Learning to Decompose and Disentangle Representations for Video Prediction

Jun-Ting Hsieh, Bingbin Liu | arXiv (Cornell University) | 2018-06-11

Generative Adversarial Networks and Image Synthesis | References: 45 | Citations: 107

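One-Line Summary

DDPAE is a framework that automatically decomposes a video into components and disentangles each component into low-dimensional temporal dynamics, so that it can predict future frames from pixels without explicit supervision.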

ABSTRACT

Our goal is to predict future video frames given a sequence of input frames. Despite large amounts of video data, this remains a challenging task because of the high-dimensionality of video frames. We address this challenge by proposing the Decompositional Disentangled Predictive Auto-Encoder (DDPAE), a framework that combines structured probabilistic models and deep networks to automatically (i) decompose the high-dimensional video that we aim to predict into components, and (ii) disentangle each component to have low-dimensional temporal dynamics that are easier to predict. Crucially, with an appropriately specified generative model of video frames, our DDPAE is able to learn both the latent decomposition and disentanglement without explicit supervision. For the Moving MNIST dataset, we show that DDPAE is able to recover the underlying components (individual digits) and disentanglement (appearance and location) as we would intuitively do. We further demonstrate that DDPAE can be applied to the Bouncing Balls dataset involving complex interactions between multiple objects to predict the video frame directly from the pixels and recover physical states without explicit supervision.

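Motivation and Objectives

Reduce the complexity of prediction by decomposing high-dimensional video into components.
Automatically discover the decomposed components and their low-dimensional temporal dynamics without supervision.
Show that decomposition and disentanglement improve future frame prediction on Moving MNIST and Bouncing Balls.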

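Proposed Method

Formulate DDPAE as a structured probabilistic model parameterized by deep networks.
Decompose the video into N components, each with a content representation shared across frames and a low-dimensional pose.
For each component, predict the low-dimensional pose dynamics and reconstruct frames through a frame decoder equipped with a spatial transformer.
Infer the latent variables within a variational auto-encoder framework and optimize the ELBO.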
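
The content/pose split described above can be illustrated with a toy example. The sketch below is not the DDPAE decoder: it replaces the learned spatial transformer with a hard integer translation, uses hand-written content patches instead of inferred latents, and clips summed intensities to [0, 1] as an assumption. All function and variable names are hypothetical.

```python
# Illustrative sketch only: a hard integer-translation stand-in for a
# spatial-transformer frame decoder. All names here are hypothetical.

def place_patch(patch, top, left, height, width):
    """Paste a 2D content patch into a blank height x width canvas at (top, left)."""
    canvas = [[0.0] * width for _ in range(height)]
    for i, row in enumerate(patch):
        for j, value in enumerate(row):
            y, x = top + i, left + j
            if 0 <= y < height and 0 <= x < width:  # drop pixels that fall off-frame
                canvas[y][x] = value
    return canvas


def compose_frame(contents, poses, height, width):
    """Sum the per-component canvases and clip intensities to [0, 1] (assumed)."""
    frame = [[0.0] * width for _ in range(height)]
    for patch, (top, left) in zip(contents, poses):
        placed = place_patch(patch, top, left, height, width)
        for y in range(height):
            for x in range(width):
                frame[y][x] = min(1.0, frame[y][x] + placed[y][x])
    return frame


def predict_video(contents, pose_sequences, height, width):
    """Content stays fixed per component; only the pose changes over time."""
    num_steps = len(pose_sequences[0])
    return [
        compose_frame(contents, [seq[t] for seq in pose_sequences], height, width)
        for t in range(num_steps)
    ]


# Two components with constant content and simple linear pose trajectories.
contents = [[[1.0, 1.0], [1.0, 1.0]], [[0.5]]]
pose_sequences = [
    [(0, 0), (1, 1), (2, 2)],  # component 0 moves diagonally
    [(3, 0), (3, 1), (3, 2)],  # component 1 moves right along the bottom row
]
video = predict_video(contents, pose_sequences, height=4, width=4)
```

Because each component's content is constant, predicting the future here only requires extrapolating a two-number pose per component, which is the intuition behind making the temporal dynamics low-dimensional.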

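Experimental Results

Research Questions

RQ1: Can automatically decomposing a video into components with disentangled, low-dimensional dynamics enable more accurate future frame prediction?
RQ2: Does learning both decomposition and disentanglement improve prediction on datasets with moving digits and interacting objects?
RQ3: Can the model handle interdependent components and an uncertain number of objects?
RQ4: How well does the model recover interpretable components (e.g., digits, balls) from pixels without supervision?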

Key Results

Model	BCE	MSE
Shi et al. [45]	367.2	-
Srivastava et al. [33]	341.2	-
Brabandere et al. [5]	285.2	-
Patraucean et al. [26]	262.6	-
Ghosh et al. [10]	241.8	167.9
Kalchbrenner et al. [15]	87.6	-
MCNet [39]	1308.2	173.2
DRNet [6]	862.7	163.9
Ours w/o Decomposition	325.5	77.6
Ours w/o Disentanglement	296.1	65.6
Ours (DDPAE)	223.0	38.9
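
For readers unfamiliar with the two metrics, the sketch below shows one common way to compute them for a single predicted frame. It assumes both are summed over pixels rather than averaged, which this review does not state; the function names and the `eps` clamp are illustrative choices, not taken from the paper.

```python
import math


def frame_bce(pred, target, eps=1e-7):
    """Binary cross-entropy summed over all pixels of one frame (assumed reduction)."""
    total = 0.0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total


def frame_mse(pred, target):
    """Squared error summed over all pixels of one frame (assumed reduction)."""
    return sum(
        (p - t) ** 2
        for p_row, t_row in zip(pred, target)
        for p, t in zip(p_row, t_row)
    )


# Tiny 2 x 2 example: predicted intensities versus a binary ground-truth frame.
pred = [[0.9, 0.1], [0.2, 0.8]]
target = [[1.0, 0.0], [0.0, 1.0]]
bce = frame_bce(pred, target)
mse = frame_mse(pred, target)
```

Lower is better for both metrics, so the bottom row of the table is the strongest result among the rows that report MSE.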

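On Moving MNIST, DDPAE clearly outperforms the baselines without decomposition or disentanglement (lower BCE and MSE).
The model learns to separate the digits into components and to automatically disentangle appearance (content) from location (pose).
On Bouncing Balls, DDPAE predicts complex interactions such as collisions directly from pixels and recovers physical properties without explicit state modeling.
DDPAE is robust to an unknown or variable number of components: it assigns surplus components to be empty when they are not needed.
Modeling interdependent components improves velocity prediction during collisions compared with independent components.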
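
To put the ablation comparison in concrete terms, the relative error reductions can be worked out directly from the table. The snippet below is plain arithmetic on the reported numbers; the helper name `relative_reduction` is hypothetical.

```python
# (BCE, MSE) values copied from the results table.
ablations = {
    "Ours w/o Decomposition": (325.5, 77.6),
    "Ours w/o Disentanglement": (296.1, 65.6),
}
ddpae_bce, ddpae_mse = 223.0, 38.9


def relative_reduction(baseline, value):
    """Percentage by which `value` is lower than `baseline`."""
    return 100.0 * (baseline - value) / baseline


reductions = {
    name: (
        relative_reduction(bce, ddpae_bce),
        relative_reduction(mse, ddpae_mse),
    )
    for name, (bce, mse) in ablations.items()
}

for name, (bce_drop, mse_drop) in reductions.items():
    print(f"{name}: BCE -{bce_drop:.1f}%, MSE -{mse_drop:.1f}%")
```

Relative to the variant without decomposition, the full model lowers BCE by roughly 31.5% and MSE by roughly 49.9%; relative to the variant without disentanglement, the reductions are roughly 24.7% and 40.7%.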


This review was generated by AI and checked by a human editor.