QUICK REVIEW

[Paper Review] Learning to Decompose and Disentangle Representations for Video Prediction

Jun-Ting Hsieh, Bingbin Liu|arXiv (Cornell University)|Jun 11, 2018

Generative Adversarial Networks and Image Synthesis | References: 45 | Cited by: 107

ONE-SENTENCE SUMMARY

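DDPAE is a framework that automatically decomposes a video into multiple components and disentangles each component so that it has low-dimensional temporal dynamics, allowing future frames to be predicted from pixels without explicit supervision.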

ABSTRACT

Our goal is to predict future video frames given a sequence of input frames. Despite large amounts of video data, this remains a challenging task because of the high-dimensionality of video frames. We address this challenge by proposing the Decompositional Disentangled Predictive Auto-Encoder (DDPAE), a framework that combines structured probabilistic models and deep networks to automatically (i) decompose the high-dimensional video that we aim to predict into components, and (ii) disentangle each component to have low-dimensional temporal dynamics that are easier to predict. Crucially, with an appropriately specified generative model of video frames, our DDPAE is able to learn both the latent decomposition and disentanglement without explicit supervision. For the Moving MNIST dataset, we show that DDPAE is able to recover the underlying components (individual digits) and disentanglement (appearance and location) as we would intuitively do. We further demonstrate that DDPAE can be applied to the Bouncing Balls dataset involving complex interactions between multiple objects to predict the video frame directly from the pixels and recover physical states without explicit supervision.

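MOTIVATION AND GOALS

Reduce the complexity of prediction by decomposing high-dimensional video into components.
Automatically discover these components and their low-dimensional temporal dynamics without supervision.
Show that decomposition and disentanglement improve future-frame prediction on the Moving MNIST and Bouncing Balls datasets.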

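PROPOSED METHOD

Formulate DDPAE as a structured probabilistic model whose distributions are parameterized by deep networks.
Decompose the video into N components, each with a content representation shared across time and a low-dimensional, time-varying pose.
Predict the low-dimensional pose dynamics of each component, and reconstruct frames with a frame decoder that uses a spatial transformer.
Infer the latent variables and optimize the evidence lower bound (ELBO) within a variational autoencoder framework.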
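The generative side of the method above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: each component has a fixed content template and a per-frame pose `(scale, tx, ty)`; a simplified spatial transformer (inverse coordinate mapping plus bilinear sampling) places the template on the canvas; the frame is assumed to be the clipped sum of the component renderings; and the learned pose predictor is replaced by a hypothetical constant-velocity extrapolation. All function names are illustrative.

```python
import numpy as np


def bilinear_sample(img, ys, xs):
    """Sample img at fractional pixel coordinates (ys, xs); zero outside."""
    h, w = img.shape
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    out = np.zeros_like(ys, dtype=float)
    for dy in (0, 1):
        for dx in (0, 1):
            yi, xi = y0 + dy, x0 + dx
            # Bilinear weight of this neighbouring pixel.
            wgt = (1.0 - np.abs(ys - yi)) * (1.0 - np.abs(xs - xi))
            valid = (yi >= 0) & (yi < h) & (xi >= 0) & (xi < w)
            vals = np.zeros_like(out)
            vals[valid] = img[yi[valid], xi[valid]]
            out += wgt * vals
    return out


def render_component(content, pose, size):
    """Place a content template on a size x size canvas.

    Simplified spatial transformer (assumption): pose = (scale, tx, ty) maps
    template coordinates in [-1, 1] to canvas coordinates via
    x_canvas = scale * x_template + tx (same for y).
    """
    scale, tx, ty = pose
    th, tw = content.shape
    grid = np.linspace(-1.0, 1.0, size)  # normalized canvas coordinates
    yy, xx = np.meshgrid(grid, grid, indexing="ij")
    # Inverse mapping: which template location feeds each canvas pixel?
    src_x = (xx - tx) / scale
    src_y = (yy - ty) / scale
    # Normalized template coordinates -> template pixel indices.
    px = (src_x + 1.0) * (tw - 1) / 2.0
    py = (src_y + 1.0) * (th - 1) / 2.0
    return bilinear_sample(content, py, px)


def compose_frame(contents, poses, size):
    """Sum the rendered components and clip to [0, 1] (assumed composition)."""
    frame = np.zeros((size, size))
    for content, pose in zip(contents, poses):
        frame += render_component(content, pose, size)
    return np.clip(frame, 0.0, 1.0)


def extrapolate_poses(past_poses, n_future):
    """Hypothetical constant-velocity stand-in for the learned pose predictor."""
    past = np.asarray(past_poses, dtype=float)  # shape (T, 3), T >= 2
    velocity = past[-1] - past[-2]
    steps = np.arange(1, n_future + 1)[:, None]
    return past[-1] + steps * velocity


if __name__ == "__main__":
    square = np.ones((5, 5))   # content of a toy "object"
    empty = np.zeros((5, 5))   # an unused component renders nothing
    past = [(0.3, -0.5, 0.0), (0.3, -0.4, 0.0)]  # object drifting right
    future = extrapolate_poses(past, n_future=3)
    frames = [compose_frame([square, empty], [p, (0.3, 0.0, 0.0)], size=32)
              for p in future]
    print(len(frames), frames[0].shape)  # 3 (32, 32)
```

Because only the three pose values change over time while the content stays fixed, forecasting reduces from predicting every pixel of a 32×32 frame to predicting a few numbers per component. An all-zero content template contributes nothing to the frame, which mirrors how unneeded components can be left empty.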

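EXPERIMENTAL RESULTS

Research Questions

RQ1: Does automatically decomposing a video into components with disentangled, low-dimensional dynamics enable more accurate future-frame prediction?
RQ2: Does jointly learning decomposition and disentanglement improve prediction on datasets with moving digits and interacting objects?
RQ3: Can the model handle interdependent components and an unknown number of objects?
RQ4: How well does DDPAE recover interpretable components (e.g., digits, balls) from pixels without supervision?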

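Key Findings

On Moving MNIST, DDPAE significantly outperforms baselines that omit decomposition or disentanglement, achieving lower BCE and MSE.
The model automatically learns to separate the digits into components and to disentangle appearance (content) from location (pose).
On Bouncing Balls, DDPAE predicts complex interactions (collisions) directly from pixels and recovers physical properties without explicitly modeling physical states.
DDPAE is robust to an unknown or variable number of components: it leaves extra components empty when they are not needed.
Modeling the components as interdependent, rather than independent, improves velocity prediction during collisions.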
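The first finding compares models by BCE and MSE on predicted frames. The sketch below shows one common way to compute these metrics; the convention used here (sum over pixels, then average over predicted frames) is an assumption and should be checked against the paper's evaluation protocol. The function names are illustrative.

```python
import numpy as np


def per_frame_bce(pred, target, eps=1e-7):
    """Binary cross-entropy summed over pixels, averaged over frames.

    pred, target: arrays of shape (T, H, W) with values in [0, 1].
    """
    p = np.clip(pred, eps, 1.0 - eps)  # keep log() finite
    bce = -(target * np.log(p) + (1.0 - target) * np.log(1.0 - p))
    return bce.reshape(bce.shape[0], -1).sum(axis=1).mean()


def per_frame_mse(pred, target):
    """Squared error summed over pixels, averaged over frames."""
    se = (pred - target) ** 2
    return se.reshape(se.shape[0], -1).sum(axis=1).mean()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = (rng.random((10, 64, 64)) > 0.9).astype(float)  # synthetic frames
    close = 0.9 * target + 0.05          # prediction near the target
    blank = np.full_like(target, 0.5)    # uninformative prediction
    print(per_frame_bce(close, target) < per_frame_bce(blank, target))  # True
    print(per_frame_mse(close, target) < per_frame_mse(blank, target))  # True
```

Lower is better for both metrics, which is the direction of the comparison in the first finding.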

This review was generated by AI and reviewed by human editors.