QUICK REVIEW

[Paper Review] Stochastic Latent Residual Video Prediction

Jean-Yves Franceschi, Edouard Delasalles | arXiv (Cornell University) | Feb 21, 2020

Generative Adversarial Networks and Image Synthesis | References: 82 | Cited by: 40

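One-sentence summary

The paper introduces a fully latent stochastic video prediction model whose latent dynamics follow a residual update rule. This makes prediction non-autoregressive, allows generation at higher frame rates, and yields state-of-the-art results on several benchmark datasets.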

ABSTRACT

Designing video prediction models that account for the inherent uncertainty of the future is challenging. Most works in the literature are based on stochastic image-autoregressive recurrent networks, which raises several performance and applicability issues. An alternative is to use fully latent temporal models which untie frame synthesis and temporal dynamics. However, no such model for stochastic video prediction has been proposed in the literature yet, due to design and training difficulties. In this paper, we overcome these difficulties by introducing a novel stochastic temporal model whose dynamics are governed in a latent space by a residual update rule. This first-order scheme is motivated by discretization schemes of differential equations. It naturally models video dynamics as it allows our simpler, more interpretable, latent model to outperform prior state-of-the-art methods on challenging datasets.
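As a reading aid, the link between the residual update and differential-equation discretization can be sketched as below. This assumes a continuous-time latent dynamics function f_theta and leaves the stochastic input z out of the ODE for simplicity; it is an illustration of the general idea, not a formula quoted from the paper.

```latex
\frac{\mathrm{d}y}{\mathrm{d}t} = f_\theta\bigl(y(t)\bigr)
\quad\Longrightarrow\quad
y_{t+\Delta t} = y_t + \Delta t \, f_\theta(y_t)
\qquad \text{(forward Euler, first order)}
```

With a step size of 1, the right-hand side has the same "previous state plus residual" shape as the update rule the abstract describes.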

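Research motivation and goals

Learn, in a self-supervised way, video prediction models that capture the inherent uncertainty of the future.
Propose a fully latent, non-autoregressive temporal model with a stochastic residual update rule.
Separate the evolution of the dynamic latent state from frame synthesis to improve interpretability and efficiency.
Introduce a content variable that captures static scene information and assists frame generation.
Demonstrate improved performance over baselines on standard stochastic video prediction benchmarks.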

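Proposed method

Frames are generated from a latent state y_t that follows the stochastic residual update y_{t+1} = y_t + f_theta(y_t, z_{t+1}).
A latent random variable z_{t+1} ~ N(mu_theta(y_t), sigma_theta(y_t)) drives the dynamics.
A content variable w, inferred from the conditioning frames, represents static scene information and is fed to the frame decoder.
Training uses variational inference with an evidence lower bound (ELBO) that combines KL terms on y_1 and z_t with the log-likelihood of x_t given y_t and w.
A step size Delta t makes the frame rate controllable, so videos can be generated at different frame rates without retraining.
A regularization term on the residuals f_theta is added during training to stabilize the dynamics.
A CNN-based generator g_theta decodes each frame x_t from y_t and w.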
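The latent rollout described above can be sketched in a few lines of standard-library Python. Everything here is a toy stand-in rather than the authors' implementation: the helper names (`init_params`, `sample_z`, `residual`, `rollout`), the random linear weights, and the `tanh` nonlinearity are hypothetical, sigma_theta is treated as an elementwise standard deviation, and reusing one sample of z for all sub-steps within a unit interval is an assumption of this sketch.

```python
import math
import random


def matvec(weights, vec):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def random_matrix(rows, cols, rng, scale=0.1):
    """Small random weights; a real model would use learned networks."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def init_params(dim_y, dim_z, rng):
    """Hypothetical toy parameters for f_theta, mu_theta and sigma_theta."""
    return {
        "f": random_matrix(dim_y, dim_y + dim_z, rng),
        "mu": random_matrix(dim_z, dim_y, rng),
        "log_sigma": random_matrix(dim_z, dim_y, rng),
    }


def sample_z(y, params, rng):
    """Draw z_{t+1} ~ N(mu_theta(y_t), sigma_theta(y_t)), elementwise."""
    mu = matvec(params["mu"], y)
    # exp keeps the standard deviation positive
    sigma = [math.exp(s) for s in matvec(params["log_sigma"], y)]
    return [rng.gauss(m, s) for m, s in zip(mu, sigma)]


def residual(y, z, params):
    """Toy stand-in for f_theta(y_t, z_{t+1}) on the concatenation [y, z]."""
    return [math.tanh(v) for v in matvec(params["f"], y + z)]


def rollout(y1, num_steps, params, rng, delta_t=1.0):
    """Roll the latent state forward for num_steps unit intervals.

    With delta_t=1.0 each step is y_{t+1} = y_t + f_theta(y_t, z_{t+1}).
    With a smaller delta_t, each unit interval is split into 1/delta_t
    sub-steps that share the same z (an assumption of this sketch).
    """
    substeps = round(1.0 / delta_t)
    y = list(y1)
    states = [y]
    for _ in range(num_steps):
        z = sample_z(y, params, rng)
        for _ in range(substeps):
            r = residual(y, z, params)
            y = [yi + delta_t * ri for yi, ri in zip(y, r)]
            states.append(y)
    return states


if __name__ == "__main__":
    rng = random.Random(0)
    params = init_params(dim_y=4, dim_z=2, rng=rng)
    y1 = [0.0] * 4
    print(len(rollout(y1, 5, params, rng, delta_t=1.0)))  # states at the training frame rate
    print(len(rollout(y1, 5, params, rng, delta_t=0.5)))  # states at twice the frame rate
```

No frame is ever fed back into the recurrence, which is what makes the scheme non-autoregressive in image space. In the full model each state would then be decoded into a frame by g_theta(y_t, w); the decoder and the content variable w are omitted here.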

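Experimental results

Research questions

RQ1: Can a fully latent stochastic model with residual dynamics outperform autoregressive or conventional state-space model (SSM) based methods in stochastic video prediction?
RQ2: Does separating content from dynamics improve learning efficiency and prediction quality?
RQ3: Is the residual formulation of the dynamics compatible with generating video at a frame rate higher than the training frame rate?
RQ4: How does the model compare with state-of-the-art baselines on standard stochastic video prediction benchmarks?
RQ5: Can the model robustly predict diverse futures on datasets such as Moving MNIST, KTH, Human3.6M, and BAIR?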

Key findings

FVD scores by dataset (lower is better):

Dataset	SV2P	SAVP	SVG	StructVRNN	Ours	Ours - Δt/2	Ours - MLP	Ours - GRU
KTH	636 ± 1	374 ± 3	377 ± 6	n/a	222 ± 3	244 ± 3	255 ± 4	240 ± 5
Human3.6M	n/a	n/a	n/a	556 ± 9	416 ± 5	415 ± 3	582 ± 4	1050 ± 20
BAIR	965 ± 17	152 ± 9	255 ± 4	n/a	163 ± 4	222 ± 42	162 ± 4	178 ± 10

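Outperforms the listed baselines on KTH (222 ± 3 vs. 374 ± 3 for SAVP) and Human3.6M (416 ± 5 vs. 556 ± 9 for StructVRNN); on BAIR it trails SAVP slightly (163 ± 4 vs. 152 ± 9) while beating SVG and SV2P.
Models long-horizon dynamics better than SVG, and outperforms the MLP and GRU variants of the same residual framework on KTH and Human3.6M; on BAIR the MLP variant is on par (162 ± 4 vs. 163 ± 4).
Halving Delta t produces higher-frame-rate video without retraining; FVD is essentially unchanged on Human3.6M (415 ± 3 vs. 416 ± 5) and degrades on KTH (244 ± 3 vs. 222 ± 3) and BAIR (222 ± 42 vs. 163 ± 4).
Disentangling the dynamic state (y) from the static content (w) lets the model concentrate on dynamics in the latent space.
Residual dynamics driven by the stochastic latent variable z_t compare favorably with purely deterministic or autoregressive approaches.
FVD scores are strong across datasets, with the largest gains on KTH and Human3.6M.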

This review was generated by AI and checked by a human editor.