QUICK REVIEW

[Paper Review] VideoFlow: A Conditional Flow-Based Model for Stochastic Video Generation

M. N. Anil Kumar, Mohammad Babaeizadeh|arXiv (Cornell University)|Mar 4, 2019

Video Analysis and Summarization | References: 57 | Cited by: 76

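One-Sentence Summary

VideoFlow extends flow-based generative models to conditional video prediction: it optimizes the exact likelihood, produces diverse stochastic futures, and synthesizes frames faster than pixel-level autoregressive video models.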

ABSTRACT

Generative models that can model and predict sequences of future events can, in principle, learn to capture complex real-world phenomena, such as physical interactions. However, a central challenge in video prediction is that the future is highly uncertain: a sequence of past observations of events can imply many possible futures. Although a number of recent works have studied probabilistic models that can represent uncertain futures, such models are either extremely expensive computationally as in the case of pixel-level autoregressive models, or do not directly optimize the likelihood of the data. To our knowledge, our work is the first to propose multi-frame video prediction with normalizing flows, which allows for direct optimization of the data likelihood, and produces high-quality stochastic predictions. We describe an approach for modeling the latent space dynamics, and demonstrate that flow-based generative models offer a viable and competitive approach to generative modelling of video.

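Motivation and Objectives

Motivate stochastic video prediction, where the same sequence of past frames can have many plausible futures.
Propose a flow-based model that conditions on past frames to synthesize future frames.
Introduce a latent dynamical system that models how the flow's latent state evolves over time.
Enable exact log-likelihood evaluation for video generation, avoiding the artifacts of adversarial training.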

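Proposed Method

Uses a multi-scale invertible flow to map each frame x_t to a latent code z_t, with per-frame latent variables z_t^(l) at each level l.
Models an autoregressive prior p(z) over all z_t^(l) across time to capture temporal dynamics.
Trains by maximizing the exact log-likelihood, which combines the flow's Jacobian term with the autoregressive latent prior.
Conditions the flow generator on past frames while keeping the latent dynamics autoregressive in time.
Uses 2-D convolutions together with the autoregressive prior to avoid temporal artifacts and support longer sequences.
Optionally adjusts the sampling temperature to trade off diversity against realism.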
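The training objective and temperature sampling described above can be illustrated with a toy sketch. It is not the VideoFlow architecture: it assumes a single elementwise affine flow per frame in place of the multi-scale flow, and a hypothetical linear-Gaussian prior p(z_t | z_{t-1}) = N(a * z_{t-1}, std^2) in place of the learned autoregressive prior. All function and parameter names (`flow_forward`, `video_log_likelihood`, `a`, `std`, `temperature`) are illustrative assumptions. The exact log-likelihood is the sum over frames of the latent prior log-density plus the flow's log-determinant Jacobian.

```python
import numpy as np

def gaussian_logpdf(z, mean, std):
    """Log density of a diagonal Gaussian, summed over the last axis."""
    return np.sum(
        -0.5 * np.log(2 * np.pi) - np.log(std) - 0.5 * ((z - mean) / std) ** 2,
        axis=-1,
    )

def flow_forward(x, log_scale, shift):
    """Toy invertible map x -> z (elementwise affine) and its log|det Jacobian|."""
    z = x * np.exp(log_scale) + shift
    log_det = np.sum(log_scale)  # same value for every frame
    return z, log_det

def flow_inverse(z, log_scale, shift):
    """Exact inverse of flow_forward."""
    return (z - shift) * np.exp(-log_scale)

def video_log_likelihood(frames, log_scale, shift, a=0.9, std=1.0):
    """Exact log p(x_1..x_T) for frames of shape (T, D) under the toy model.

    Assumed prior: z_t ~ N(a * z_{t-1}, std^2), with z_0 = 0.
    """
    z, log_det = flow_forward(frames, log_scale, shift)
    z_prev = np.vstack([np.zeros((1, z.shape[1])), z[:-1]])
    log_prior = gaussian_logpdf(z, a * z_prev, std)  # one term per frame
    return np.sum(log_prior) + frames.shape[0] * log_det

def sample_video(num_frames, log_scale, shift, a=0.9, std=1.0,
                 temperature=1.0, rng=None):
    """Sample frames one at a time; temperature scales the prior's std."""
    rng = np.random.default_rng() if rng is None else rng
    z_prev = np.zeros_like(shift)
    frames = []
    for _ in range(num_frames):
        noise = rng.standard_normal(z_prev.shape)
        z = a * z_prev + temperature * std * noise
        frames.append(flow_inverse(z, log_scale, shift))
        z_prev = z
    return np.stack(frames)
```

In this sketch a temperature below 1 shrinks the noise injected at each step, so samples stay closer to the prior mean (less diverse), while a temperature of 1 samples from the prior as defined.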

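Experimental Results

Research Questions

RQ1: Can a conditional flow model trained with exact likelihood optimization produce high-quality stochastic video predictions?
RQ2: How does VideoFlow compare with VAE-based and autoregressive video prediction methods in realism, diversity, and sampling speed?
RQ3: Can an autoregressive latent dynamics prior yield coherent multi-frame video generation without expensive 3-D convolutions?
RQ4: Can the model generate longer-horizon predictions while staying temporally consistent under occlusion?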

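Key Findings

On BAIR, VideoFlow achieves competitive stochastic video prediction results, close to state-of-the-art VAE-based models.
On the Stochastic Movement Dataset, VideoFlow achieves a higher fooling rate in real-vs-fake judgments (31.8%) than SAVP-VAE (16.4%) and SV2P (17.5%).
VideoFlow synthesizes faster at test time than pixel-level autoregressive models (for example, a 20-frame 64x64 video in under 3.5 seconds on an NVIDIA P100).
The model directly optimizes the data likelihood, avoids adversarial training artifacts, and can be evaluated directly by log-likelihood.
On the action-free BAIR data, VideoFlow reaches 1.87 bits per pixel, better than several baselines, indicating stronger likelihood-based modeling.
Latent-space interpolations on BAIR show temporally coherent motion, with different levels capturing motion at different scales.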
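For readers unfamiliar with the bits-per-pixel figure above, the following minimal sketch shows how a total negative log-likelihood in nats is commonly converted to that unit. It assumes the normalization is over every colour-channel value of every frame, as is common for flow models; the summary does not state which convention the paper uses, and the function name and arguments are hypothetical.

```python
import math

def bits_per_pixel(nll_nats, num_frames, height, width, channels=3):
    """Convert a total negative log-likelihood (nats) to bits per dimension.

    Assumption: one "dimension" is a single colour-channel value.
    """
    num_dims = num_frames * height * width * channels
    # Divide by ln(2) to go from nats to bits, then average over dimensions.
    return nll_nats / (num_dims * math.log(2))
```

Lower values are better: they mean the model assigns higher likelihood to the held-out videos.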

This review was generated by AI and reviewed by a human editor.