QUICK REVIEW

[Paper Digest] Hierarchical Long-term Video Prediction without Supervision

Nevan Wichers, Ruben Villegas|arXiv (Cornell University)|Jun 12, 2018

Advanced Data Compression Techniques | Cited by 64

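One-sentence summary

This paper proposes an unsupervised hierarchical video prediction framework (EPVA) that learns high-level features and predicts long-term frames without ground-truth high-level supervision, and applies an adversarial loss in feature space to improve prediction performance on Human3.6M.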

ABSTRACT

Much of recent research has been devoted to video prediction and generation, yet most of the previous works have demonstrated only limited success in generating videos on short-term horizons. The hierarchical video prediction method by Villegas et al. (2017) is an example of a state-of-the-art method for long-term video prediction, but their method is limited because it requires ground truth annotation of high-level structures (e.g., human joint landmarks) at training time. Our network encodes the input frame, predicts a high-level encoding into the future, and then a decoder with access to the first frame produces the predicted image from the predicted encoding. The decoder also produces a mask that outlines the predicted foreground object (e.g., person) as a by-product. Unlike Villegas et al. (2017), we develop a novel training method that jointly trains the encoder, the predictor, and the decoder together without high-level supervision; we further improve upon this by using an adversarial loss in the feature space to train the predictor. Our method can predict about 20 seconds into the future and provides better results compared to Denton and Fergus (2018) and Finn et al. (2016) on the Human 3.6M dataset.

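Motivation and goals

Push video prediction on high-dimensional video beyond short time horizons, toward long-term prediction.
Eliminate the need for ground-truth labels of high-level structures during training.
Decouple high-level feature prediction from low-level pixel generation through a hierarchical framework.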

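Proposed method

Encode the input frames into a feature space and use an LSTM to predict future high-level features.
Use a visual analogy network (VAN) with an adaptive mask to generate future frames from the first frame.
Jointly train the encoder, the predictor, and the VAN without high-level supervision, optionally with an analogy-based loss.
In EPVA, minimize a pixel-level L2 loss, optionally constraining the predicted features to match the encoder outputs; apply an adversarial loss in feature space to sharpen predictions.
In EPVA with the adversarial loss, train an LSTM discriminator with a Wasserstein loss to distinguish predicted feature sequences from real ones, and use its feedback to improve generation.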
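
The objectives listed above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: frames and features are flat lists of floats instead of tensors, the blending rule in `composite`, the weight `alpha`, and all function names are assumptions made for illustration, and the critic scores are plain numbers standing in for the output of the LSTM discriminator.

```python
def mse(a, b):
    """Mean squared error (L2 loss) between two equal-length flat lists."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def composite(first_frame, generated, mask):
    """Assumed mask blending: mask=1 keeps the generated pixel,
    mask=0 copies the pixel from the first frame."""
    return [m * g + (1.0 - m) * f
            for f, g, m in zip(first_frame, generated, mask)]


def epva_loss(pred_frame, true_frame, pred_feat, enc_feat, alpha=1.0):
    """Pixel-level L2 loss plus an optional feature-matching term.
    alpha is a hypothetical weight; alpha=0 disables the constraint."""
    return mse(pred_frame, true_frame) + alpha * mse(pred_feat, enc_feat)


def critic_loss(real_scores, fake_scores):
    """Wasserstein critic objective: push real scores up, predicted scores down."""
    return mean(fake_scores) - mean(real_scores)


def predictor_adv_loss(fake_scores):
    """Adversarial term for the predictor: raise the critic's score on predictions."""
    return -mean(fake_scores)


if __name__ == "__main__":
    first = [0.2, 0.2, 0.2, 0.2]       # background copied from the first frame
    generated = [0.9, 0.9, 0.9, 0.9]   # pixels proposed by the decoder
    mask = [1.0, 0.0, 0.5, 0.0]        # soft foreground mask
    frame = composite(first, generated, mask)
    print("composited frame:", frame)
    print("EPVA loss:", epva_loss(frame, first, [0.0, 1.0], [0.0, 0.5], alpha=0.5))
    print("critic loss:", critic_loss([1.0, 3.0], [0.0, 1.0]))
```

In this sketch the mask is what lets the decoder reuse static background from the first frame and spend its capacity on the moving foreground, which is consistent with the mask outlining the predicted person as a by-product. In a real system each function would operate on batched tensors and the scores would come from a trained critic.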

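Experimental results

Research questions

RQ1: Can long-term video prediction be achieved without supervised annotations of high-level structure?
RQ2: Without ground-truth annotations, does jointly training the encoder, the predictor, and the VAN end to end improve long-term prediction quality?
RQ3: Does adversarial training in feature space yield sharper, more realistic long-term predictions than an L2 objective alone?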

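Key findings

On Human3.6M and a toy dataset, EPVA's long-term predictions are sharper than those of the end-to-end L2 baseline.
On a toy bouncing-shapes dataset, EPVA predicts the color of the shape correctly about 97% of the time, versus about 25% for the CDNA baseline.
On Human3.6M, EPVA Adversarial significantly outperforms Finn et al. (2016) and Denton & Fergus (2018) in human evaluations of realism for frames 64–127.
EPVA can reveal foreground motion segmentation masks, indicating that the network discovers the structure of moving objects.
Pose regression on the learned encoder features achieves a relative error reduction of about 9% compared with VGG-based features.
Compared with L2 alone, the adversarial loss in feature space helps reduce blur and improves long-term realism.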

This digest was generated by AI and reviewed by human editors.