QUICK REVIEW

[Paper Review] Learning to Generate Long-term Future via Hierarchical Prediction

Ruben Villegas, Shuicheng Yan|arXiv (Cornell University)|Apr 19, 2017

Human Pose and Action Recognition | References: 6 | Cited by: 180

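One-sentence summary

Proposes a hierarchical framework that first predicts high-level structure (human pose) and then generates long-term future frames from a single observed frame plus the predicted structure, avoiding the error accumulation caused by recursive pixel-level prediction. Experiments on Human3.6M and Penn Action show better long-term video prediction than prior methods.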

ABSTRACT

We propose a hierarchical approach for making long-term predictions of future frames. To avoid inherent compounding errors in recursive pixel-level prediction, we propose to first estimate high-level structure in the input frames, then predict how that structure evolves in the future, and finally by observing a single frame from the past and the predicted high-level structure, we construct the future frames without having to observe any of the pixel-level predictions. Long-term video prediction is difficult to perform by recurrently observing the predicted frames because the small errors in pixel space exponentially amplify as predictions are made deeper into the future. Our approach prevents pixel-level error propagation from happening by removing the need to observe the predicted frames. Our model is built with a combination of LSTM and analogy based encoder-decoder convolutional neural networks, which independently predict the video structure and generate the future frames, respectively. In experiments, our model is evaluated on the Human3.6M and Penn Action datasets on the task of long-term pixel-level video prediction of humans performing actions and demonstrate significantly better results than the state-of-the-art.

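Research motivation and goals

Address the difficulty of long-term pixel-level video prediction, where recursively generating frames causes errors to accumulate.
Propose a hierarchical approach that first predicts high-level structure and then uses that structure to generate future frames.
Reduce error propagation by not relying on previously generated frames at prediction time.
Demonstrate effectiveness on real-world human action datasets (Penn Action and Human3.6M).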
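To make the error-accumulation argument concrete, here is a minimal toy sketch. It is not the paper's model: the scalar dynamics, the `GAIN` value, and the 0.01 model error are all assumptions chosen for illustration. It rolls a slightly wrong one-step predictor forward by feeding each output back in, and tracks how the relative error grows with the horizon.

```python
import numpy as np

GAIN = 1.05          # hypothetical "true" per-step dynamics
MODEL_ERROR = 0.01   # hypothetical small error in the learned step


def true_step(x):
    return GAIN * x


def learned_step(x):
    # A one-step predictor that is only slightly wrong.
    return (GAIN + MODEL_ERROR) * x


def recursive_rollout(step, x0, n_steps):
    """Roll forward by feeding each prediction back in as the next input."""
    xs = [x0]
    for _ in range(n_steps):
        xs.append(step(xs[-1]))
    return np.array(xs[1:])


truth = recursive_rollout(true_step, 1.0, 128)
pred = recursive_rollout(learned_step, 1.0, 128)

# Relative error at each horizon; it grows geometrically in this toy setup.
rel_err = np.abs(pred - truth) / np.abs(truth)
print(f"step 1: {rel_err[0]:.4f}, step 128: {rel_err[-1]:.4f}")
```

In this toy setting a per-step error under 1% becomes an error larger than the signal itself after 128 recursive steps, which is the behavior the hierarchical design tries to sidestep by never conditioning on its own generated frames.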

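Proposed method

Estimate high-level structure (2D pose heatmaps) from the observed frames with an Hourglass pose estimator.
Predict future poses from past pose dynamics with a sequence-to-sequence LSTM, without feeding the generated poses back as input.
Generate future frames by visual-structure analogy: in a shared image-structure embedding, transform the last observed frame according to the difference between the predicted future pose and the last observed pose.
Train the pose predictor and the image generator separately, and combine them at test time for multi-step prediction.
Optimize with a composite loss covering image quality, feature-space similarity, and adversarial realism (with a mismatch-aware discriminator).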
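The steps above can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: random linear maps stand in for the learned encoders and decoder, a plain tanh recurrent cell stands in for the LSTM, poses are flat vectors rather than heatmaps, and all names and sizes (`W_pose`, `W_img`, `W_dec`, `forecast_poses`, `generate_frame`, `predict_video`) are hypothetical. Feeding a zero input during forecasting is one simple way to realize "no feedback of generated poses".

```python
import numpy as np

rng = np.random.default_rng(0)
POSE_DIM, IMG_DIM, EMB_DIM = 26, 64, 32  # toy sizes (assumptions)

# Random linear maps standing in for learned networks (assumptions).
W_pose = rng.normal(size=(EMB_DIM, POSE_DIM)) * 0.1  # structure encoder
W_img = rng.normal(size=(EMB_DIM, IMG_DIM)) * 0.1    # image encoder
W_dec = rng.normal(size=(IMG_DIM, EMB_DIM)) * 0.1    # decoder
W_h = rng.normal(size=(EMB_DIM, EMB_DIM)) * 0.1      # recurrent weights
W_in = rng.normal(size=(EMB_DIM, POSE_DIM)) * 0.1    # input weights
W_out = rng.normal(size=(POSE_DIM, EMB_DIM)) * 0.1   # pose readout


def forecast_poses(observed_poses, n_future):
    """Encode observed poses, then roll the hidden state forward.

    During forecasting the input is all zeros, so predicted poses
    are never fed back into the recurrent cell.
    """
    h = np.zeros(EMB_DIM)
    for p in observed_poses:
        h = np.tanh(W_h @ h + W_in @ p)
    zero_input = np.zeros(POSE_DIM)
    future = []
    for _ in range(n_future):
        h = np.tanh(W_h @ h + W_in @ zero_input)
        future.append(W_out @ h)
    return np.array(future)


def generate_frame(last_frame, last_pose, future_pose):
    """Analogy step: shift the frame embedding by the pose difference."""
    pose_shift = W_pose @ future_pose - W_pose @ last_pose
    return W_dec @ (W_img @ last_frame + pose_shift)


def predict_video(last_frame, observed_poses, n_future):
    """Every future frame uses only the last real frame and a predicted pose."""
    future_poses = forecast_poses(observed_poses, n_future)
    last_pose = observed_poses[-1]
    return np.array(
        [generate_frame(last_frame, last_pose, p) for p in future_poses]
    )


observed = rng.normal(size=(10, POSE_DIM))  # 10 observed poses (toy data)
frame = rng.normal(size=IMG_DIM)            # last observed frame (toy data)
video = predict_video(frame, observed, n_future=5)
print(video.shape)
```

Note that `generate_frame` never receives a previously generated frame, so an error in one output frame cannot leak into the next; only errors in the predicted poses carry forward.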

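Experimental results

Research questions

RQ1: Can predicting high-level structure before frames improve long-term pixel-level video prediction by avoiding pixel-level error accumulation?
RQ2: How well does pose-based hierarchical prediction generate realistic future frames on challenging human action datasets?
RQ3: Can the visual-structure analogy mechanism accurately generate future frames from the predicted high-level structure?
RQ4: How does the training strategy (training the structure predictor and the image generator separately) affect long-term prediction performance?
RQ5: Can the method generate long sequences (up to 128 steps) on real datasets better than recursive pixel-to-pixel methods?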

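Key findings

The hierarchical approach achieves long-term prediction of up to 128 frames on Penn Action and Human3.6M, outperforming the baselines.
A pose-based LSTM predicts future pose sequences from past poses, so errors in generated frames are never propagated.
Visual-structure analogy with a shared embedding produces high-quality future frames conditioned on the predicted structure, without observing any predicted frames.
Human evaluation (AMT) and action recognition tests show better perceived realism and more accurate predicted actions than the convolutional LSTM and optical flow baselines.
Background motion is not modeled; the method focuses on predicting foreground human motion and generates frames from a single observed frame.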


This review was generated by AI and reviewed by a human editor.