QUICK REVIEW

[Paper Review] Long-Term Video Generation of Multiple Futures Using Human Poses.

Naoya Fushishita, Antonio Tejero-de-Pablos | arXiv (Cornell University) | Apr 16, 2019

Human Pose and Action Recognition | References: 26 | Cited by: 2

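One-Sentence Summary

This paper proposes a novel adversarial learning framework that generates multiple long-term video futures from human pose sequences by combining a latent code with an attraction point, yielding diverse behaviors and trajectories. Using unidimensional (1D) convolutional networks, the method predicts extended pose sequences and renders realistic output videos, outperforming prior methods in realism, diversity, and accuracy.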

ABSTRACT

Predicting the near-future from an input video is a useful task for applications such as autonomous driving and robotics. While most previous works predict a single future, multiple futures with different behaviors can possibly occur. Moreover, if the predicted future is too short, it may not be fully usable by a human or other system. In this paper, we propose a novel method for future video prediction capable of generating multiple long-term futures. This makes the predictions more suitable for real applications. First, from an input human video, we generate sequences of future human poses as the image coordinates of their body-joints by adversarial learning. We generate multiple futures by inputting to the generator combinations of a latent code (to reflect various behaviors) and an attraction point (to reflect various trajectories). In addition, we generate long-term future human poses using a novel approach based on unidimensional convolutional neural networks. Last, we generate an output video based on the generated poses for visualization. We evaluate the generated future poses and videos using three criteria (i.e., realism, diversity and accuracy), and show that our proposed method outperforms other state-of-the-art works.
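
The pairing of a latent code with an attraction point described in the abstract can be illustrated with a minimal Python sketch. Every name here (`one_hot`, `enumerate_conditions`, the one-hot code layout, the four target points, the 320x240 frame size) is a hypothetical choice made for illustration and is not taken from the paper. The sketch only shows that each (code, point) combination yields one conditioning input, and therefore one candidate future.

```python
from itertools import product

def one_hot(index, size):
    """Return a one-hot list of length `size` with 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def enumerate_conditions(num_codes, attraction_points):
    """Pair every latent code with every attraction point.

    Assumption: the latent code is a one-hot vector and the attraction
    point is an (x, y) image coordinate. The paper's encoding may differ.
    """
    conditions = []
    for code_idx, point in product(range(num_codes), attraction_points):
        conditions.append({
            "latent_code": one_hot(code_idx, num_codes),  # selects a behavior
            "attraction_point": point,                    # selects a trajectory target
        })
    return conditions

# Hypothetical setup: 3 behavior codes, 4 target points in a 320x240 frame.
points = [(0, 120), (319, 120), (160, 0), (160, 239)]
conditions = enumerate_conditions(3, points)
print(len(conditions))
```

With 3 codes and 4 points, a generator conditioned this way would be queried 12 times for the same input video, once per combination.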

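Research Motivation and Goals

Address the single-future limitation of existing video prediction methods, improving their usability in real applications such as robotics and autonomous driving.
Overcome the short time horizon of existing methods by predicting long-term future video beyond the immediate next frames.
Generate varied behaviors and trajectories across the predicted futures to reflect the diversity of real human motion.
Increase the practical value of video prediction by producing visually coherent, temporally consistent output videos.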

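Proposed Method

Generate sequences of future human body-joint coordinates from the input video frames by adversarial learning.
Feed a latent code and an attraction point to the generator as conditioning inputs to model diverse behaviors and motion trajectories.
Use unidimensional convolutional neural networks (1D-CNNs) to model long-term temporal dependencies in the pose sequence.
Convert the generated pose sequences into output video for qualitative evaluation and visualization.
Train the generator with an adversarial loss to improve the realism and consistency of the predicted poses.
Optimize the model with a combination of perceptual, adversarial, and reconstruction losses to balance realism, diversity, and accuracy.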
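
To make the 1D-convolution step concrete, here is a minimal pure-Python sketch of a convolution that slides along the time axis of a pose sequence. It is not the paper's network: the function name `conv1d_time`, the toy data, and the choice of a single kernel shared by all coordinate channels are assumptions for illustration. A trained 1D-CNN would instead learn separate weights that also mix channels.

```python
def conv1d_time(sequence, kernel):
    """Valid 1D convolution (cross-correlation) along the time axis.

    sequence: list of T frames, each a flat list of C joint coordinates.
    kernel:   list of K weights, shared by all C channels (a simplification).
    Returns a list of T - K + 1 frames, each with C values.
    """
    num_frames, width = len(sequence), len(kernel)
    channels = len(sequence[0])
    out = []
    for t in range(num_frames - width + 1):
        # Each output value is a weighted sum over a window of `width` frames.
        frame = [
            sum(kernel[k] * sequence[t + k][c] for k in range(width))
            for c in range(channels)
        ]
        out.append(frame)
    return out

# Toy pose sequence: 5 frames, 2 joints stored as (x1, y1, x2, y2).
poses = [[float(t), 0.0, float(t) + 1.0, 2.0] for t in range(5)]

# A 3-frame moving average acts as a fixed, hand-picked temporal filter.
smoothed = conv1d_time(poses, [1 / 3, 1 / 3, 1 / 3])
print(len(smoothed), len(smoothed[0]))
```

Because the same kernel is applied at every time step, the cost grows linearly with sequence length, and stacking such layers widens the temporal window that each output frame can see.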

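Experimental Results

Research Questions

RQ1: Can a video prediction model generate multiple long-term futures that reflect diverse human behaviors and trajectories?
RQ2: How does a 1D-CNN-based architecture compare with recurrent networks or 2D-CNN approaches at modeling long-term temporal dynamics in human pose sequences?
RQ3: To what extent do the latent code and the attraction point improve the diversity and realism of the predicted futures?
RQ4: How do the generated videos compare with real videos in perceptual quality and motion plausibility?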

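Key Findings

The multiple long-term video futures generated by the proposed method outperform state-of-the-art baselines in realism, diversity, and accuracy.
The latent code and the attraction point effectively disentangle behavior from trajectory, substantially increasing the diversity of the predictions.
The 1D-CNN-based pose generation network models long-term temporal dependencies effectively and produces consistent, plausible motion sequences.
Both human evaluation and quantitative metrics confirm that the model generates more realistic and more diverse futures than existing methods.
The generated videos are visually coherent and show plausible motion patterns over extended time horizons.
On standard benchmarks, the model performs strongly on all three evaluation criteria: realism, diversity, and accuracy.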
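
As a rough illustration of how diversity and accuracy could be scored for a set of generated pose sequences, the sketch below uses two simple stand-in measures: mean pairwise distance between futures (diversity) and the distance from the closest future to the ground truth (accuracy). These definitions and the names `seq_distance`, `diversity`, and `best_match_error` are assumptions for illustration. The review does not state the paper's exact metrics, and realism, which usually needs human raters or a learned critic, is not covered here.

```python
import math
from itertools import combinations

def seq_distance(a, b):
    """Mean per-frame Euclidean distance between two equal-shape pose sequences."""
    total = 0.0
    for frame_a, frame_b in zip(a, b):
        total += math.sqrt(sum((x - y) ** 2 for x, y in zip(frame_a, frame_b)))
    return total / len(a)

def diversity(futures):
    """Hypothetical diversity score: mean distance over all pairs of futures."""
    pairs = list(combinations(futures, 2))
    return sum(seq_distance(a, b) for a, b in pairs) / len(pairs)

def best_match_error(futures, ground_truth):
    """Hypothetical accuracy score: distance of the closest future to the truth."""
    return min(seq_distance(f, ground_truth) for f in futures)

# Toy data: three futures, each 2 frames of a single joint (x, y).
futures = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[3.0, 4.0], [3.0, 4.0]],
    [[6.0, 8.0], [6.0, 8.0]],
]
truth = [[3.0, 4.0], [3.0, 4.0]]

print(diversity(futures), best_match_error(futures, truth))
```

Scoring accuracy against the closest future rather than the average reflects the multi-future setting: only one future actually happens, so a diverse set should not be penalized for the alternatives that did not occur.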


This review was generated by AI and reviewed by human editors.