QUICK REVIEW

[Paper Review] Unsupervised Learning of Disentangled Representations from Video

Emily Denton, Vighnesh Birodkar|arXiv (Cornell University)|May 31, 2017

Generative Adversarial Networks and Image Synthesis | Cited by 228

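One-Sentence Summary

DrNet learns disentangled content (time-invariant) and pose (time-varying) representations from video using a novel adversarial loss, which makes long-range frame prediction possible and allows effective classification from either component.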

ABSTRACT

We present a new model DrNET that learns disentangled image representations from video. Our approach leverages the temporal coherence of video and a novel adversarial loss to learn a representation that factorizes each frame into a stationary part and a temporally varying component. The disentangled representation can be used for a range of tasks. For example, applying a standard LSTM to the time-varying components enables prediction of future frames. We evaluate our approach on a range of synthetic and real videos, demonstrating the ability to coherently generate hundreds of steps into the future.

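Motivation and Objectives

Advance unsupervised learning of robust video representations without labels.
Factorize each video frame into a static content component and a dynamic pose component.
Introduce an adversarial loss that ensures the pose representation carries no clip-specific content information.
Demonstrate long-range frame prediction and classification using the disentangled features.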

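Proposed Method

Two encoders produce a content representation (E_c) and a pose representation (E_p) for each frame.
A decoder (D) predicts a future frame from the concatenation of the content features and the future frame's pose features.
An adversarial discriminator (C) forces the pose features not to reveal which clip they came from.
A similarity loss encourages the content features to vary slowly over time.
The overall objective combines the reconstruction, similarity, and adversarial terms with tunable weights.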
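
The components listed above can be sketched in code. What follows is a minimal NumPy illustration, not the authors' implementation. The linear maps standing in for E_c, E_p, D, and C, the dimensions, the weight names `alpha` and `beta`, and the exact form of each term (mean squared error for reconstruction and similarity, binary cross-entropy for the adversarial terms) are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: flattened frame, content vector, pose vector.
FRAME_DIM, CONTENT_DIM, POSE_DIM = 64, 16, 4

# Toy linear stand-ins for E_c, E_p, D and C (assumption: real models
# would be convolutional networks trained by gradient descent).
W_c = rng.normal(scale=0.1, size=(FRAME_DIM, CONTENT_DIM))
W_p = rng.normal(scale=0.1, size=(FRAME_DIM, POSE_DIM))
W_d = rng.normal(scale=0.1, size=(CONTENT_DIM + POSE_DIM, FRAME_DIM))
w_disc = rng.normal(scale=0.1, size=(2 * POSE_DIM,))


def encode_content(x):
    return x @ W_c


def encode_pose(x):
    return x @ W_p


def decode(h_c, h_p):
    # The decoder sees the concatenated content and pose features.
    return np.concatenate([h_c, h_p], axis=-1) @ W_d


def discriminate(p_a, p_b):
    # Estimated probability that two pose vectors come from the same clip.
    logit = np.concatenate([p_a, p_b], axis=-1) @ w_disc
    return 1.0 / (1.0 + np.exp(-logit))


def bce(prob, target, eps=1e-7):
    # Binary cross-entropy against a (possibly soft) target.
    prob = np.clip(prob, eps, 1.0 - eps)
    return float(np.mean(-(target * np.log(prob) + (1 - target) * np.log(1 - prob))))


def drnet_style_losses(x_t, x_tk, x_other, alpha=1.0, beta=0.1):
    """x_t and x_tk are frames from the same clip; x_other is from another clip."""
    h_c_t, h_c_tk = encode_content(x_t), encode_content(x_tk)
    h_p_t, h_p_tk = encode_pose(x_t), encode_pose(x_tk)
    h_p_other = encode_pose(x_other)

    # Reconstruction: content from time t plus pose from time t+k -> frame t+k.
    rec = float(np.mean((decode(h_c_t, h_p_tk) - x_tk) ** 2))
    # Similarity: content should change slowly within a clip.
    sim = float(np.mean((h_c_t - h_c_tk) ** 2))

    same = discriminate(h_p_t, h_p_tk)
    diff = discriminate(h_p_t, h_p_other)
    # Discriminator objective: tell same-clip pairs from different-clip pairs.
    disc_loss = bce(same, 1.0) + bce(diff, 0.0)
    # Pose-encoder objective: push the discriminator toward 0.5 (maximal uncertainty).
    enc_adv = bce(same, 0.5)

    total = rec + alpha * sim + beta * enc_adv
    return {"rec": rec, "sim": sim, "enc_adv": enc_adv,
            "disc_loss": disc_loss, "total": total}


# Usage with random stand-in frames (batch of 8).
frames_t = rng.normal(size=(8, FRAME_DIM))
frames_tk = rng.normal(size=(8, FRAME_DIM))
frames_other = rng.normal(size=(8, FRAME_DIM))
losses = drnet_style_losses(frames_t, frames_tk, frames_other)
print(sorted(losses))
```

In an actual training loop the discriminator would be updated on `disc_loss` while the encoders and decoder are updated on `total`; how the two updates are alternated is a design choice this sketch leaves open.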

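Experimental Results

Research Questions

RQ1: Can video frames be factorized into time-invariant content and time-varying pose without supervision?
RQ2: Does adversarial training on the pose features strengthen the content/pose disentanglement while preserving predictive reconstruction?
RQ3: Can the disentangled representation support accurate long-range video prediction and downstream classification?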

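Main Findings

The model shows a clean content/pose factorization on both synthetic and real videos.
A standard LSTM that predicts future pose features, combined with fixed content taken from the last observed frame, enables frame prediction hundreds of steps into the future.
Content features support semantic classification, while pose features support action classification.
The adversarial loss is essential for disentanglement; removing it degrades both content/pose separation and classification performance.
On the NORB dataset, content features reach high accuracy at β=0.1, while the performance of pose features varies across β settings (see table).
On real video (KTH) and on synthetic data, the method yields qualitative results that are competitive with or better than the baselines.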
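
The long-range prediction finding above rests on a simple loop: hold the content vector fixed, roll a recurrent predictor forward over pose vectors, and decode each step. The sketch below illustrates that loop under stated assumptions. A plain tanh recurrent cell with random weights stands in for the trained LSTM, the decoder is a toy linear map, and all names and dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for pose, content, frame and recurrent state.
POSE_DIM, CONTENT_DIM, FRAME_DIM, HIDDEN = 4, 16, 64, 8

# Random weights for a plain tanh recurrent cell (a stand-in, not an LSTM).
W_in = rng.normal(scale=0.3, size=(POSE_DIM, HIDDEN))
W_h = rng.normal(scale=0.3, size=(HIDDEN, HIDDEN))
W_out = rng.normal(scale=0.3, size=(HIDDEN, POSE_DIM))
# Toy linear decoder over concatenated [content, pose].
W_dec = rng.normal(scale=0.1, size=(CONTENT_DIM + POSE_DIM, FRAME_DIM))


def rnn_step(pose, hidden):
    # One recurrent step: consume the current pose, emit the next pose.
    if hidden is None:
        hidden = np.zeros(HIDDEN)
    hidden = np.tanh(pose @ W_in + hidden @ W_h)
    return hidden @ W_out, hidden


def decode(content, pose):
    return np.concatenate([content, pose]) @ W_dec


def rollout(content, observed_poses, n_steps):
    """Predict n_steps future frames with the content vector held fixed.

    observed_poses must contain at least one pose vector.
    """
    hidden = None
    # Warm up the recurrent state on the observed pose sequence.
    for pose in observed_poses:
        next_pose, hidden = rnn_step(pose, hidden)
    frames = []
    for _ in range(n_steps):
        frames.append(decode(content, next_pose))
        # Feed the prediction back in as the next input (autoregressive).
        next_pose, hidden = rnn_step(next_pose, hidden)
    return np.stack(frames)


# Usage: content from the last observed frame, five observed poses.
content = rng.normal(size=CONTENT_DIM)
observed = rng.normal(size=(5, POSE_DIM))
future = rollout(content, observed, n_steps=200)
print(future.shape)
```

Because only the low-dimensional pose vector is fed back into the predictor, the content vector is never re-estimated from generated frames, so it cannot drift over a long rollout.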


This review was generated by AI and reviewed by human editors.