QUICK REVIEW

[Paper Review] Unsupervised Learning of Video Representations using LSTMs

Nitish Srivastava, Elman Mansimov|arXiv (Cornell University)|Feb 16, 2015

Human Pose and Action Recognition | References: 30 | Cited by: 1,663

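One-sentence summary

This paper proposes LSTM-based unsupervised models, an autoencoder and a future predictor, for learning representations of video sequences. Pretrained on unlabeled YouTube videos, the models learn disentangled, generalizable features that transfer to action recognition on the UCF-101 and HMDB-51 datasets and improve classification accuracy, most clearly when only a few labeled examples are available.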

ABSTRACT

We use multilayer Long Short Term Memory (LSTM) networks to learn representations of video sequences. Our model uses an encoder LSTM to map an input sequence into a fixed length representation. This representation is decoded using single or multiple decoder LSTMs to perform different tasks, such as reconstructing the input sequence, or predicting the future sequence. We experiment with two kinds of input sequences - patches of image pixels and high-level representations ("percepts") of video frames extracted using a pretrained convolutional net. We explore different design choices such as whether the decoder LSTMs should condition on the generated output. We analyze the outputs of the model qualitatively to see how well the model can extrapolate the learned video representation into the future and into the past. We try to visualize and interpret the learned features. We stress test the model by running it on longer time scales and on out-of-domain data. We further evaluate the representations by finetuning them for a supervised learning problem - human action recognition on the UCF-101 and HMDB-51 datasets. We show that the representations help improve classification accuracy, especially when there are only a few training examples. Even models pretrained on unrelated datasets (300 hours of YouTube videos) can help action recognition performance.

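Research motivation and goals

Learn meaningful, disentangled video representations without supervision by modeling temporal sequences.
Assess whether unsupervised pretraining with LSTMs improves performance on the downstream supervised task of action recognition.
Study how the training objective (reconstruction vs. future prediction) affects representation quality.
Analyze how well the learned representations generalize and extrapolate beyond the time scales seen in training.
Evaluate whether representations learned on unrelated video data (e.g., 300 hours of YouTube videos) transfer to action recognition benchmarks.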

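Proposed method

A multilayer LSTM encoder compresses a sequence of video frames into a fixed-length latent representation.
One or more decoder LSTMs reconstruct the input sequence or predict future frames from the encoded representation.
Two main training objectives are used, autoencoding (reconstruction) and future prediction; a composite model trains on both at once.
Two input types are used: patches of image pixels (e.g., MNIST digits) and high-level representations ("percepts") of video frames extracted by a convolutional net pretrained on ImageNet.
Conditional decoding feeds the generated output back into the decoder; performance is compared with and without this conditioning input.
The representations are evaluated on supervised action recognition by finetuning the encoder on the UCF-101 and HMDB-51 datasets.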
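The training setup described in the list above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names (`make_composite_targets`, `composite_loss`, `rollout`) and the `step` callable are hypothetical, the squared-error loss is an assumption, and the reversed reconstruction target follows the original paper rather than this summary. The LSTM itself is abstracted away as a `step` function so that only the data flow is shown.

```python
from typing import Callable, List, Sequence, Tuple

Frame = List[float]  # one flattened image patch or percept vector
# Hypothetical decoder step: (state, previous_frame) -> (new_state, output_frame)
StepFn = Callable[[object, Frame], Tuple[object, Frame]]


def make_composite_targets(
    frames: Sequence[Frame], input_len: int, future_len: int
) -> Tuple[List[Frame], List[Frame], List[Frame]]:
    """Split a clip into encoder input, reconstruction target, and future target."""
    if input_len <= 0 or future_len <= 0:
        raise ValueError("input_len and future_len must be positive")
    if input_len + future_len > len(frames):
        raise ValueError("clip is too short for the requested split")
    inputs = [list(f) for f in frames[:input_len]]
    # Assumption (from the paper, not this summary): the autoencoder
    # decoder emits the input frames in reverse order.
    recon_target = inputs[::-1]
    future_target = [list(f) for f in frames[input_len:input_len + future_len]]
    return inputs, recon_target, future_target


def squared_error(pred: Sequence[Frame], target: Sequence[Frame]) -> float:
    """Sum of squared differences over all frames and all dimensions."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same number of frames")
    return sum((p - t) ** 2 for pf, tf in zip(pred, target) for p, t in zip(pf, tf))


def composite_loss(
    recon_pred: Sequence[Frame], recon_target: Sequence[Frame],
    future_pred: Sequence[Frame], future_target: Sequence[Frame],
) -> float:
    """Composite objective: reconstruction error plus future-prediction error."""
    return squared_error(recon_pred, recon_target) + squared_error(future_pred, future_target)


def rollout(
    step: StepFn, state: object, num_steps: int, frame_dim: int, conditioned: bool
) -> List[Frame]:
    """Unroll a decoder for num_steps, optionally feeding back its own output."""
    outputs: List[Frame] = []
    prev = [0.0] * frame_dim  # the first step receives a zero frame
    for _ in range(num_steps):
        state, out = step(state, prev)
        outputs.append(out)
        # Conditional decoder: the next input is the frame just generated.
        # Unconditioned decoder: the input stays at zero.
        prev = out if conditioned else [0.0] * frame_dim
    return outputs
```

With a five-frame clip and `input_len=3, future_len=2`, the encoder sees frames 0 to 2, the reconstruction decoder is trained toward frames 2, 1, 0, and the future decoder toward frames 3 and 4. Both decoders start from the same encoder state, which is what lets a single representation serve both objectives.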

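Experimental results

Research questions

RQ1: Can unsupervised LSTM-based models learn general video representations that capture motion and appearance structure without labels?
RQ2: How does combining the reconstruction and future prediction objectives affect representation quality, compared with using either objective alone?
RQ3: How much do representations pretrained on unrelated video data (e.g., 300 hours of YouTube videos) improve action recognition when labeled examples are limited?
RQ4: How well does the model extrapolate motion and appearance beyond the sequence length used in training?
RQ5: Does conditioning the decoder on its own generated output improve future prediction or representation learning?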

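Key findings

The composite model, which combines the autoencoding and future prediction objectives, performed best on action recognition, reaching 75.8% accuracy on UCF-101 and 44.0% on HMDB-51.
Pretraining on 300 hours of YouTube videos improved action recognition accuracy, with the largest gains when very few labeled examples were available.
The model kept generating plausible motion beyond the time scales seen in training, although object details degraded in long-range predictions.
Conditioning the decoder on its own output did not noticeably improve supervised task performance, but gave slightly better qualitative future predictions.
The model outperforms a standard LSTM baseline and, using RGB data only, matches or exceeds the performance of state-of-the-art models such as LRCN and C3D.
A model that combines RGB and optical flow predictions reached 84.3% accuracy on UCF-101, which suggests strong potential for fusion with other modalities.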
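One common way to combine per-modality predictions like those in the last finding is late fusion: each stream produces class probabilities, and the two are averaged before taking the top class. The sketch below illustrates that general idea only. This summary does not say how the paper fused its RGB and flow predictions, so the weighted average, the `rgb_weight` parameter, and the function names are all assumptions.

```python
from typing import List, Sequence


def late_fusion(
    rgb_probs: Sequence[float], flow_probs: Sequence[float], rgb_weight: float = 0.5
) -> List[float]:
    """Weighted average of two class-probability vectors (assumed fusion rule)."""
    if len(rgb_probs) != len(flow_probs):
        raise ValueError("both streams must score the same set of classes")
    if not 0.0 <= rgb_weight <= 1.0:
        raise ValueError("rgb_weight must lie in [0, 1]")
    return [
        rgb_weight * r + (1.0 - rgb_weight) * f
        for r, f in zip(rgb_probs, flow_probs)
    ]


def predict(probs: Sequence[float]) -> int:
    """Index of the highest-scoring class."""
    return max(range(len(probs)), key=lambda i: probs[i])
```

For example, if the RGB stream favors class 0 with probabilities `[0.6, 0.3, 0.1]` and the flow stream favors class 1 with `[0.2, 0.7, 0.1]`, equal weighting gives roughly `[0.4, 0.5, 0.1]`, so the fused prediction follows the more confident flow stream.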

This review was generated by AI and reviewed by human editors.