QUICK REVIEW

[Paper Review] Temporal Pyramid Pooling Based Convolutional Neural Networks for Action Recognition

Peng Wang, Yuanzhouhan Cao|arXiv (Cornell University)|Mar 4, 2015

Human Pose and Action Recognition | References: 24 | Cited by: 35

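Summary at a Glance

The paper proposes a CNN architecture built on temporal pyramid pooling (TPP) that combines appearance and motion features through an encoding layer and multi-level temporal pooling, enabling action recognition on videos with an arbitrary number of frames. By initializing weights from a pretrained image CNN, the method achieves state-of-the-art performance on Hollywood2 and HMDB51 while substantially reducing the amount of training data required.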

ABSTRACT

Encouraged by the success of Convolutional Neural Networks (CNNs) in image classification, recently much effort is spent on applying CNNs to video based action recognition problems. One challenge is that video contains a varying number of frames which is incompatible to the standard input format of CNNs. Existing methods handle this issue either by directly sampling a fixed number of frames or bypassing this issue by introducing a 3D convolutional layer which conducts convolution in spatial-temporal domain. To solve this issue, here we propose a novel network structure which allows an arbitrary number of frames as the network input. The key of our solution is to introduce a module consisting of an encoding layer and a temporal pyramid pooling layer. The encoding layer maps the activation from previous layers to a feature vector suitable for pooling while the temporal pyramid pooling layer converts multiple frame-level activations into a fixed-length video-level representation. In addition, we adopt a feature concatenation layer which combines appearance information and motion information. Compared with the frame sampling strategy, our method avoids the risk of missing any important frames. Compared with the 3D convolutional method which requires a huge video dataset for network training, our model can be learned on a small target dataset because we can leverage the off-the-shelf image-level CNN for model parameter initialization. Experiments on two challenging datasets, Hollywood2 and HMDB51, demonstrate that our method achieves superior performance over state-of-the-art methods while requiring much fewer training data.

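Research Motivation and Objectives

Handle variable-length video input in CNN-based action recognition, which is incompatible with standard fixed-size CNN inputs.
Avoid the risk of frame sampling and reduce the dependence on large-scale video datasets by training end to end on a small target dataset.
Model temporal structure explicitly through hierarchical pooling to improve video-level representation learning.
Fuse appearance and motion features effectively with an early fusion strategy to improve recognition accuracy.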

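Proposed Method

Introduce a new network module that pairs an encoding layer with a temporal pyramid pooling layer, converting a variable number of frame-level activations into a fixed-length video-level representation.
Use a two-stream architecture: one stream for appearance features (from an ImageNet-pretrained CNN) and one for motion features (dense trajectories with MBH descriptors).
Apply multi-level temporal pyramid pooling that covers the whole video and divides it into b segments, capturing temporal dynamics at multiple scales.
Fuse appearance and motion features early, through a feature concatenation layer placed before the final classification.
Use an ImageNet-pretrained CNN (e.g., GoogLeNet) for feature extraction and weight initialization, lowering the risk of overfitting on small datasets.
Encode motion features with Fisher vectors, which are already suitable for pooling, and apply the additional encoding layer to the feature maps of the CNN's last convolutional layer.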
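The temporal pyramid pooling step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `temporal_pyramid_pool`, the pyramid `levels=(1, 2, 4)`, and the choice of max pooling within each segment are all hypothetical choices made for the example. The point it shows is that the output length depends only on the pyramid and the feature dimension, never on the number of frames.

```python
import numpy as np


def temporal_pyramid_pool(frames, levels=(1, 2, 4)):
    """Pool a (T, D) array of frame-level features into one fixed-length vector.

    Assumption: each pyramid level splits the video into `num_segments` roughly
    equal temporal segments and max-pools inside every segment.
    """
    frames = np.asarray(frames, dtype=np.float64)
    num_frames = frames.shape[0]
    pooled = []
    for num_segments in levels:
        # Segment boundaries spread evenly over the whole video.
        edges = np.linspace(0, num_frames, num_segments + 1)
        for i in range(num_segments):
            # Clamp so that every segment holds at least one frame,
            # even when the video has fewer frames than segments.
            start = min(int(np.floor(edges[i])), num_frames - 1)
            stop = max(int(np.ceil(edges[i + 1])), start + 1)
            pooled.append(frames[start:stop].max(axis=0))
    # Output length is sum(levels) * D, independent of T.
    return np.concatenate(pooled)


# Two clips with different frame counts map to vectors of the same length.
short_clip = np.random.default_rng(0).normal(size=(12, 16))
long_clip = np.random.default_rng(1).normal(size=(95, 16))
print(temporal_pyramid_pool(short_clip).shape, temporal_pyramid_pool(long_clip).shape)
```

With the assumed levels (1, 2, 4) there are 7 segments in total, so a 16-dimensional frame feature yields a 7 × 16 = 112-dimensional video vector for both clips. Segments at the finer levels may overlap by one frame when the frame count is not divisible by the number of segments; that is a simplification of this sketch.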

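Experimental Results

Research Questions

RQ1: Can a CNN-based action recognition model handle variable-length video input without frame sampling or 3D convolution?
RQ2: Does temporal pyramid pooling improve the video-level representation by modeling temporal structure at multiple scales?
RQ3: Does early fusion of appearance and motion features outperform late fusion for action recognition?
RQ4: Compared with 3D CNNs, how far does the method reduce the need for large-scale video datasets?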

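Key Findings

On Hollywood2, the method reaches 67.5% accuracy with early fusion, 2.8 percentage points above late fusion (64.7%).
On HMDB51, the method reaches 59.7% accuracy with early fusion, 2.0 percentage points above late fusion (57.7%).
The best temporal pyramid configuration uses b=5 segments, reaching 44.2% accuracy on Hollywood2 and 41.3% on HMDB51, gains of 6.0 and 2.8 percentage points respectively over the baseline (b=0).
The method achieves state-of-the-art performance on both Hollywood2 and HMDB51 while requiring substantially less training data than 3D CNNs.
The encoding layer gives a clear gain for CNN features (e.g., FC7) but very little for motion features that are already encoded (e.g., Fisher vectors), which is consistent with its role as a feature normalization step.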
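The early versus late fusion comparison in the findings above can be made concrete with a small sketch. Everything here is an assumption for illustration only: the review does not say which classifier is used or how late-fusion scores are combined, so this example uses plain linear classifiers with a softmax and averages the two streams' class probabilities. The function names and weight arguments are hypothetical.

```python
import numpy as np


def softmax(scores):
    """Numerically stable softmax over a 1-D score vector."""
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()


def early_fusion_scores(appearance, motion, weights, bias):
    """Early fusion: concatenate both video-level vectors, then classify once."""
    fused = np.concatenate([appearance, motion])
    return softmax(weights @ fused + bias)


def late_fusion_scores(appearance, motion, w_app, b_app, w_mot, b_mot):
    """Late fusion (assumed form): classify each stream separately,
    then average the two class-probability vectors."""
    p_app = softmax(w_app @ appearance + b_app)
    p_mot = softmax(w_mot @ motion + b_mot)
    return 0.5 * (p_app + p_mot)


# Toy example with random features and weights (3 classes).
rng = np.random.default_rng(0)
appearance, motion = rng.normal(size=6), rng.normal(size=4)
print(early_fusion_scores(appearance, motion, rng.normal(size=(3, 10)), np.zeros(3)))
print(late_fusion_scores(appearance, motion,
                         rng.normal(size=(3, 6)), np.zeros(3),
                         rng.normal(size=(3, 4)), np.zeros(3)))
```

The structural difference is that the early-fusion classifier sees both feature types at once and can weigh them jointly, whereas late fusion only combines the two streams after each has produced its own prediction.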

This review was generated by AI and reviewed by human editors.