QUICK REVIEW

[Paper Review] Action Recognition with Image Based CNN Features

Mahdyar Ravanbakhsh, Hossein Mousavi|arXiv (Cornell University)|Dec 13, 2015

Human Pose and Action Recognition | References: 40 | Cited by: 60

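One-Sentence Summary

This paper proposes a hierarchical feature representation built on fc7 features from a CNN pre-trained on ImageNet for video action recognition, with no fine-tuning on video data. It applies binary coding to track how the fc7 features change over time, extracts key frames from bit flips, and uses a multi-level pyramid structure to capture motion information, reporting state-of-the-art (SOTA) accuracy on the KTH, UCF-Sports, and UCF-11 datasets.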

ABSTRACT

Most of human actions consist of complex temporal compositions of more simple actions. Action recognition tasks usually relies on complex handcrafted structures as features to represent the human action model. Convolutional Neural Nets (CNN) have shown to be a powerful tool that eliminate the need for designing handcrafted features. Usually, the output of the last layer in CNN (a layer before the classification layer -known as fc7) is used as a generic feature for images. In this paper, we show that fc7 features, per se, can not get a good performance for the task of action recognition, when the network is trained only on images. We present a feature structure on top of fc7 features, which can capture the temporal variation in a video. To represent the temporal components, which is needed to capture motion information, we introduced a hierarchical structure. The hierarchical model enables to capture sub-actions from a complex action. At the higher levels of the hierarchy, it represents a coarse capture of action sequence and lower levels represent fine action elements. Furthermore, we introduce a method for extracting key-frames using binary coding of each frame in a video, which helps to improve the performance of our hierarchical model. We experimented our method on several action datasets and show that our method achieves superior results compared to other state-of-the-arts methods.

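Research Motivation and Objectives

Recognize human actions in video using CNN features pre-trained only on images, avoiding a costly video-specific training process.
Model temporal dynamics in video through the spatio-temporal variation of pre-trained CNN features (fc7) rather than hand-crafted spatio-temporal descriptors.
Improve recognition accuracy by introducing a new key-frame extraction method, based on binary coding of fc7 features, that focuses on the informative parts of a video.
Model actions hierarchically, from coarse to fine sub-actions, by building a multi-level pyramid over video segments.
Show that image-based CNN features, when combined with temporal modeling and key-frame selection, can outperform state-of-the-art methods on standard benchmarks.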

Proposed Method

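Extract fc7 features from every frame of the video using a CNN pre-trained on ImageNet (for example an AlexNet-style network, where fc7 is the second fully connected layer).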
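Convert each fc7 feature into a short binary code through vector quantization or hashing, so that frames can be compared efficiently over time.
Identify key frames by detecting bit flips in the binary codes of consecutive frames, selecting the parts of the video where the features change significantly.
Split the video into segments between key frames and apply a hierarchical pyramid structure to each segment to model the action at multiple temporal scales.
Apply PCA dimensionality reduction at each pyramid level and concatenate the features from all levels into a single video-level descriptor.
Build a temporal bag-of-words histogram from the descriptors and train a classifier (such as an SVM) for action recognition.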
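
The binary-coding, key-frame, and pyramid steps above can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper: random-hyperplane hashing stands in for whichever quantization or hashing scheme the authors use, mean pooling stands in for their per-level aggregation, the pyramid is applied to the whole toy sequence rather than to each segment between key frames, and the PCA and bag-of-words steps are left out. All function names (`make_hyperplanes`, `binary_code`, `key_frames`, `pyramid_descriptor`) are hypothetical.

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplanes (assumed stand-in for the paper's hashing)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def binary_code(feature, hyperplanes):
    """One bit per hyperplane: 1 if the projection of the feature is positive."""
    return tuple(
        int(sum(w * x for w, x in zip(plane, feature)) > 0)
        for plane in hyperplanes
    )


def hamming(a, b):
    """Number of bit positions in which two codes differ."""
    return sum(x != y for x, y in zip(a, b))


def key_frames(codes, min_flips=1):
    """Indices t whose code differs from frame t-1 in at least min_flips bits."""
    return [
        t for t in range(1, len(codes))
        if hamming(codes[t - 1], codes[t]) >= min_flips
    ]


def mean_pool(frames):
    """Average a list of equal-length feature vectors, dimension by dimension."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def pyramid_descriptor(features, levels=3):
    """Concatenate mean-pooled features over 1, 2, 4, ... equal temporal parts."""
    n = len(features)
    descriptor = []
    for level in range(levels):
        parts = 2 ** level
        for p in range(parts):
            start, end = p * n // parts, (p + 1) * n // parts
            # Fall back to a single frame when a part would be empty.
            chunk = features[start:end] or [features[min(start, n - 1)]]
            descriptor.extend(mean_pool(chunk))
    return descriptor


# Toy "video": 8 frames of 4-D features with an abrupt change at frame 4.
video = [[1.0, 0.0, 0.0, 0.0]] * 4 + [[-1.0, 0.0, 0.0, 0.0]] * 4
planes = make_hyperplanes(dim=4, n_bits=16)
codes = [binary_code(frame, planes) for frame in video]

print(key_frames(codes))                        # [4]
print(len(pyramid_descriptor(video, levels=3)))  # 28 = (1 + 2 + 4) parts * 4 dims
```

In this toy input the feature vector changes sign at frame 4, so the binary code flips there and nowhere else. With real fc7 features the `min_flips` threshold would presumably need tuning so that small frame-to-frame noise does not produce a key frame at every step.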

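Experimental Results

Research Questions

RQ1: Can pre-trained image-based CNN features (fc7) reach competitive action recognition performance through better temporal modeling alone?
RQ2: How effective is binary coding of fc7 features at detecting informative video segments and extracting key frames?
RQ3: How much does the hierarchical pyramid structure improve action recognition by modeling sub-actions at multiple temporal granularities?
RQ4: How do hyperparameters such as binary code size, window length, and pyramid depth affect recognition accuracy?
RQ5: Does the proposed method outperform existing state-of-the-art methods on standard action recognition benchmarks?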

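Key Findings

The method reports state-of-the-art accuracy on the KTH dataset, peaking at 94.0% with a 16-bit binary code and a 4-level pyramid.
On the UCF-Sports dataset, the best accuracy is 98.0%, obtained with 20-frame overlapping windows and a 4-level pyramid, which suggests strong robustness on short video clips.
On the UCF-11 dataset, evaluated with 25-fold leave-one-out cross-validation, the method achieves a marked gain in accuracy over the previous state-of-the-art methods.
Increasing the number of pyramid levels raises recognition accuracy, indicating that fine-grained temporal modeling improves performance.
The method shows consistent gains across datasets, with the best performance at a 16-bit binary code and a window length of 20 to 30 frames.
The confusion matrix on the KTH dataset shows high accuracy for every action class, with the highest per-class accuracy of 100% reported for the 'walking' and 'direction' actions.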
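
As a reading aid for the confusion-matrix finding, the sketch below shows how per-class and overall accuracy are conventionally derived from a confusion matrix whose rows are true classes and whose columns are predicted classes. The counts are made up for illustration and are not the paper's results; the helper names `per_class_accuracy` and `overall_accuracy` are hypothetical.

```python
def per_class_accuracy(confusion, labels):
    """Per-class accuracy: diagonal count divided by the row (true-class) total."""
    return {
        label: row[i] / sum(row)
        for i, (label, row) in enumerate(zip(labels, confusion))
        if sum(row) > 0
    }


def overall_accuracy(confusion):
    """Overall accuracy: sum of the diagonal divided by the total sample count."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total


# Made-up counts for three action classes (NOT taken from the paper).
labels = ["walking", "jogging", "running"]
confusion = [
    [10, 0, 0],  # true walking
    [1, 8, 1],   # true jogging
    [0, 2, 8],   # true running
]

print(per_class_accuracy(confusion, labels))
print(round(overall_accuracy(confusion), 3))  # 0.867
```

A class reaches 100% only when every off-diagonal entry in its row is zero, as in the first row of this toy matrix.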

This review was generated by AI and checked by a human editor.