QUICK REVIEW

[Paper Review] Temporal Activity Detection in Untrimmed Videos with Recurrent Neural Networks

Alberto Montes, Amaia Salvador | arXiv (Cornell University) | Aug 29, 2016

Human Pose and Action Recognition | References: 11 | Cited by: 82

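One-Sentence Summary

This paper proposes a simple yet effective pipeline for temporal activity detection in untrimmed videos: 3D-CNN (C3D) features are fed into an RNN with LSTM units that classifies and localizes activities. The method achieved a classification mAP of 0.5874 and a detection mAP of 0.2237 in the ActivityNet Challenge 2016, with post-processing (smoothing and thresholding) further improving localization accuracy.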

ABSTRACT

This thesis explores different approaches using Convolutional and Recurrent Neural Networks to classify and temporally localize activities in videos, and proposes an implementation to achieve it. As a first step, features are extracted from video frames using a state-of-the-art 3D Convolutional Neural Network. These features are fed into a recurrent neural network that solves the activity classification and temporal localization tasks in a simple and flexible way. Different architectures and configurations have been tested in order to achieve the best performance and learning on the video dataset provided. In addition, different kinds of post-processing over the trained network's output have been studied to achieve better results on the temporal localization of activities in the videos. The results provided by the neural network developed in this thesis have been submitted to the ActivityNet Challenge 2016 at CVPR, achieving competitive results using a simple and flexible architecture.

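Research Motivation and Objectives

Address the challenge of classifying and temporally localizing activities in untrimmed videos, that is, videos that have not been pre-segmented.
Develop a simple, end-to-end trainable framework that exploits both spatial and temporal features in video sequences.
Refine the RNN output sequence with post-processing techniques to improve detection performance.
Achieve competitive results on both the classification and temporal localization tasks of the ActivityNet Challenge 2016 benchmark.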

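Proposed Method

Extract 4096-dimensional C3D fc6 features from 16-frame clips of the untrimmed videos, with the input resized to 171×128.
Feed the sequence of C3D features into a stacked LSTM network, with dropout (p=0.5), to model the sequence and learn temporal dependencies.
Use a final softmax layer to output class probabilities for each 16-frame clip over K+1 classes, one of which is a background class.
Apply a mean filter (k=5) to smooth the predicted activity probabilities over time, reducing noise in the sequence.
Apply a threshold γ, keeping only the clips whose activity probability exceeds γ and assigning them the predicted class.
Take the final video-level class to be the class with the highest average probability across all clips.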
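
The post-processing steps above can be sketched roughly as follows. This is a minimal, dependency-free illustration and not the authors' code. The function names, the use of class index 0 as background, the definition of activity probability as 1 minus the background probability, the exclusion of background from the video-level vote, and the shrinking window at the sequence edges are all assumptions made for the example.

```python
from typing import List, Optional, Tuple


def mean_filter(values: List[float], k: int = 5) -> List[float]:
    """Moving average with window size k.

    Assumption: near the edges the window simply shrinks to the
    samples that exist, instead of padding the sequence.
    """
    half = k // 2
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


def video_level_class(probs: List[List[float]], background: int = 0) -> int:
    """Return the non-background class with the highest mean probability."""
    num_classes = len(probs[0])
    means = [sum(p[c] for p in probs) / len(probs) for c in range(num_classes)]
    candidates = [c for c in range(num_classes) if c != background]
    return max(candidates, key=lambda c: means[c])


def postprocess(
    probs: List[List[float]],
    k: int = 5,
    gamma: float = 0.2,
    background: int = 0,
) -> Tuple[int, List[Optional[int]]]:
    """Smooth, threshold, and label a sequence of per-clip softmax outputs.

    probs[t][c] is the probability of class c for clip t.
    Returns (video_class, labels), where labels[t] is video_class if the
    smoothed activity probability of clip t exceeds gamma, else None.
    """
    video_class = video_level_class(probs, background)
    # Assumption: "activity probability" = 1 - P(background).
    activity = [1.0 - p[background] for p in probs]
    smoothed = mean_filter(activity, k)
    labels = [video_class if a > gamma else None for a in smoothed]
    return video_class, labels


# Toy example: 6 clips, K = 2 activity classes plus background (index 0).
example_probs = [
    [0.9, 0.05, 0.05],
    [0.8, 0.15, 0.05],
    [0.2, 0.70, 0.10],
    [0.1, 0.80, 0.10],
    [0.3, 0.60, 0.10],
    [0.9, 0.05, 0.05],
]
print(postprocess(example_probs, k=3, gamma=0.5))
# (1, [None, None, 1, 1, 1, None])
```

In the toy example, a smaller window (k=3) and a higher threshold (γ=0.5) than the reported k=5 and γ=0.2 are used only so that the effect is visible on six clips.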

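Experimental Results

Research Questions

RQ1: Can a simple RNN architecture using pre-extracted C3D features achieve competitive performance on both video classification and temporal activity detection?
RQ2: How does post-processing by smoothing and thresholding affect the localization accuracy of the activity predictions?
RQ3: Which RNN architecture (number of layers and units) offers the best balance between performance and generalization for activity detection in untrimmed videos?
RQ4: How does class imbalance in the dataset affect model training, and what strategies should be used to mitigate it?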

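Key Findings

The single-layer 512-LSTM configuration achieved the best classification mAP, 0.5938, outperforming deeper architectures because it overfit less.
The 3x1024-LSTM model achieved the highest Hit@3, 0.7437, indicating strong accuracy within its top-3 predictions.
Post-processing with a mean filter (k=5) and a threshold of γ=0.2 raised the detection mAP to 0.22513, the highest of all tested configurations.
The optimal post-processing parameters were γ=0.2 and k=5, which gave the best balance between precision and recall on the localization task.
The model achieved a detection mAP of 0.2237 on the ActivityNet 2016 test set, demonstrating strong performance on the temporal localization task.
The results show that even a simple pipeline combining C3D features with an RNN can achieve competitive results without end-to-end training.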
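
Detection mAP on ActivityNet is computed by matching predicted temporal segments to ground-truth segments using temporal intersection-over-union (tIoU). The sketch below shows how per-clip labels could be merged into segments and compared with one annotation. The merging rule, the frame rate, and all names here are assumptions for illustration, since the summary does not describe them.

```python
from typing import List, Optional, Sequence, Tuple

Segment = Tuple[float, float, int]  # (start_seconds, end_seconds, class_id)


def clips_to_segments(
    labels: Sequence[Optional[int]],
    clip_len: int = 16,
    fps: float = 25.0,
) -> List[Segment]:
    """Merge runs of consecutive clips with the same label into segments.

    labels[t] is a class id, or None for a clip treated as background.
    Assumption: clips are non-overlapping and clip t covers frames
    [t * clip_len, (t + 1) * clip_len).
    """
    labels = list(labels)
    segments: List[Segment] = []
    start = None  # index of the first clip in the current run
    # A trailing None acts as a sentinel that closes the last run.
    for i, label in enumerate(labels + [None]):
        if start is not None and label != labels[start]:
            segments.append(
                (start * clip_len / fps, i * clip_len / fps, labels[start])
            )
            start = None
        if start is None and label is not None:
            start = i
    return segments


def temporal_iou(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """Temporal IoU of two (start, end) intervals in seconds."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


# Toy example: 16-frame clips at a hypothetical 16 fps, so one clip = 1 second.
predicted = clips_to_segments([None, None, 1, 1, 1, None], clip_len=16, fps=16.0)
print(predicted)                                  # [(2.0, 5.0, 1)]
print(temporal_iou(predicted[0][:2], (3.0, 6.0)))  # 0.5
```

A prediction typically counts as correct only when its class matches and its tIoU with an annotation reaches a chosen threshold, which is why smoothing and thresholding choices such as k and γ directly affect the detection score.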

This review was generated by AI and reviewed by human editors.