QUICK REVIEW

[Paper Review] An End-to-end 3D Convolutional Neural Network for Action Detection and Segmentation in Videos

Rui Hou, Chen Chen|arXiv (Cornell University)|Nov 30, 2017

Human Pose and Action Recognition | References: 1 | Cited by: 41

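One-Sentence Summary

This paper proposes an end-to-end 3D CNN framework for action detection and segmentation in videos and introduces two approaches: Tube-CNN (T-CNN), built on top-down tube proposals, and a segmentation-driven CNN (ST-CNN), built on bottom-up pixel-wise action segmentation. ST-CNN achieves state-of-the-art performance on the DAVIS dataset with a mean intersection over union (IoU) of 77.6%, outperforming prior methods in challenging scenarios such as low-contrast and dynamic videos.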

ABSTRACT

In this paper, we propose an end-to-end 3D CNN for action detection and segmentation in videos. The proposed architecture is a unified deep network that is able to recognize and localize action based on 3D convolution features. A video is first divided into equal length clips and next for each clip a set of tube proposals are generated based on 3D CNN features. Finally, the tube proposals of different clips are linked together and spatio-temporal action detection is performed using these linked video proposals. This top-down action detection approach explicitly relies on a set of good tube proposals to perform well and training the bounding box regression usually requires a large number of annotated samples. To remedy this, we further extend the 3D CNN to an encoder-decoder structure and formulate the localization problem as action segmentation. The foreground regions (i.e. action regions) for each frame are segmented first then the segmented foreground maps are used to generate the bounding boxes. This bottom-up approach effectively avoids tube proposal generation by leveraging the pixel-wise annotations of segmentation. The segmentation framework also can be readily applied to a general problem of video object segmentation. Extensive experiments on several video datasets demonstrate the superior performance of our approach for action detection and video object segmentation compared to the state-of-the-arts.

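Research Motivation and Objectives

Address the challenges of applying deep learning to spatio-temporal action detection in videos, in particular the high computational cost and the lack of large-scale annotated video data.
Overcome the limitations of top-down detection methods, which depend on anchor boxes and need a large number of annotated bounding boxes for regression.
Improve localization accuracy by replacing coarse bounding-box proposals with dense pixel-wise segmentation maps.
Develop a unified end-to-end 3D CNN framework that jointly learns spatio-temporal features for action recognition and localization.
Demonstrate superior performance on benchmark datasets such as DAVIS and THUMOS14, especially in difficult scenes with motion blur or low contrast.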

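Proposed Method

The method divides the input video into equal-length clips and uses a Tube Proposal Network (TPN) to generate 3D tube proposals from 3D CNN features.
Tube proposals from adjacent clips are linked into complete action tubes using their actionness scores and spatio-temporal overlap.
A Tube-of-Interest (ToI) pooling layer extracts fixed-size features from the linked tubes for action classification.
An encoder-decoder 3D CNN architecture performs end-to-end pixel-wise action segmentation, replacing tube proposal generation with dense foreground map prediction.
Bounding boxes are generated from the segmentation maps, giving a bottom-up detection strategy that does not rely on anchor-box priors.
The ST-CNN variant processes each clip in a single forward pass; by removing the two-stage pipeline, it runs inference 3x faster than T-CNN.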
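Two of the steps listed above are compact enough to sketch in code: linking per-clip tube proposals, and deriving a bounding box from a segmentation map. The sketch below is illustrative only and is not the authors' implementation. The function names (`link_tubes`, `mask_to_bbox`, `box_iou`), the proposal layout `(actionness, first_box, last_box)`, and the unweighted sum of actionness and overlap are all assumptions made for this example; the review does not give the exact scoring or normalization used in the paper.

```python
from typing import List, Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2)
Proposal = Tuple[float, Box, Box]         # (actionness, first-frame box, last-frame box)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def link_tubes(clips: Sequence[Sequence[Proposal]]) -> List[int]:
    """Choose one proposal per clip by dynamic programming.

    The path score (an assumption of this sketch) is the sum of actionness
    scores plus the IoU between the last box of each chosen tube and the
    first box of the tube chosen in the next clip.
    Returns the index of the chosen proposal for every clip.
    """
    if not clips or any(len(c) == 0 for c in clips):
        raise ValueError("every clip needs at least one proposal")
    best = [p[0] for p in clips[0]]        # best path score ending at each proposal
    back: List[List[int]] = []             # back-pointers, one list per transition
    for prev, cur in zip(clips, clips[1:]):
        new_best, ptr = [], []
        for score, first_box, _ in cur:
            cands = [best[i] + box_iou(prev[i][2], first_box) for i in range(len(prev))]
            i_star = max(range(len(prev)), key=cands.__getitem__)
            new_best.append(cands[i_star] + score)
            ptr.append(i_star)
        best = new_best
        back.append(ptr)
    idx = max(range(len(best)), key=best.__getitem__)
    path = [idx]
    for ptr in reversed(back):             # walk the back-pointers to the first clip
        idx = ptr[idx]
        path.append(idx)
    return path[::-1]


def mask_to_bbox(mask: Sequence[Sequence[int]]) -> Optional[Tuple[int, int, int, int]]:
    """Tightest box (x1, y1, x2, y2), in inclusive pixel indices, around the
    nonzero pixels of a binary foreground mask; None if the mask is empty."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [x for row in mask for x, v in enumerate(row) if v]
    return (min(cols), rows[0], max(cols), rows[-1])
```

With two clips of two proposals each, `link_tubes` prefers a pair that overlaps spatially over a pair that merely has high individual scores, which is the intuition behind combining actionness with overlap. `mask_to_bbox` shows why the bottom-up route needs no anchor priors: the box follows directly from the predicted foreground pixels.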

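Experimental Results

Research Questions

RQ1: Can a unified 3D CNN framework perform end-to-end action detection and segmentation in videos without relying on frame-level proposal generation?
RQ2: How does bottom-up pixel-wise segmentation compare with top-down tube-proposal detection in localization accuracy and robustness to visual variation?
RQ3: Can an encoder-decoder 3D CNN architecture effectively learn spatio-temporal representations under sparse supervision for dense video segmentation?
RQ4: What is the computational-efficiency trade-off between the two-stage (T-CNN) and single-stage (ST-CNN) detection pipelines in 3D CNN-based action detection?
RQ5: How does the proposed method compare with state-of-the-art methods on challenging video sequences with low contrast, motion blur, or small objects?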

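Key Findings

The proposed ST-CNN achieves a mean IoU of 77.6% on the DAVIS dataset, outperforming all prior methods, including ARP, LVO, and FSEG.
On challenging sequences such as Blackswan, Scooter-Black, and Car-Roundabout, the method obtains the highest IoU, indicating stronger performance in low-contrast and dynamic scenes.
Qualitative comparisons show that the method segments fine details (such as wheel rims, legs, and tails) that other methods tend to miss.
Thanks to its single-stage inference pipeline, ST-CNN is 3x faster than T-CNN and processes a 40-frame video in 0.7 seconds.
On DAVIS, the method reaches 95.2% recall and a 94.7% F1 score, indicating high detection accuracy and strong robustness to object motion and occlusion.
The model shows good temporal stability, with a decay score of only 2.3, well below most baselines, indicating consistent segmentation quality across frames.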
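The region metrics quoted above (mean IoU, recall, decay) can be computed from per-frame masks in a few lines. The sketch below is a minimal illustration, not the benchmark's official evaluation code. It assumes the commonly used DAVIS-style conventions, which this review does not define: recall is the fraction of frames whose IoU exceeds a threshold of 0.5, and decay is the mean IoU of the first quarter of frames minus that of the last quarter. The function names `mask_iou` and `summarize` are hypothetical.

```python
from typing import Dict, List, Sequence

Mask = Sequence[Sequence[int]]  # binary foreground mask, rows of 0/1


def mask_iou(pred: Mask, gt: Mask) -> float:
    """Intersection over union of two binary masks of equal size."""
    inter = union = 0
    for prow, grow in zip(pred, gt):
        for p, g in zip(prow, grow):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Convention assumed here: two empty masks agree perfectly.
    return inter / union if union else 1.0


def summarize(ious: List[float], threshold: float = 0.5) -> Dict[str, float]:
    """Aggregate per-frame IoU values (in temporal order) into mean, recall, decay."""
    n = len(ious)
    if n == 0:
        raise ValueError("need at least one frame")
    mean = sum(ious) / n
    recall = sum(1 for v in ious if v > threshold) / n
    q = max(1, n // 4)                       # size of the first and last quarter
    decay = sum(ious[:q]) / q - sum(ious[-q:]) / q
    return {"mean": mean, "recall": recall, "decay": decay}
```

A small decay means the IoU at the end of a sequence is close to the IoU at the start, which is how a low decay score supports the temporal-stability claim.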


This review was generated by AI and reviewed by a human editor.