QUICK REVIEW

[Paper Review] Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks

Zhaofan Qiu, Ting Yao|arXiv (Cornell University)|Nov 28, 2017

Human Pose and Action Recognition | References: 34 | Cited by: 251

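One-Sentence Summary

The paper proposes Pseudo-3D (P3D) blocks, which simulate 3D convolutions inside a residual network by pairing 2D spatial filters with 1D temporal filters. The resulting P3D ResNet variants learn better video representations than conventional 2D and 3D CNNs.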

ABSTRACT

Convolutional Neural Networks (CNN) have been regarded as a powerful class of models for image recognition problems. Nevertheless, it is not trivial when utilizing a CNN for learning spatio-temporal video representation. A few studies have shown that performing 3D convolutions is a rewarding approach to capture both spatial and temporal dimensions in videos. However, the development of a very deep 3D CNN from scratch results in expensive computational cost and memory demand. A valid question is why not recycle off-the-shelf 2D networks for a 3D CNN. In this paper, we devise multiple variants of bottleneck building blocks in a residual learning framework by simulating $3\times3\times3$ convolutions with $1\times3\times3$ convolutional filters on spatial domain (equivalent to 2D CNN) plus $3\times1\times1$ convolutions to construct temporal connections on adjacent feature maps in time. Furthermore, we propose a new architecture, named Pseudo-3D Residual Net (P3D ResNet), that exploits all the variants of blocks but composes each in different placement of ResNet, following the philosophy that enhancing structural diversity with going deep could improve the power of neural networks. Our P3D ResNet achieves clear improvements on Sports-1M video classification dataset against 3D CNN and frame-based 2D CNN by 5.3% and 1.8%, respectively. We further examine the generalization performance of video representation produced by our pre-trained P3D ResNet on five different benchmarks and three different tasks, demonstrating superior performances over several state-of-the-art techniques.

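Research Motivation and Objectives

Learn spatio-temporal video representations efficiently without resorting to a full 3D convolutional neural network.
Develop bottleneck blocks that simulate 3x3x3 convolutions with 1x3x3 spatial filters and 3x1x1 temporal filters.
Explore different block designs (P3D-A/B/C) and mix them within a ResNet to improve performance.
Demonstrate that P3D ResNet outperforms 3D CNNs and frame-based CNNs on multiple video datasets.
Show that pre-training the 2D spatial filters on images and learning the 1D temporal filters on video data yields strong generalization.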
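
To make the cost argument behind this factorization concrete, here is a small sketch that counts convolution weights for one full 3x3x3 layer versus a 1x3x3 layer followed by a 3x1x1 layer. The channel width of 64 and the helper name `conv_weights` are assumptions for illustration only, biases are ignored, and a weight count says nothing about measured speed or accuracy.

```python
def conv_weights(c_in, c_out, kt, kh, kw):
    """Number of weights in a 3D convolution with a kt x kh x kw kernel (no bias)."""
    return c_in * c_out * kt * kh * kw


channels = 64  # hypothetical width, not taken from the paper

# One full 3D convolution.
full_3d = conv_weights(channels, channels, 3, 3, 3)

# Pseudo-3D: a spatial 1x3x3 convolution plus a temporal 3x1x1 convolution.
spatial = conv_weights(channels, channels, 1, 3, 3)
temporal = conv_weights(channels, channels, 3, 1, 1)
pseudo_3d = spatial + temporal

ratio = pseudo_3d / full_3d
print(full_3d, pseudo_3d, round(ratio, 3))
```

With equal input and output widths, the factorized pair uses 12 weights per channel pair instead of 27. The spatial part also has the same shape as an ordinary 2D 3x3 filter, which is what allows it to be initialized from an image-trained network.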

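Proposed Method

Formulate 3D convolution and decompose it into a 2D spatial component (1x3x3) and a 1D temporal component (3x1x1).
Propose three P3D block designs (A, B, C) that differ in the direct and indirect connections between the S (spatial) path and the T (temporal) path.
Adopt a bottleneck scheme, with 1x1 convolutions that reduce and then restore dimensionality around the spatial and temporal filters.
Build P3D ResNet by replacing ResNet blocks with P3D blocks and mixing A/B/C blocks for structural diversity.
Pre-train on Sports-1M (large-scale video) and evaluate the network as a general-purpose video representation extractor across tasks.
Compare against ResNet-50, C3D, and other baselines on UCF101, ActivityNet, ASLAN, YUPENN, and Dynamic Scene.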
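
The three block designs differ only in how the spatial path S and the temporal path T are wired around the identity shortcut. The following is a minimal sketch of those wirings, assuming the serial (A), parallel (B), and hybrid (C) arrangements described in the P3D paper. The function names and the toy scalar stand-ins for S and T are hypothetical, and the 1x1 bottleneck convolutions, batch normalization, and nonlinearities are deliberately left out.

```python
def p3d_a(x, S, T):
    # Serial: S feeds T, and only T connects directly to the output.
    return x + T(S(x))


def p3d_b(x, S, T):
    # Parallel: S and T both read the input and both add to the output.
    return x + S(x) + T(x)


def p3d_c(x, S, T):
    # Hybrid: S feeds T, and S also has its own shortcut to the output.
    s = S(x)
    return x + s + T(s)


def toy_spatial(v):
    # Stand-in for a 1x3x3 spatial convolution (illustrative only).
    return 2 * v


def toy_temporal(v):
    # Stand-in for a 3x1x1 temporal convolution (illustrative only).
    return v + 1


out_a = p3d_a(1, toy_spatial, toy_temporal)
out_b = p3d_b(1, toy_spatial, toy_temporal)
out_c = p3d_c(1, toy_spatial, toy_temporal)
print(out_a, out_b, out_c)
```

In a real network, `S` and `T` would be convolution layers operating on 5D video tensors. The toy functions only make it easy to see that the three wirings produce different outputs from the same two operators.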

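Experimental Results

Research Questions

RQ1: Can Pseudo-3D blocks effectively replace full 3D convolutions for capturing spatio-temporal information in video?
RQ2: Do the different P3D block designs (A, B, C) offer complementary benefits, and does mixing them improve performance?
RQ3: Is P3D ResNet, pre-trained on image data (for the spatial filters) plus video data (for the temporal filters), more effective than pure 3D CNNs or frame-based methods?
RQ4: How well does P3D ResNet perform as a general-purpose video representation across datasets and tasks?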

Key Findings

Method	Model size	Speed	Accuracy
ResNet-50	92MB	15.0 frame/s	80.8%
P3D-A ResNet	98MB	9.0 clip/s	83.7%
P3D-B ResNet	98MB	8.8 clip/s	82.8%
P3D-C ResNet	98MB	8.6 clip/s	83.0%
P3D ResNet	98MB	8.8 clip/s	84.2%

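The P3D variants outperform ResNet-50 (82.8% to 84.2% versus 80.8% in the table above) and are competitive with or ahead of C3D, while keeping a moderate model size and an efficient runtime.
Mixing P3D-A, P3D-B, and P3D-C (the full P3D ResNet) yields a further accuracy gain over any single variant, which indicates that architectural diversity is valuable.
On Sports-1M, P3D ResNet achieves higher accuracy than several baselines: 47.9% clip hit@1, 66.4% video hit@1, and 87.4% video hit@5.
On UCF101, P3D ResNet with frame input only reaches 88.6% top-1 accuracy, exceeding ResNet-152 and C3D; fused with IDT, it reaches 93.7%.
On ActivityNet, P3D ResNet achieves 75.12% top-1, 87.71% top-3, and 78.86% MAP, outperforming several methods, including the IDT, C3D, and ResNet-152 baselines.
Visualizations show that P3D ResNet captures both spatial patterns and temporal motion, and t-SNE shows that P3D ResNet representations form semantically cleaner clusters.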
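
The Sports-1M figures above distinguish clip-level from video-level hit@k. Below is a minimal sketch of how such metrics could be computed, assuming that a video's score is the mean of its clip scores. This summary does not state the aggregation rule, so treat that as an assumption; the helper names and the toy scores are likewise hypothetical.

```python
def hit_at_k(scores, label, k):
    """Return True if `label` is among the k highest-scoring class indices."""
    ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
    return label in ranked[:k]


def video_scores(clip_scores):
    """Average per-class scores over all clips of one video (assumed rule)."""
    n = len(clip_scores)
    return [sum(column) / n for column in zip(*clip_scores)]


# Toy data: one video, three clips, four classes; the true class is 2.
clips = [
    [0.1, 0.5, 0.3, 0.1],
    [0.1, 0.2, 0.6, 0.1],
    [0.2, 0.1, 0.6, 0.1],
]
label = 2

clip_hits = [hit_at_k(c, label, 1) for c in clips]  # one entry per clip
video_hit1 = hit_at_k(video_scores(clips), label, 1)
print(clip_hits, video_hit1)
```

In this toy case one clip is misclassified on its own, yet the averaged video score still ranks the true class first. This illustrates why video-level hit@1 can sit well above clip-level hit@1.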


This review was generated by AI and checked by a human editor.