QUICK REVIEW

[Paper Review] V4D: 4D Convolutional Neural Networks for Video-level Representation Learning

Shiwen Zhang, Sheng Guo|arXiv (Cornell University)|Feb 18, 2020

Human Pose and Action Recognition|References 31|Cited by 48

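One-sentence summary

V4D introduces a video-level 4D CNN that uses 4D convolutions and residual blocks to model long-range spatio-temporal evolution for video action recognition, outperforming clip-based 3D CNNs.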

ABSTRACT

Most existing 3D CNNs for video representation learning are clip-based methods, and thus do not consider video-level temporal evolution of spatio-temporal features. In this paper, we propose Video-level 4D Convolutional Neural Networks, referred as V4D, to model the evolution of long-range spatio-temporal representation with 4D convolutions, and at the same time, to preserve strong 3D spatio-temporal representation with residual connections. Specifically, we design a new 4D residual block able to capture inter-clip interactions, which could enhance the representation power of the original clip-level 3D CNNs. The 4D residual blocks can be easily integrated into the existing 3D CNNs to perform long-range modeling hierarchically. We further introduce the training and inference methods for the proposed V4D. Extensive experiments are conducted on three video recognition benchmarks, where V4D achieves excellent results, surpassing recent 3D CNNs by a large margin.

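Motivation and objectives

Advance video-level representation learning beyond clip-based 3D CNNs so that long-range temporal evolution is captured.
Propose 4D convolutions and residual 4D blocks that model interactions between clips within a single video-level representation.
Integrate the 4D blocks into existing 3D CNN backbones for hierarchical long-range modeling.
Develop training and video-level inference strategies tailored to V4D.
Demonstrate effectiveness on several benchmarks (Mini-Kinetics, Kinetics-400, Something-Something-v1).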

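Proposed method

Propose a video-level sampling strategy that divides a video into U sections and samples one action unit from each section.
Define 4D convolution over a video-level tensor V of shape (C, U, T, H, W), that is (channels, action units, frames per unit, height, width), to capture interactions between clips.
Build residual 4D convolution blocks by integrating 4D convolutions into a 3D CNN backbone with residual connections.
Use a permutation-based mechanism to align dimensions so that 4D blocks can be inserted into standard 3D CNNs.
Provide a video-level inference procedure that aggregates predictions over multiple sampled representations.
Explore different 4D kernel shapes (e.g., 3x3x3x3, 3x3x1x1) and placements (res3, res4, res5) to balance accuracy against parameter count.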
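
The sampling, permutation, and residual 4D convolution steps listed above can be illustrated with a small pure-Python sketch. This is not the authors' implementation: the function names (`sample_action_units`, `conv4d_same`, `residual_4d_block`), the choice of consecutive frames inside each section, the zero padding, and the stride of 1 are all assumptions made for illustration, and the nested-list loops are only practical for toy tensor sizes.

```python
import random
from itertools import product

def sample_action_units(num_frames, num_units, frames_per_unit, rng=None):
    """Split [0, num_frames) into num_units equal sections and draw one run of
    frames_per_unit consecutive frame indices from each (assumed scheme)."""
    rng = rng or random.Random()
    section = num_frames // num_units
    if section < frames_per_unit:
        raise ValueError("each section must hold at least frames_per_unit frames")
    units = []
    for u in range(num_units):
        lo = u * section
        start = rng.randint(lo, lo + section - frames_per_unit)  # inclusive bounds
        units.append(list(range(start, start + frames_per_unit)))
    return units

def zeros(shape):
    """Nested list of zeros with the given shape."""
    if len(shape) == 1:
        return [0.0] * shape[0]
    return [zeros(shape[1:]) for _ in range(shape[0])]

def shape_of(x):
    """Shape of a rectangular nested list."""
    dims = []
    while isinstance(x, list):
        dims.append(len(x))
        x = x[0]
    return tuple(dims)

def swap_first_two_axes(x):
    """(A, B, ...) -> (B, A, ...), e.g. (U, C, T, H, W) <-> (C, U, T, H, W)."""
    return [list(group) for group in zip(*x)]

def add(a, b):
    """Elementwise sum of two nested lists of identical shape."""
    if isinstance(a, list):
        return [add(p, q) for p, q in zip(a, b)]
    return a + b

def conv4d_same(x, w):
    """Naive 4D cross-correlation, zero padding, stride 1, odd kernel sizes.
    x: (C_in, U, T, H, W); w: (C_out, C_in, kU, kT, kH, kW)."""
    c_in, U, T, H, W = shape_of(x)
    c_out, _, kU, kT, kH, kW = shape_of(w)
    out = zeros((c_out, U, T, H, W))
    for co, u, t, h, v in product(range(c_out), range(U), range(T), range(H), range(W)):
        acc = 0.0
        for ci, du, dt, dh, dv in product(range(c_in), range(kU), range(kT), range(kH), range(kW)):
            uu, tt = u + du - kU // 2, t + dt - kT // 2
            hh, vv = h + dh - kH // 2, v + dv - kW // 2
            if 0 <= uu < U and 0 <= tt < T and 0 <= hh < H and 0 <= vv < W:
                acc += w[co][ci][du][dt][dh][dv] * x[ci][uu][tt][hh][vv]
        out[co][u][t][h][v] = acc
    return out

def residual_4d_block(clips, w):
    """clips: per-clip 3D features stacked as (U, C, T, H, W); same shape out.
    Requires C_out == C_in so the residual sum is well defined."""
    x = swap_first_two_axes(clips)     # (C, U, T, H, W) for the 4D convolution
    y = add(x, conv4d_same(x, w))      # identity path keeps the 3D features
    return swap_first_two_axes(y)      # back to (U, C, T, H, W)

# Toy usage: 3 action units, 1 channel, 1x1x1 features, kernel spanning only U.
clips = [[[[[1.0]]]], [[[[2.0]]]], [[[[3.0]]]]]
w = [[[[[[1.0]]] for _ in range(3)]]]          # shape (1, 1, 3, 1, 1, 1)
out = residual_4d_block(clips, w)
print([out[u][0][0][0][0] for u in range(3)])  # [4.0, 8.0, 8.0]
```

In the toy usage, each output value mixes a clip's own feature with those of its neighbouring action units, which is the inter-clip interaction the 4D kernel is meant to provide, while the residual term passes the original per-clip feature through unchanged.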

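Experimental results

Research questions

RQ1: Can 4D convolutions effectively model the long-range spatio-temporal evolution of videos for action recognition?
RQ2: Does integrating residual 4D blocks into a 3D CNN backbone improve video-level representations beyond clip-based methods?
RQ3: How do the number of action units (U) and the kernel configuration affect accuracy and efficiency?
RQ4: How does V4D compare with TSN and clip-based 3D CNNs across benchmarks (Mini-Kinetics, Kinetics-400, Something-Something-v1)?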

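Key findings

V4D with residual 4D blocks achieves higher accuracy than clip-based I3D-S and TSN baselines under comparable protocols (for example, V4D ResNet18 outperforms both I3D-S ResNet18 and TSN+I3D-S ResNet18 on Mini-Kinetics).
Kernel choice affects accuracy: 3x3x3x3 gives strong results, while the cheaper 3x3x1x1 remains competitive in practice.
Placing 4D blocks at res3 and res4 yields larger gains than other placements, and using blocks at both locations improves accuracy further.
Compared with several recent methods, V4D achieves competitive or better results on Kinetics-400 (77.4 top-1, 93.1 top-5 with V4D ResNet50) and Something-Something-v1 (50.4 top-1 with V4D ResNet50).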
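
To make the cost gap between the two kernel shapes concrete, the weight count of a dense 4D convolution can be worked out directly. The sketch below assumes a bias-free layer with 256 input and 256 output channels; that channel width and the helper name `conv4d_weight_count` are illustrative choices, not figures from the paper. The count depends only on the product of the kernel sizes, so it holds whichever axes the 3s and 1s refer to.

```python
from math import prod

def conv4d_weight_count(c_in, c_out, kernel_shape):
    """Number of weights (no bias) in a dense 4D convolution."""
    return c_in * c_out * prod(kernel_shape)

# Assumed channel width of 256 for both input and output.
full = conv4d_weight_count(256, 256, (3, 3, 3, 3))
slim = conv4d_weight_count(256, 256, (3, 3, 1, 1))
print(full, slim, full // slim)  # 5308416 589824 9
```

Under these assumptions the 3x3x1x1 kernel needs one ninth of the weights of the 3x3x3x3 kernel, which matches its description as the cheaper option.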

This review was generated by AI and reviewed by a human editor.