QUICK REVIEW

[Paper Review] Video Classification with Channel-Separated Convolutional Networks

Du Tran, Heng Wang|arXiv (Cornell University)|Apr 4, 2019

Human Pose and Action Recognition|References 42|Cited by 84

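One-Sentence Summary

This paper proposes Channel-Separated Convolutional Networks (CSNs) for video classification, which separate channel interactions from spatiotemporal interactions to improve accuracy while using 2-3x fewer FLOPs, and which match or outperform existing 3D CNNs on several datasets.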

ABSTRACT

Group convolution has been shown to offer great computational savings in various 2D convolutional architectures for image classification. It is natural to ask: 1) if group convolution can help to alleviate the high computational cost of video classification networks; 2) what factors matter the most in 3D group convolutional networks; and 3) what are good computation/accuracy trade-offs with 3D group convolutional networks. This paper studies the effects of different design choices in 3D group convolutional networks for video classification. We empirically demonstrate that the amount of channel interactions plays an important role in the accuracy of 3D group convolutional networks. Our experiments suggest two main findings. First, it is a good practice to factorize 3D convolutions by separating channel interactions and spatiotemporal interactions as this leads to improved accuracy and lower computational cost. Second, 3D channel-separated convolutions provide a form of regularization, yielding lower training accuracy but higher test accuracy compared to 3D convolutions. These two empirical findings lead us to design an architecture -- Channel-Separated Convolutional Network (CSN) -- which is simple, efficient, yet accurate. On Sports1M, Kinetics, and Something-Something, our CSNs are comparable with or better than the state-of-the-art while being 2-3 times more efficient.

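Motivation and Objectives

Reduce the computational cost of 3D CNNs for video classification without sacrificing accuracy.
Study the roles of channel interactions and spatiotemporal interactions in 3D group convolutions.
Propose CSNs, which factorize channel processing and spatiotemporal processing to gain efficiency and regularization.
Compare CSNs with state-of-the-art methods on large-scale video datasets to establish the gains in accuracy and efficiency.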

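Proposed Method

Introduce 3D channel-separated networks in which every convolution except conv1 is either a 1x1x1 convolution for channel interaction or a depthwise 3x3x3 convolution for local spatiotemporal interaction.
Define channel interaction and quantify it as the number of channel pairs that interact through a filter.
Present interaction-preserved (ip-CSN) and interaction-reduced (ir-CSN) bottleneck blocks and the corresponding architectures.
Compare conventional 3D convolutions with the CSN variants in terms of FLOPs, parameter count, and channel interactions.
Run ablations on Kinetics-400 to study how block design, depth, and channel interactions affect accuracy and regularization.
Evaluate CSNs on Sports1M, Kinetics-400, and Something-Something-v1, including fine-tuning from Sports1M.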
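
The comparison by parameters, FLOPs, and channel interactions can be made concrete with a small calculator. The following is a minimal sketch, not the authors' code. It assumes bias-free convolutions, counts FLOPs as multiply-adds, and applies one output size to every layer of a block. It also assumes the counting rule that each filter contributes one interaction per pair of input channels it sees. The block layouts are one reading of the method (ir-CSN swaps the bottleneck's 3x3x3 convolution for a depthwise 3x3x3; ip-CSN adds a 1x1x1 convolution before that depthwise layer) and are not spelled out in this summary. The channel widths (256 and 64), the output size, and all function names are hypothetical.

```python
from math import comb


def conv3d_cost(c_in, c_out, kernel, groups=1, out_positions=1):
    """Cost of one bias-free 3D convolution.

    Returns (params, multiply_adds, channel_interactions).
    """
    if c_in % groups or c_out % groups:
        raise ValueError("groups must divide both channel counts")
    kt, kh, kw = kernel
    in_per_filter = c_in // groups          # input channels seen by each filter
    params = c_out * in_per_filter * kt * kh * kw
    multiply_adds = params * out_positions  # one filter application per output voxel
    # Assumed counting rule: each filter links every pair of its input channels.
    interactions = c_out * comb(in_per_filter, 2)
    return params, multiply_adds, interactions


def block_cost(layers, out_positions):
    """Sum per-layer costs of a block given as (c_in, c_out, kernel, groups) tuples."""
    totals = [0, 0, 0]
    for c_in, c_out, kernel, groups in layers:
        cost = conv3d_cost(c_in, c_out, kernel, groups, out_positions)
        totals = [t + c for t, c in zip(totals, cost)]
    return tuple(totals)


# Hypothetical sizes: 256 block channels, 64 bottleneck channels.
C, B = 256, 64
POINT, CUBE = (1, 1, 1), (3, 3, 3)

# Assumed block layouts; groups == B makes the 3x3x3 convolution depthwise.
BLOCKS = {
    "standard 3D bottleneck": [
        (C, B, POINT, 1), (B, B, CUBE, 1), (B, C, POINT, 1)],
    "ir-CSN (assumed layout)": [
        (C, B, POINT, 1), (B, B, CUBE, B), (B, C, POINT, 1)],
    "ip-CSN (assumed layout)": [
        (C, B, POINT, 1), (B, B, POINT, 1), (B, B, CUBE, B), (B, C, POINT, 1)],
}

if __name__ == "__main__":
    positions = 8 * 56 * 56  # illustrative output size: 8 frames of 56x56
    for name, layers in BLOCKS.items():
        params, macs, interactions = block_cost(layers, positions)
        print(f"{name}: params={params:,} "
              f"multiply-adds={macs:,} interactions={interactions:,}")
```

Under these assumptions the ip-CSN layout ends up with the same interaction count as the standard bottleneck, while the ir-CSN layout loses the interactions that the dense 3x3x3 convolution provided. Both channel-separated blocks need roughly a quarter of the standard block's parameters. These are block-level figures for the chosen widths and should not be read as the network-level savings reported in the paper.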

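Experimental Results

Research Questions

RQ1: How do channel interactions affect the accuracy of 3D group convolutional networks for video classification?
RQ2: Can separating channel interactions from spatiotemporal interactions reduce computation while maintaining or improving accuracy?
RQ3: Do the interaction-preserved and interaction-reduced CSN variants offer a better accuracy/FLOPs trade-off than standard 3D CNNs?
RQ4: Does the channel-separated architecture provide a regularization benefit that improves generalization on video datasets?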

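Key Findings

With channel interactions preserved, CSNs reach accuracy comparable with or higher than state-of-the-art 3D CNNs while reducing FLOPs by roughly 2-3x.
The interaction-preserved CSN (ip-CSN) keeps the channel interactions and consistently outperforms the interaction-reduced CSN (ir-CSN) in deeper models.
Channel separation acts as a regularizer: compared with dense 3D convolutions, training error is higher but test error is lower.
The bottleneck-based (ir-CSN) design offers the best computation/accuracy trade-off among the block designs studied.
On Sports1M, Kinetics, and Something-Something-v1, CSNs are comparable with or better than prior methods and significantly faster.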

This review was generated by AI and checked by a human editor.