QUICK REVIEW

[Paper Review] More Is Less: Learning Efficient Video Representations by Big-Little Network and Depthwise Temporal Aggregation

Quanfu Fan, Chun-Fu Chen|arXiv (Cornell University)|Dec 2, 2019

Human Pose and Action Recognition | Cited by 90

One-Sentence Summary

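In brief: the paper introduces a lightweight, memory-efficient video architecture (bLVNet) that combines a two-branch "Big-Little" design with a compact Temporal Aggregation Module (TAM) to model temporal relationships without heavy 3D convolutions, achieving state-of-the-art results on Something-Something and Moments-in-Time while reducing FLOPs and memory.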

ABSTRACT

Current state-of-the-art models for video action recognition are mostly based on expensive 3D ConvNets. This results in a need for large GPU clusters to train and evaluate such architectures. To address this problem, we present a lightweight and memory-friendly architecture for action recognition that performs on par with or better than current architectures by using only a fraction of resources. The proposed architecture is based on a combination of a deep subnet operating on low-resolution frames with a compact subnet operating on high-resolution frames, allowing for high efficiency and accuracy at the same time. We demonstrate that our approach achieves a reduction by $3\sim4$ times in FLOPs and $\sim2$ times in memory usage compared to the baseline. This enables training deeper models with more input frames under the same computational budget. To further obviate the need for large-scale 3D convolutions, a temporal aggregation module is proposed to model temporal dependencies in a video at very small additional computational costs. Our models achieve strong performance on several action recognition benchmarks including Kinetics, Something-Something and Moments-in-time. The code and models are available at https://github.com/IBM/bLVNet-TAM.

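Research Motivation and Goals

Reduce the computational cost and memory footprint of video action recognition without sacrificing accuracy.
Enable training with deeper backbones and more input frames under the same hardware budget.
Develop a temporal aggregation mechanism that efficiently captures both short-term and long-term temporal dependencies.
Enable effective temporal modeling without resorting to expensive 3D convolutions.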

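Proposed Method

Propose Big-Little Video Net (bLVNet): a two-branch network in which a deep, high-capacity branch (Big-Net) processes low-resolution frames and a compact branch (Little-Net) processes high-resolution frames.
Fuse the two branches at every layer, merging multi-scale features and processing frames more efficiently than the baseline TSN variants.
Introduce the Temporal Aggregation Module (TAM): a lightweight, learnable module built on depthwise 1x1 convolutions that performs channel-wise weighted aggregation across a temporal window to model both short-term and long-term dependencies.
TAM consists of three operations: (i) depthwise 1x1 convolutions that learn per-channel weights, (ii) temporal shifting of the feature maps, and (iii) aggregation within the temporal window with a ReLU activation.
TAM is designed to be independent of the spatial convolutions, adds almost no parameters or computation, and can be integrated with either 2D or 3D backbones.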
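
The three TAM operations listed above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: it assumes a depthwise 1x1 convolution reduces to one scalar weight per channel, treats a single spatial location only (the module is described as independent of spatial convolutions), and zero-pads at the clip boundaries. The function name `temporal_aggregation`, the `weights` layout, and the padding choice are all hypothetical.

```python
def temporal_aggregation(frames, weights):
    """Sketch of TAM-style aggregation at one spatial location.

    frames:  list of T frames, each a list of C channel activations.
    weights: dict mapping a temporal offset j to a list of C per-channel
             weights (assumption: this stands in for a depthwise 1x1 conv).
    Returns a list of T frames with C aggregated channel activations each.
    """
    num_frames = len(frames)
    num_channels = len(frames[0])
    output = []
    for t in range(num_frames):
        acc = [0.0] * num_channels
        for offset, channel_weights in weights.items():
            source = t + offset  # temporal shift: read the neighbouring frame
            if 0 <= source < num_frames:  # assumed zero padding outside the clip
                for c in range(num_channels):
                    # per-channel weighting of the shifted frame
                    acc[c] += channel_weights[c] * frames[source][c]
        # aggregate over the window, then apply ReLU
        output.append([max(0.0, value) for value in acc])
    return output


# Toy clip: 3 frames, 2 channels, window of offsets {-1, 0, +1}.
clip = [[1.0, -1.0], [2.0, 2.0], [3.0, -5.0]]
window_weights = {-1: [0.5, 0.0], 0: [1.0, 1.0], 1: [0.0, 1.0]}
aggregated = temporal_aggregation(clip, window_weights)
print(aggregated)  # [[1.0, 1.0], [2.5, 0.0], [4.0, 0.0]]
```

Because each offset contributes only one weight per channel, the parameter count in this sketch grows with the window size times the number of channels, which is consistent with the claim that the module adds very little on top of the backbone.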

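Experimental Results

Research Questions

RQ1: Can the two-branch Big-Little network match or exceed the action recognition accuracy of 3D CNN baselines while reducing FLOPs and memory?
RQ2: Does the lightweight Temporal Aggregation Module (TAM) improve temporal modeling in the two-branch video network beyond what local fusion provides?
RQ3: How does increasing the number of input frames affect the accuracy and efficiency of the proposed bLVNet-TAM architecture?
RQ4: Is TAM more effective than existing temporal shift methods at temporal modeling on challenging datasets such as Something-Something?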

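Key Findings

bLVNet-TAM achieves strong performance with significantly lower FLOPs and memory than strong baselines, which makes it possible to use deeper backbones and more input frames on a single compute node.
TAM provides a clear improvement over the Temporal Shift Module (TSM) and is complementary to local fusion, improving accuracy on Something-Something.
On Something-Something, bLVNet-TAM with a deeper backbone (bLResNet-101) and more frames sets a new state of the art in the RGB-only setting.
On Moments-in-Time, the method outperforms both single-stream and ensemble baselines in top-1 accuracy.
Across benchmarks, more input frames generally improve bLVNet-TAM's performance, while memory usage remains favorable compared with TSN-based architectures.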
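
The efficiency findings rest on a general property of convolution cost, which a short calculation can illustrate. The sketch below uses the standard multiply-accumulate count for a single 2D convolution; the layer sizes and the halving factors are illustrative assumptions rather than values from the paper, and the overall 3 to 4 times FLOPs reduction reported in the abstract depends on the full architecture, not on one layer.

```python
def conv_macs(height, width, c_in, c_out, kernel=3):
    """Multiply-accumulate count of one stride-1, same-padded 2D convolution."""
    return height * width * c_in * c_out * kernel * kernel


# Hypothetical reference layer: 56x56 feature map, 64 -> 64 channels.
full = conv_macs(56, 56, 64, 64)

# Deep branch on low-resolution input: half the height and width (assumed factor).
low_res = conv_macs(28, 28, 64, 64)

# Compact branch on high-resolution input: half the channels (assumed factor).
narrow = conv_macs(56, 56, 32, 32)

print(full / low_res)  # 4.0
print(full / narrow)   # 4.0
```

Halving the spatial resolution or halving the channel width each cuts the cost of such a layer by a factor of four, which is why pairing a deep low-resolution branch with a compact high-resolution branch can be cheap in both paths.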


This review was generated by AI and reviewed by a human editor.