QUICK REVIEW

[Paper Review] More Is Less: Learning Efficient Video Representations by Big-Little Network and Depthwise Temporal Aggregation

Quanfu Fan, Chun-Fu Chen|arXiv (Cornell University)|Dec 2, 2019

Human Pose and Action Recognition|Cited by 90

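One-Line Summary

Introduces bLVNet, a lightweight, memory-efficient video architecture. A dual-path Big-Little design and a compact Temporal Aggregation Module (TAM) model temporal relationships without heavy 3D convolutions, achieving state-of-the-art results on Something-Something and Moments-in-Time while reducing FLOPs and memory.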

ABSTRACT

Current state-of-the-art models for video action recognition are mostly based on expensive 3D ConvNets. This results in a need for large GPU clusters to train and evaluate such architectures. To address this problem, we present a lightweight and memory-friendly architecture for action recognition that performs on par with or better than current architectures by using only a fraction of resources. The proposed architecture is based on a combination of a deep subnet operating on low-resolution frames with a compact subnet operating on high-resolution frames, allowing for high efficiency and accuracy at the same time. We demonstrate that our approach achieves a reduction by $3\sim4$ times in FLOPs and $\sim2$ times in memory usage compared to the baseline. This enables training deeper models with more input frames under the same computational budget. To further obviate the need for large-scale 3D convolutions, a temporal aggregation module is proposed to model temporal dependencies in a video at very small additional computational costs. Our models achieve strong performance on several action recognition benchmarks including Kinetics, Something-Something and Moments-in-time. The code and models are available at https://github.com/IBM/bLVNet-TAM.
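The two-resolution idea in the abstract can be illustrated with a toy data-flow sketch. This is not the authors' implementation: `big_branch` and `little_branch` are hypothetical stand-ins for real convolutional stacks, the 2x downsampling factor is an assumption, merging by upsample-and-add is an assumption, and feeding the same frame to both branches is a simplification. The point is only that the deep path runs on a frame with one quarter of the pixels, which is where most of the savings would come from.

```python
def downsample2x(img):
    """2x2 average pooling on a 2D list with even height and width."""
    return [[(img[y][x] + img[y][x + 1] + img[y + 1][x] + img[y + 1][x + 1]) / 4.0
             for x in range(0, len(img[0]), 2)]
            for y in range(0, len(img), 2)]

def upsample2x(img):
    """Nearest-neighbour upsampling by a factor of 2."""
    out = []
    for row in img:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def big_branch(img, depth=4):
    """Hypothetical deep path: many layers, applied to the low-resolution frame."""
    for _ in range(depth):
        img = [[max(0.0, 0.5 * v + 1.0) for v in row] for row in img]
    return img

def little_branch(img):
    """Hypothetical compact path: a single layer on the full-resolution frame."""
    return [[max(0.0, v) for v in row] for row in img]

def big_little_block(frame):
    """Run both paths and merge by upsample-and-add (an assumed merge rule)."""
    big = upsample2x(big_branch(downsample2x(frame)))
    little = little_branch(frame)
    return [[b + l for b, l in zip(b_row, l_row)]
            for b_row, l_row in zip(big, little)]

frame = [[1.0, 3.0], [5.0, 7.0]]
print(big_little_block(frame))  # [[3.125, 5.125], [7.125, 9.125]]
```

In this sketch the output keeps the resolution of the high-resolution path, so a following block could consume it unchanged.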

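Motivation and Objectives

Reduce the computational cost and memory footprint of video action recognition without sacrificing accuracy.
Enable training with deeper backbones and more input frames within the same hardware budget.
Develop a temporal aggregation mechanism that efficiently captures short- and long-range temporal dependencies.
Enable effective temporal modeling without expensive 3D convolutions.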

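Proposed Method

Proposes the Big-Little Video Net (bLVNet), a dual-path network in which a deep, high-capacity branch (Big-Net) processes low-resolution frames and a compact branch (Little-Net) processes high-resolution frames.
Fuses the two branches at each layer to integrate multi-scale features, making it possible to process more frames efficiently than the baseline TSN variants.
Introduces the Temporal Aggregation Module (TAM), a lightweight, learnable module built on depthwise 1x1 convolutions that performs channel-wise weighted aggregation across a temporal window to model short- and long-range dependencies.
TAM's operation consists of (i) 1x1 depthwise convolutions that learn per-channel weights, (ii) temporal shifting of the feature maps, and (iii) aggregation over the whole temporal window followed by ReLU activation.
TAM is designed to be independent of spatial convolutions, adds almost no parameters or computation, and can be integrated with 2D or 3D backbones.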
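The three TAM steps listed above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the released code: the function name `tam`, the window of three offsets, the weight values, and the zero padding at clip boundaries are all hypothetical. Spatial dimensions are omitted because a 1x1 depthwise convolution applies the same per-channel scalar at every spatial location, so one location is enough to show the mechanics.

```python
def tam(frames, weights):
    """Toy temporal aggregation at a single spatial location.

    frames:  list of T frames, each a list of C channel values.
    weights: dict mapping a temporal offset j to a list of C per-channel
             weights (the 1x1 depthwise convolution for that offset).
    """
    num_frames = len(frames)
    num_channels = len(frames[0])
    out = []
    for t in range(num_frames):
        acc = [0.0] * num_channels
        for j, w in weights.items():
            s = t + j  # temporal shift: read the frame at offset j
            if 0 <= s < num_frames:  # assumed zero padding at clip boundaries
                for c in range(num_channels):
                    acc[c] += w[c] * frames[s][c]  # per-channel weighting
        # aggregate over the window, then apply ReLU
        out.append([max(0.0, v) for v in acc])
    return out

# Three frames, two channels, window of offsets {-1, 0, +1} (all hypothetical).
frames = [[1.0, -1.0], [2.0, 0.5], [3.0, 2.0]]
weights = {-1: [0.5, 0.0], 0: [1.0, 1.0], 1: [0.5, -1.0]}
print(tam(frames, weights))  # [[2.0, 0.0], [4.0, 0.0], [4.0, 2.0]]
```

The sketch uses one weight per channel per temporal offset and mixes nothing across channels or spatial positions, which is consistent with the claim that such a module adds very few parameters.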

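Experimental Results

Research Questions

RQ1: Can a two-branch Big-Little network match or exceed the action recognition accuracy of 3D CNN baselines while reducing FLOPs and memory?
RQ2: Does a lightweight Temporal Aggregation Module (TAM) improve temporal modeling beyond local fusion in a dual-path video network?
RQ3: How do performance and efficiency change as the number of input frames increases in the proposed bLVNet-TAM architecture?
RQ4: On challenging datasets (e.g., Something-Something), is TAM more effective for temporal modeling than the existing Temporal Shift Module (TSM)?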

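Key Findings

bLVNet-TAM achieves high performance with far fewer FLOPs and much less memory than strong baselines, enabling deeper backbones and more input frames on a single compute node.
The Temporal Aggregation Module (TAM) provides a clear gain over the Temporal Shift Module (TSM) and complements local fusion, improving accuracy on Something-Something.
On Something-Something, bLVNet-TAM with a deeper backbone (bLResNet-101) and more frames sets a new state of the art in the RGB-only setting.
On Moments-in-Time, the method outperforms single-stream and ensemble baselines in top-1 accuracy.
Across benchmarks, more input frames generally improve bLVNet-TAM's performance, while memory usage remains favorable compared with TSN-based architectures.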


This review was generated by AI and checked by a human editor.