QUICK REVIEW

[Paper Review] TEA: Temporal Excitation and Aggregation for Action Recognition

Yan Li, Bin Ji|arXiv (Cornell University)|Apr 3, 2020

Human Pose and Action Recognition|References: 51|Citations: 38

One-Line Summary

TEA introduces a Motion Excitation (ME) module and a Multiple Temporal Aggregation (MTA) module to model short-range and long-range temporal structure efficiently, and integrates both into a ResNet backbone.

ABSTRACT

Temporal modeling is key for action recognition in videos. It normally considers both short-range motions and long-range aggregations. In this paper, we propose a Temporal Excitation and Aggregation (TEA) block, including a motion excitation (ME) module and a multiple temporal aggregation (MTA) module, specifically designed to capture both short- and long-range temporal evolution. In particular, for short-range motion modeling, the ME module calculates the feature-level temporal differences from spatiotemporal features. It then utilizes the differences to excite the motion-sensitive channels of the features. The long-range temporal aggregations in previous works are typically achieved by stacking a large number of local temporal convolutions. Each convolution processes a local temporal window at a time. In contrast, the MTA module proposes to deform the local convolution to a group of sub-convolutions, forming a hierarchical residual architecture. Without introducing additional parameters, the features will be processed with a series of sub-convolutions, and each frame could complete multiple temporal aggregations with neighborhoods. The final equivalent receptive field of temporal dimension is accordingly enlarged, which is capable of modeling the long-range temporal relationship over distant frames. The two components of the TEA block are complementary in temporal modeling. Finally, our approach achieves impressive results at low FLOPs on several action recognition benchmarks, such as Kinetics, Something-Something, HMDB51, and UCF101, which confirms its effectiveness and efficiency.
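The receptive-field claim in the abstract can be checked with a small toy model. The sketch below is an illustration only, not the authors' implementation: it assumes four channel groups and a temporal kernel of size 3 (neither is stated in this review), and it replaces the learned sub-convolution weights with an all-ones kernel. The names `temporal_conv`, `hierarchical_aggregation`, and `support` are hypothetical. An impulse at one frame is pushed through the chain, and the number of frames it reaches in each group is counted.

```python
def temporal_conv(signal, kernel_size=3):
    """1D 'same' convolution over time with an all-ones kernel and zero padding.

    The all-ones kernel is a stand-in for learned weights (assumption)."""
    half = kernel_size // 2
    n = len(signal)
    return [
        sum(signal[j] for j in range(max(0, t - half), min(n, t + half + 1)))
        for t in range(n)
    ]


def hierarchical_aggregation(groups, kernel_size=3):
    """Chain of sub-convolutions with residual links between channel groups.

    The first group passes through unchanged; every later group is added to
    the previous group's output before its own temporal convolution."""
    outputs = [groups[0]]
    prev = None
    for group in groups[1:]:
        inp = group if prev is None else [a + b for a, b in zip(group, prev)]
        prev = temporal_conv(inp, kernel_size)
        outputs.append(prev)
    return outputs


def support(signal):
    """Number of frames with a non-zero response."""
    return sum(1 for v in signal if v != 0)


T = 16  # number of frames (assumed for illustration)
impulse = [1.0 if t == T // 2 else 0.0 for t in range(T)]
groups = [list(impulse) for _ in range(4)]  # four channel groups (assumed)

outputs = hierarchical_aggregation(groups)
receptive_fields = [support(o) for o in outputs]
print(receptive_fields)  # [1, 3, 5, 7]
```

Under these assumptions the last group sees 7 frames, while a single temporal convolution of size 3 sees only 3, which is the sense in which the equivalent receptive field is enlarged.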

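Research Motivation and Objectives

Motivate temporal modeling for video action recognition that is robust at both short and long range.
Integrate motion-aware feature excitation into spatiotemporal feature learning.
Enlarge the temporal receptive field efficiently, without additional parameters.
Demonstrate efficiency and effectiveness on common benchmarks.
Provide a modular TEA block that can be inserted into ResNet architectures.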

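Proposed Method

Proposes Motion Excitation (ME), which computes feature-level temporal differences and uses them to excite motion-sensitive channels through a residual connection.
Proposes Multiple Temporal Aggregation (MTA), which deforms the local temporal convolution into a chain of sub-convolutions across channel groups, enlarging the temporal receptive field without additional parameters.
Integrates ME and MTA into a ResNet block to form the TEA block, then stacks TEA blocks to build the video model.
Makes video-level predictions with a 2D CNN backbone (ResNet-50), using sparse temporal sampling (T frames) and simple temporal pooling.
Compares against 2D/(2+1)D baselines and prior state-of-the-art methods on Something-Something V1, Kinetics-400, HMDB51, and UCF101.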
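The Motion Excitation idea in the first item can be sketched in a few lines of plain Python. This is a simplified illustration, not the paper's module: it works on spatially pooled per-channel values, omits any learned convolutions or channel reduction, and assumes the attention is `2 * sigmoid(diff) - 1` so that zero motion gives zero attention. The function name `motion_excitation` and the toy input are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def motion_excitation(features):
    """features: list of T frames, each a list of C pooled channel values.

    For each frame, the difference to the next frame drives a per-channel
    attention weight, and the output is input + input * attention
    (the residual connection keeps the original feature)."""
    num_frames = len(features)
    num_channels = len(features[0])
    excited = []
    for t in range(num_frames):
        if t + 1 < num_frames:
            diff = [features[t + 1][c] - features[t][c] for c in range(num_channels)]
        else:
            diff = [0.0] * num_channels  # last frame: no next frame, zero motion
        # Assumed rescaling: maps zero difference to zero attention.
        attention = [2.0 * sigmoid(d) - 1.0 for d in diff]
        excited.append(
            [features[t][c] + features[t][c] * attention[c] for c in range(num_channels)]
        )
    return excited


# Toy input: channel 0 is static, channel 1 changes over time.
frames = [
    [1.0, 1.0],
    [1.0, 2.0],
    [1.0, 4.0],
]
out = motion_excitation(frames)
static_channel = [row[0] for row in out]
moving_channel = [row[1] for row in out]
print(static_channel)  # static values pass through unchanged
print(moving_channel)  # changing values are amplified where motion is present
```

In this toy setting the static channel is returned as-is, which mirrors the role of the residual connection in preserving static scene information, while the changing channel is scaled up in frames followed by motion.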

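Experimental Results

Research Questions

RQ1: How can short-range motion be encoded effectively within spatiotemporal feature learning, without explicit optical flow?
RQ2: Can a lightweight module cascade local temporal operations to capture long-range temporal dependencies efficiently?
RQ3: Do ME and MTA complement each other and improve action recognition performance while remaining computationally efficient?
RQ4: How does TEA perform on standard benchmarks compared with prior 2D, (2+1)D, and 3D CNN approaches?

Key Findings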

Method	Dataset	Backbone	Frames × Crops × Clips	FLOPs	Pre-train	Top-1 (Val)	Top-5 (Val)	Test Top-1
TEA (Ours)	Something-Something V1	ResNet50	8 × 1 × 1	35G	ImageNet	48.9	78.1	-
TEA (Ours)	Kinetics-400	ResNet50	8 × 3 × 10	35G	ImageNet	75.0	91.8	-
TEA (Ours)	Kinetics-400	ResNet50	16 × 3 × 10	70G	ImageNet	76.1	92.5	-
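The "Frames × Crops × Clips" column determines how many views are evaluated per video at test time. The helper below is a small arithmetic sketch; `evaluation_cost` is a hypothetical name, and treating the FLOPs column as a per-view cost is an assumption that this review does not confirm.

```python
def evaluation_cost(frames, crops, clips, flops_per_view_g):
    """Return (views, total_frames, total_gflops) for one video.

    Assumption: flops_per_view_g is the cost of a single view (one crop of
    one clip), so the total scales with crops * clips."""
    views = crops * clips
    total_frames = frames * views
    total_gflops = flops_per_view_g * views
    return views, total_frames, total_gflops


single_view = evaluation_cost(frames=8, crops=1, clips=1, flops_per_view_g=35)
multi_view_8 = evaluation_cost(frames=8, crops=3, clips=10, flops_per_view_g=35)
multi_view_16 = evaluation_cost(frames=16, crops=3, clips=10, flops_per_view_g=70)

print(single_view)    # (1, 8, 35)
print(multi_view_8)   # (30, 240, 1050)
print(multi_view_16)  # (30, 480, 2100)
```

If the per-view assumption holds, the multi-view settings cost 30 times as much per video as the single-view setting at the same frame count, which matters when comparing rows of the table.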

TEA achieves 48.9% Top-1 on Something-Something V1 with 8 frames and 1 crop (the 8x1x1 setting).
TEA outperforms the (2+1)D ResNet and SENet baselines; ME brings a notable improvement, and its residual connection preserves static scene information.
Adding MTA brings further gains: on Something-Something V1, TEA reaches 48.9% Top-1 at 8x1x1, 51.7% at 8x3x10, and 52.3% at 16x3x10.
On Something-Something V1, TEA with 8 frames and 1 crop reaches 48.9% Top-1 and 78.1% Top-5. On Kinetics-400, it reaches 75.0% Top-1 and 91.8% Top-5 with 8x3x10, and 76.1% Top-1 and 92.5% Top-5 with 16x3x10.
Compared to several state-of-the-art methods on Something-Something V1, TEA at 8x3x10 (51.7% Top-1) and 16x3x10 (52.3% Top-1) outperforms many 2D/(2+1)D baselines at similar FLOPs and is competitive with 3D CNN-based models.
On Kinetics-400, TEA with 16x3x10 achieves 76.1% Top-1, which is below SlowFast but competitive among efficient 2D/2+1D approaches.

This review was generated by AI and checked by a human editor.