QUICK REVIEW

[Paper Review] TSM: Temporal Shift Module for Efficient Video Understanding

Ji Lin, Chuang Gan|arXiv (Cornell University)|Nov 20, 2018

Human Pose and Action Recognition|References: 68|Citations: 166

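One-Sentence Summary

TSM introduces a lightweight Temporal Shift Module that shifts part of the feature channels along the temporal dimension, achieving 3D-CNN-level accuracy at 2D-CNN complexity, and extends to online, low-latency video tasks.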

ABSTRACT

The explosive growth in video streaming gives rise to challenges on performing video understanding at high accuracy and low computation cost. Conventional 2D CNNs are computationally cheap but cannot capture temporal relationships; 3D CNN based methods can achieve good performance but are computationally intensive, making it expensive to deploy. In this paper, we propose a generic and effective Temporal Shift Module (TSM) that enjoys both high efficiency and high performance. Specifically, it can achieve the performance of 3D CNN but maintain 2D CNN's complexity. TSM shifts part of the channels along the temporal dimension; thus facilitate information exchanged among neighboring frames. It can be inserted into 2D CNNs to achieve temporal modeling at zero computation and zero parameters. We also extended TSM to online setting, which enables real-time low-latency online video recognition and video object detection. TSM is accurate and efficient: it ranks the first place on the Something-Something leaderboard upon publication; on Jetson Nano and Galaxy Note8, it achieves a low latency of 13ms and 35ms for online video recognition. The code is available at: https://github.com/mit-han-lab/temporal-shift-module.

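Motivation and Objectives

Motivate efficient video understanding that combines high accuracy with low computation cost for real-world deployment.
Develop a mechanism that models temporal information without adding computation or parameters when inserted into a 2D CNN.
Support both offline high-accuracy and online low-latency video recognition scenarios.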

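Proposed Method

Proposes the Temporal Shift Module (TSM), which shifts part of the channels along the temporal dimension to mix in information from neighboring frames.
Places TSM inside the residual branch (residual shift), preserving spatial feature learning for the current frame while enabling temporal fusion.
Uses bidirectional TSM for offline video understanding and unidirectional TSM for online real-time processing.
Applies a partial shift (e.g., 1/4 of the channels) to minimize data movement and latency while retaining temporal modeling capacity.
Keeps computation and parameter count equal to the 2D CNN backbone, and demonstrates hardware efficiency by showing applicability to edge devices.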
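
The shift operation described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `shift_fraction` parameter, the `(T, C, H, W)` layout, and the zero-padding at clip boundaries are all choices made for this example, and `spatial_fn` is a hypothetical stand-in for the per-frame 2D convolutions of a residual branch.

```python
import numpy as np


def temporal_shift(x, shift_fraction=0.25, bidirectional=True):
    """Shift a fraction of channels along time.

    x: array of shape (T, C, H, W). Vacated positions are zero-filled
    (an assumption of this sketch).
    """
    c = x.shape[1]
    n_shift = int(c * shift_fraction)
    out = np.zeros_like(x)
    if bidirectional:
        fold = n_shift // 2
        # Group 1: frame t receives these channels from frame t + 1.
        out[:-1, :fold] = x[1:, :fold]
        # Group 2: frame t receives these channels from frame t - 1.
        out[1:, fold:2 * fold] = x[:-1, fold:2 * fold]
        # All remaining channels stay in place.
        out[:, 2 * fold:] = x[:, 2 * fold:]
    else:
        # Unidirectional: only past frames feed the current frame.
        out[1:, :n_shift] = x[:-1, :n_shift]
        out[:, n_shift:] = x[:, n_shift:]
    return out


def residual_shift_block(x, spatial_fn, **shift_kwargs):
    """Residual shift: the identity path keeps the unshifted features,
    and only the residual branch sees the shifted input."""
    return x + spatial_fn(temporal_shift(x, **shift_kwargs))
```

Calling `residual_shift_block(x, spatial_fn)` with any shape-preserving per-frame function shows why the residual placement helps: the identity path still carries every channel of the current frame, so shifting costs no spatial information, and the shift itself is pure data movement with no multiplications or parameters.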

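Experimental Results

Research Questions

RQ1: How can temporal information be incorporated into a 2D CNN without adding computation or parameters?
RQ2: How do shifting only part of the channels and embedding the shift in residual blocks affect accuracy and efficiency?
RQ3: Can the proposed TSM deliver both offline high-accuracy and online low-latency video understanding on edge devices?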

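Key Findings

TSM substantially improves over 2D CNN baselines on temporally focused datasets with no additional computation.
Bidirectional TSM achieves state-of-the-art results on the Something-Something dataset while maintaining 2D CNN efficiency.
Unidirectional TSM enables online low-latency video recognition with minimal memory and almost no added latency.
TSM offers strong hardware efficiency, with a better accuracy-FLOPs trade-off than 3D CNNs and other efficient video models.
TSM generalizes to online video object detection, improving mAP over the 2D baseline with almost no added latency.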
On edge devices, TSM achieves low latency for online video recognition: 13 ms on Jetson Nano and 35 ms on Galaxy Note8.
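
To make the online finding concrete, here is a rough sketch of how a unidirectional shift could run frame by frame with a small cache. It is an illustration under assumptions rather than the paper's code: the class name `OnlineShift`, the `shift_fraction` default, the `(C, H, W)` frame layout, and the zero-fill for the first frame are all hypothetical choices for this example.

```python
import numpy as np


class OnlineShift:
    """Streaming unidirectional shift: each incoming frame has its first
    channels replaced by the same channels cached from the previous frame."""

    def __init__(self, shift_fraction=0.25):
        self.shift_fraction = shift_fraction
        self.cache = None  # shifted channels of the previous frame

    def __call__(self, frame):
        # frame: array of shape (C, H, W) for the current time step.
        n_shift = int(frame.shape[0] * self.shift_fraction)
        out = frame.copy()
        if self.cache is None:
            # No past frame yet: zero-fill (an assumption of this sketch).
            out[:n_shift] = 0
        else:
            out[:n_shift] = self.cache
        # Only the shifted slice is kept for the next frame.
        self.cache = frame[:n_shift].copy()
        return out
```

Because only the shifted slice of one frame is stored per shift layer, the extra memory is a fraction of a single feature map, and each new frame needs just one copy in and one copy out, which is consistent with the low-memory, low-latency behavior reported above.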

This review was generated by AI and checked by a human editor.