QUICK REVIEW

[Paper Review] TemporalMaxer: Maximize Temporal Context with only Max Pooling for Temporal Action Localization

Tuan N. Tang, Kwonyoung Kim|arXiv (Cornell University)|Mar 16, 2023

Human Pose and Action Recognition | Citations: 21

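One-sentence summary

TemporalMaxer uses a simple, parameter-free max-pooling block that maximizes local temporal information from pre-extracted 3D-CNN features, outperforming long-term TCM methods on TAL while running faster and using fewer parameters.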

ABSTRACT

Temporal Action Localization (TAL) is a challenging task in video understanding that aims to identify and localize actions within a video sequence. Recent studies have emphasized the importance of applying long-term temporal context modeling (TCM) blocks to the extracted video clip features such as employing complex self-attention mechanisms. In this paper, we present the simplest method ever to address this task and argue that the extracted video clip features are already informative to achieve outstanding performance without sophisticated architectures. To this end, we introduce TemporalMaxer, which minimizes long-term temporal context modeling while maximizing information from the extracted video clip features with a basic, parameter-free, and local region operating max-pooling block. Picking out only the most critical information for adjacent and local clip embeddings, this block results in a more efficient TAL model. We demonstrate that TemporalMaxer outperforms other state-of-the-art methods that utilize long-term TCM such as self-attention on various TAL datasets while requiring significantly fewer parameters and computational resources. The code for our approach is publicly available at https://github.com/TuanTNG/TemporalMaxer

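Motivation and objectives

Motivate a minimalist approach to temporal action localization (TAL) by questioning whether heavyweight long-term temporal context modeling (TCM) is necessary.
Investigate whether pre-extracted 3D-CNN features, combined with a simple local max-pooling block, already contain the information needed for accurate TAL.
Develop TemporalMaxer as a parameter-free local context module that replaces expensive attention-based TCM blocks.
Evaluate TemporalMaxer across standard TAL benchmarks, comparing its accuracy and inference speed against Transformer-based and graph-based long-term TCM methods.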

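Proposed method

Extract clip features with a pretrained 3D-CNN to form the sequence X.
Build a multi-scale temporal feature pyramid Z using two 1D convolutional projections and L-1 TemporalMaxer blocks (max pooling with stride 2) between pyramid levels.
Decode with a lightweight head consisting of a classification branch and a regression branch, shared across pyramid levels.
Train with a multi-task loss that combines a focal classification loss and a DIoU regression loss, applied at all levels and using a positive-sample indicator.
Keep the kernel size of the TCM block fixed at 3; the ablations compare it against Conv, subsampling, average pooling, and Transformer blocks.
Aim for a simple, non-parametric backbone in which max pooling preserves discriminative local information while exploiting the receptive field of a deep network.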
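
The pyramid construction described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' implementation: the function names `temporal_max_pool` and `build_pyramid` are hypothetical, the toy features are made up, the two 1D convolutional projections are omitted, and the padding of one step on each side is an assumption chosen so that each level halves the sequence length.

```python
from typing import List

Sequence = List[List[float]]  # T clip embeddings, each with C channels


def temporal_max_pool(x: Sequence, kernel: int = 3, stride: int = 2) -> Sequence:
    """Parameter-free max pooling along time, applied per channel."""
    pad = kernel // 2  # assumed padding; out-of-range positions are simply ignored
    t_len = len(x)
    channels = len(x[0])
    out = []
    for center in range(0, t_len, stride):
        lo = max(0, center - pad)
        hi = min(t_len, center + pad + 1)
        # Keep the largest activation of each channel inside the local window.
        out.append([max(x[t][c] for t in range(lo, hi)) for c in range(channels)])
    return out


def build_pyramid(x: Sequence, num_levels: int) -> List[Sequence]:
    """Return num_levels levels, with one pooling block between consecutive levels."""
    levels = [x]
    for _ in range(num_levels - 1):
        levels.append(temporal_max_pool(levels[-1]))
    return levels


# Toy input: 8 time steps, 2 channels (values are made up for illustration).
features = [[float(t), float(-t)] for t in range(8)]
pyramid = build_pyramid(features, num_levels=3)
print([len(level) for level in pyramid])  # [8, 4, 2]
```

Because the block has no learnable weights, the only parameters in such a backbone would come from the projections and the shared head, which is consistent with the low parameter count the review reports.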

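Experimental results

Research questions

RQ1: Given high-quality pre-extracted features, is a parameter-free max-pooling TCM block sufficient to maximize temporal context for TAL?
RQ2: Can TemporalMaxer achieve competitive or better TAL performance than Transformer-based and graph-based long-term TCM methods with far fewer parameters and lower computational cost?
RQ3: How does TemporalMaxer perform against state-of-the-art baselines on standard TAL datasets (THUMOS14, EPIC-Kitchens 100, MultiTHUMOS, MUSES)?
RQ4: How much do different kernel sizes of the max-pooling TCM block affect TAL performance and efficiency?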

Key findings

Model	Features	mAP@0.3	mAP@0.4	mAP@0.5	mAP@0.6	mAP@0.7	Avg. mAP	Time (ms)
ActionFormer [60]	I3D [7]	82.1	77.8	71.0	59.4	43.9	66.8	80
Ours (TemporalMaxer)	I3D [7]	82.8	78.9	71.8	60.5	44.7	67.7	50

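On THUMOS14, TemporalMaxer achieves 67.7 mAP averaged over tIoU thresholds, surpassing prior methods, including long-term TCM methods.
TemporalMaxer reduces backbone computation and speeds up inference: on THUMOS14 it takes 50 ms per video, compared with 80 ms for the ActionFormer baseline.
On EPIC-Kitchens 100, TemporalMaxer achieves an average mAP of 24.5% for verbs and 22.8% for nouns, exceeding the ActionFormer baseline by about 1.0 and 0.9 points, respectively.
On MUSES, TemporalMaxer reaches an average mAP of 27.2, surpassing earlier long-term TCM methods.
On MultiTHUMOS, TemporalMaxer achieves an average mAP of 29.9%, markedly outperforming the PointTAD and ActionFormer baselines.
Ablation studies show that max pooling outperforms Conv, subsampling, and average pooling as the TCM block, and that a kernel size of 3 gives the best performance while remaining efficient.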
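
As a quick consistency check, the averages and the timing gap in the THUMOS14 table above can be recomputed from the per-threshold values. This is only an arithmetic sketch: the dictionary layout and the helper `average_map` are illustrative choices, and the numbers are copied from the table rather than measured.

```python
# Per-threshold mAP (tIoU 0.3 to 0.7) and inference time copied from the table.
results = {
    "ActionFormer": {"map": [82.1, 77.8, 71.0, 59.4, 43.9], "time_ms": 80},
    "TemporalMaxer": {"map": [82.8, 78.9, 71.8, 60.5, 44.7], "time_ms": 50},
}


def average_map(values):
    """Mean over tIoU thresholds, rounded to one decimal as in the table."""
    return round(sum(values) / len(values), 1)


averages = {name: average_map(r["map"]) for name, r in results.items()}
# Ratio of reported inference times (ActionFormer / TemporalMaxer).
speedup = results["ActionFormer"]["time_ms"] / results["TemporalMaxer"]["time_ms"]

print(averages)  # {'ActionFormer': 66.8, 'TemporalMaxer': 67.7}
print(speedup)   # 1.6
```

The recomputed averages match the reported 66.8 and 67.7, and the reported times correspond to TemporalMaxer running about 1.6 times faster than the ActionFormer baseline.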

This review was generated by AI and checked by a human editor.