QUICK REVIEW

[Paper Review] ACM-Net: Action Context Modeling Network for Weakly-Supervised Temporal Action Localization

Sanqing Qu, Guang Chen|arXiv (Cornell University)|Apr 7, 2021

Human Pose and Action Recognition|References: 53|Citations: 44

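One-line summary

ACM-Net introduces a three-branch action-context attention mechanism that separates action instances, action context, and non-action background using only video-level supervision. It achieves state-of-the-art weakly-supervised temporal action localization on THUMOS-14 and ActivityNet-1.3, and performs comparably to some fully-supervised methods.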

ABSTRACT

Weakly-supervised temporal action localization aims to localize action instances temporal boundary and identify the corresponding action category with only video-level labels. Traditional methods mainly focus on foreground and background frames separation with only a single attention branch and class activation sequence. However, we argue that apart from the distinctive foreground and background frames there are plenty of semantically ambiguous action context frames. It does not make sense to group those context frames to the same background class since they are semantically related to a specific action category. Consequently, it is challenging to suppress action context frames with only a single class activation sequence. To address this issue, in this paper, we propose an action-context modeling network termed ACM-Net, which integrates a three-branch attention module to measure the likelihood of each temporal point being action instance, context, or non-action background, simultaneously. Then based on the obtained three-branch attention values, we construct three-branch class activation sequences to represent the action instances, contexts, and non-action backgrounds, individually. To evaluate the effectiveness of our ACM-Net, we conduct extensive experiments on two benchmark datasets, THUMOS-14 and ActivityNet-1.3. The experiments show that our method can outperform current state-of-the-art methods, and even achieve comparable performance with fully-supervised methods. Code can be found at https://github.com/ispc-lab/ACM-Net

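Motivation and objectives

Motivation: go beyond simple foreground/background separation to better suppress semantically ambiguous action-context frames.
Proposes ACM-Net, which uses a three-branch class-agnostic attention module to build CAS_ins, CAS_con, and CAS_bak, the class activation sequences for action instances, context, and background.
Applies MIL with video-level labels only to each of the three CAS branches to optimize video-level action classification.
Adds auxiliary losses (attention guidance, feature separation, sparse attention) to improve snippet-level discrimination and localization.
Reports state-of-the-art or competitive weakly-supervised performance on THUMOS-14 and ActivityNet-1.3.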

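Proposed method

Split each video into non-overlapping snippets and extract RGB and optical-flow features F(t).
Embed the features into X with a trainable convolution-based embedding.
Compute the initial class activation sequence (CAS): Phi = MLP(X).
Apply the three-branch attention module, obtaining att_ins, att_con, and att_bak via softmax(Conv(X)).
Build CAS_ins = att_ins * CAS, CAS_con = att_con * CAS, and CAS_bak = att_bak * CAS, where CAS is the initial sequence Phi from the previous step.
Aggregate the top-k scores with MIL to obtain the video-level class probabilities p_ins, p_con, and p_bak, and compute the corresponding cross-entropy losses.
Combine the classification loss L_cls = L_cls_ins + L_cls_con + L_cls_bak with the auxiliary losses L_gui, L_feat, and L_spa, which promote attention guidance, feature separation, and sparsity.
At inference, classify with p_ins, localize actions by thresholding CAS_ins and att_ins, and apply NMS; proposal scores are refined with the Outer-Inner-Contrastive score.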
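The attention, CAS, and MIL pooling steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the class scores are assumed to include a trailing background column, top-k aggregation is taken to be a mean followed by a softmax, and localization is reduced to thresholding att_ins alone.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def three_branch_cas(cas, att_logits):
    """Weight the initial CAS by per-snippet attention.

    cas:        T rows of class scores (assumed layout: last column = background)
    att_logits: T rows of 3 raw attention scores, ordered (instance, context, background)
    Returns a dict with the three attention-weighted sequences.
    """
    att = [softmax(row) for row in att_logits]  # each row sums to 1 across branches
    names = ("ins", "con", "bak")
    return {
        name: [[att[t][b] * score for score in cas[t]] for t in range(len(cas))]
        for b, name in enumerate(names)
    }

def topk_video_probs(branch_cas, k):
    """MIL pooling (assumed form): mean of the k highest snippet scores per class, then softmax."""
    num_classes = len(branch_cas[0])
    pooled = []
    for c in range(num_classes):
        top = sorted((row[c] for row in branch_cas), reverse=True)[:k]
        pooled.append(sum(top) / len(top))
    return softmax(pooled)

def threshold_segments(scores, thresh):
    """Group consecutive snippets with score > thresh into [start, end) index pairs."""
    segments, start = [], None
    for t, s in enumerate(scores):
        if s > thresh and start is None:
            start = t
        elif s <= thresh and start is not None:
            segments.append((start, t))
            start = None
    if start is not None:
        segments.append((start, len(scores)))
    return segments

# Toy input: 4 snippets, 2 action classes plus background (values are made up).
cas = [[2.0, 0.1, 0.0], [1.5, 0.2, 0.1], [0.1, 0.0, 1.0], [0.0, 0.1, 1.2]]
att_logits = [[3.0, 0.5, 0.0], [2.0, 1.0, 0.0], [0.0, 0.5, 2.5], [0.0, 0.0, 3.0]]

branches = three_branch_cas(cas, att_logits)
p_ins = topk_video_probs(branches["ins"], k=2)
att_ins = [softmax(row)[0] for row in att_logits]
segments = threshold_segments(att_ins, 0.5)
```

In this toy input the first two snippets carry high instance attention, so p_ins favors class 0 and thresholding att_ins yields a single segment covering those snippets. Because the attention is a softmax across branches, the three weighted sequences sum back to the initial CAS at every snippet.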

Experimental results

Research questions

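RQ1: Can action-context frames be effectively separated from action instances and non-action background under weak supervision?
RQ2: Does the three-branch attention mechanism improve discrimination of action instances, context, and background compared with foreground-background approaches?
RQ3: How do the auxiliary losses affect snippet-level discrimination and overall localization performance?
RQ4: How much does the method improve over prior methods on the standard weakly-supervised TAL benchmarks (THUMOS-14, ActivityNet-1.3)?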

Key findings

Dataset	Method	mAP@t-IoU 0.10	mAP@t-IoU 0.20	mAP@t-IoU 0.30	mAP@t-IoU 0.40	mAP@t-IoU 0.50	mAP@t-IoU 0.60	mAP@t-IoU 0.70	Avg[0.1-0.5]	Avg[0.3-0.7]	Avg[0.1-0.7]
THUMOS-14	ACM-Net (Ours)	68.9	62.7	55.0	44.6	34.6	21.8	10.8	53.2	33.4	42.6

Dataset	Method	mAP@t-IoU 0.50	mAP@t-IoU 0.75	mAP@t-IoU 0.95	Avg
ActivityNet-1.3	ACM-Net (Ours)	40.1	24.2	6.2	24.6
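As a quick consistency check, the averaged THUMOS-14 figures can be recomputed from the per-threshold mAP values reported here. The helper name `average_map` is hypothetical, and rounding to one decimal is an assumption about how the averages were reported.

```python
# Per-threshold THUMOS-14 mAP values (%) as reported in this review.
thumos_map = {0.1: 68.9, 0.2: 62.7, 0.3: 55.0, 0.4: 44.6, 0.5: 34.6, 0.6: 21.8, 0.7: 10.8}

def average_map(scores, thresholds):
    """Mean mAP over the given t-IoU thresholds, rounded to one decimal."""
    return round(sum(scores[t] for t in thresholds) / len(thresholds), 1)

avg_01_05 = average_map(thumos_map, [0.1, 0.2, 0.3, 0.4, 0.5])  # 53.2
avg_03_07 = average_map(thumos_map, [0.3, 0.4, 0.5, 0.6, 0.7])  # 33.4
avg_all = average_map(thumos_map, sorted(thumos_map))           # 42.6
```

All three values match the reported averages, which also indicates that the overall average is taken over the seven thresholds from 0.1 to 0.7.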

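On THUMOS-14, ACM-Net achieves mAP of 68.9 at t-IoU 0.1, 62.7 at 0.2, 55.0 at 0.3, 44.6 at 0.4, 34.6 at 0.5, 21.8 at 0.6, and 10.8 at 0.7, outperforming prior weakly-supervised methods.
On THUMOS-14, ACM-Net reaches 53.2 Avg[0.1-0.5] and 33.4 Avg[0.3-0.7], with an overall Avg of 42.6.
On ActivityNet-1.3, it achieves 40.1 mAP at IoU=0.50, 24.2 at IoU=0.75, and 6.2 at IoU=0.95, with an Avg of 24.6.
ACM-Net outperforms state-of-the-art weakly-supervised TAL methods on ActivityNet-1.3 and is competitive with some fully-supervised methods in certain IoU ranges.
Ablation analysis shows that the three-branch attention and CAS construction are effective at suppressing context frames and improving localization accuracy.
Qualitative visualizations show that ACM-Net can distinguish action instances from ambiguous context and background.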
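The t-IoU thresholds in these findings measure how well a predicted segment overlaps a ground-truth segment, and the same overlap measure is what NMS uses to discard redundant proposals. The sketch below shows one common formulation. The function names, the greedy NMS variant, and the 0.5 suppression threshold are illustrative assumptions, not details taken from the paper.

```python
def temporal_iou(a, b):
    """Intersection-over-union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def hits_at_thresholds(pred, gt, thresholds):
    """For each t-IoU threshold, report whether pred counts as a match for gt."""
    iou = temporal_iou(pred, gt)
    return {t: iou >= t for t in thresholds}

def temporal_nms(proposals, iou_thresh):
    """Greedy NMS over (start, end, score) proposals, highest score first."""
    kept = []
    for p in sorted(proposals, key=lambda x: x[2], reverse=True):
        if all(temporal_iou(p[:2], k[:2]) <= iou_thresh for k in kept):
            kept.append(p)
    return kept

# A prediction of 2-8 s against a ground truth of 4-10 s overlaps for 4 s
# out of a 8 s union, so its t-IoU is 0.5.
hits = hits_at_thresholds((2.0, 8.0), (4.0, 10.0), [0.3, 0.5, 0.7])

# The second proposal overlaps the first too heavily and is suppressed.
kept = temporal_nms([(2.0, 8.0, 0.9), (3.0, 9.0, 0.8), (20.0, 25.0, 0.7)], 0.5)
```

The example prediction counts as correct at thresholds 0.3 and 0.5 but not at 0.7, which is why mAP drops as the t-IoU threshold tightens.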

This review was generated by AI and checked by a human editor.