QUICK REVIEW

[Paper Review] Few-shot Action Recognition with Prototype-centered Attentive Learning

Xiatian Zhu, Antoine Toisoul|arXiv (Cornell University)|Jan 20, 2021

Human Pose and Action Recognition|References: 38|Citations: 36

One-line summary

PAL introduces a prototype-centered contrastive loss and a Hybrid Attentive Learning mechanism to improve data efficiency and robustness to outliers and inter-class overlap in few-shot action recognition, achieving state-of-the-art results on four benchmarks.

ABSTRACT

Few-shot action recognition aims to recognize action classes with few training samples. Most existing methods adopt a meta-learning approach with episodic training. In each episode, the few samples in a meta-training task are split into support and query sets. The former is used to build a classifier, which is then evaluated on the latter using a query-centered loss for model updating. There are however two major limitations: lack of data efficiency due to the query-centered only loss design and inability to deal with the support set outlying samples and inter-class distribution overlapping problems. In this paper, we overcome both limitations by proposing a new Prototype-centered Attentive Learning (PAL) model composed of two novel components. First, a prototype-centered contrastive learning loss is introduced to complement the conventional query-centered learning objective, in order to make full use of the limited training samples in each episode. Second, PAL further integrates a hybrid attentive learning mechanism that can minimize the negative impacts of outliers and promote class separation. Extensive experiments on four standard few-shot action benchmarks show that our method clearly outperforms previous state-of-the-art methods, with the improvement particularly significant (10+%) on the most challenging fine-grained action recognition benchmark.

Motivation and objectives

Address data inefficiency and sensitivity to outliers in few-shot action recognition.
Leverage limited episode data by combining prototype-centered contrastive learning with query-/support-set attention.
Mitigate inter-class overlap and intra-class outliers through a hybrid attentive learning framework.
Demonstrate state-of-the-art performance on four standard few-shot action benchmarks, especially fine-grained datasets.

Proposed method

Adopt a ProtoNet-based framework augmented with Prototype-centered Attentive Learning (PAL).
Introduce a hybrid attentive learning (HAL) module performing support-set self-attention and query-to-support cross-attention.
Define a prototype-centered contrastive loss to complement the conventional query-centered objective.
Compute per-class prototypes from contextually enriched support features and classify queries via cosine similarity to prototypes.
Pretrain the TSN-based feature embedding network on the full training set before episodic meta-training.
Train in two stages: TSN pretraining, then episodic meta-training with PAL (meta loss + prototype-centered contrastive loss).
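
The steps above can be sketched in NumPy as follows. This is a minimal illustration under several assumptions, not the authors' implementation: features are taken to be fixed-length vectors already produced by the pretrained embedding network; the attention uses no learned projections; and the function names (`attend`, `pal_episode_losses`), the temperature `tau`, the equal weighting of the two losses, and the exact form of the prototype-centered loss (a softmax over queries for each prototype) are hypothetical choices made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def attend(q, kv):
    """Scaled dot-product attention with a residual connection.

    Simplification (assumption): no learned query/key/value projections.
    """
    scores = q @ kv.T / np.sqrt(q.shape[-1])
    weights = np.exp(log_softmax(scores, axis=1))
    return q + weights @ kv


def pal_episode_losses(support, support_y, query, query_y, n_way, tau=0.1):
    """Return (query-centered loss, prototype-centered loss, predictions).

    Assumes every class has at least one support and one query sample.
    """
    # Hybrid attention: support self-attention, then query-to-support cross-attention.
    support_ctx = attend(support, support)
    query_ctx = attend(query, support_ctx)

    # Per-class prototypes: mean of the contextualized support features.
    protos = np.stack(
        [support_ctx[support_y == k].mean(axis=0) for k in range(n_way)]
    )

    # Temperature-scaled cosine similarity, shape (n_query, n_way).
    sim = l2_normalize(query_ctx) @ l2_normalize(protos).T / tau

    # Query-centered loss: softmax over prototypes for each query (row-wise).
    q_logp = log_softmax(sim, axis=1)
    query_loss = -q_logp[np.arange(len(query_y)), query_y].mean()

    # Prototype-centered loss (assumed form): softmax over queries for each
    # prototype (column-wise), averaged over that prototype's positive queries.
    p_logp = log_softmax(sim, axis=0)
    onehot = np.eye(n_way)[query_y]
    per_class = -(p_logp * onehot).sum(axis=0) / onehot.sum(axis=0)
    proto_loss = per_class.mean()

    pred = sim.argmax(axis=1)
    return query_loss, proto_loss, pred


# Toy 5-way 2-shot episode with synthetic, well-separated class clusters.
rng = np.random.default_rng(0)
n_way, k_shot, n_query_per_class, dim = 5, 2, 3, 16
centers = 5.0 * np.eye(n_way, dim)
support_y = np.repeat(np.arange(n_way), k_shot)
query_y = np.repeat(np.arange(n_way), n_query_per_class)
support = centers[support_y] + 0.1 * rng.standard_normal((len(support_y), dim))
query = centers[query_y] + 0.1 * rng.standard_normal((len(query_y), dim))

query_loss, proto_loss, pred = pal_episode_losses(
    support, support_y, query, query_y, n_way
)
total_loss = query_loss + proto_loss  # equal weighting is an assumption
print("accuracy:", (pred == query_y).mean())
```

Both losses read the same query-to-prototype similarity matrix, only along different axes: the query-centered term asks each query to pick its class among the prototypes, while the prototype-centered term asks each prototype to pick its own queries among all queries in the episode. That second view is what lets every prototype contribute a training signal in each episode.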

Experimental results

Research questions

RQ1: Can prototype-centered contrastive learning improve data utilization in each episode compared to traditional query-centered losses?
RQ2: Does hybrid attention on support and query samples reduce the impact of outliers and inter-class overlap in few-shot tasks?
RQ3: How does PAL perform relative to state-of-the-art methods across coarse-grained and fine-grained action benchmarks?
RQ4: What is the contribution of pretraining the embedding network to overall performance in few-shot action recognition?

Key findings

Method	Kinetics-100 1-shot	Kinetics-100 5-shot	Sth-Sth-100 1-shot	Sth-Sth-100 5-shot	HMDB51 1-shot	HMDB51 5-shot	UCF101 1-shot	UCF101 5-shot
Matching Net	53.3	74.6	-	-	-	-	-	-
MAML	54.2	75.3	-	-	-	-	-	-
ProtoNet++	64.5	77.9	33.6	43.0	-	-	-	-
TARN	64.8	78.5	-	-	-	-	-	-
TRN++	68.4	82.0	38.6	48.9	-	-	-	-
CMN	60.5	78.9	-	-	-	-	-	-
CMN++	65.4	78.8	34.4	43.8	-	-	-	-
OTAM	73.0	85.8	42.8	52.3	-	-	-	-
ARN	63.7	82.4	-	-	45.5	60.6	66.3	83.1
FEAT	74.0	86.5	45.3	61.2	60.4	75.2	83.9	94.5
PAL (Ours)	74.2	87.1	46.4	62.6	60.9	75.8	85.3	95.2

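PAL achieves state-of-the-art results on four benchmarks, with the improvement most pronounced on the fine-grained Sth-Sth-100 dataset in the 5-shot setting (about 10 points; in the table above, 62.6 vs. 52.3 for OTAM).
Hybrid Attentive Learning (HAL) and Prototype-centered Contrastive Learning (PCL) provide complementary gains over the pretrained baseline, improving both 1-shot and 5-shot accuracy.
Pretraining the embedding network is critical: it yields strong feature representations and reduces reliance on episodic adaptation.
By combining query-centered and prototype-centered signals, PAL reduces intra-class variation and inter-class overlap.
PAL consistently outperforms OTAM and FEAT on the challenging datasets, underscoring its robustness to outliers and class overlap.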
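
The margins behind these findings can be checked directly against the table above. The short script below is only an arithmetic aid: the accuracy values are copied from the OTAM, FEAT, and PAL rows, while the dictionary layout and the `margins` helper are hypothetical names introduced for this example.

```python
# Accuracy (%) copied from the table above as (1-shot, 5-shot) per benchmark.
results = {
    "OTAM": {
        "Kinetics-100": (73.0, 85.8),
        "Sth-Sth-100": (42.8, 52.3),
    },
    "FEAT": {
        "Kinetics-100": (74.0, 86.5),
        "Sth-Sth-100": (45.3, 61.2),
        "HMDB51": (60.4, 75.2),
        "UCF101": (83.9, 94.5),
    },
    "PAL": {
        "Kinetics-100": (74.2, 87.1),
        "Sth-Sth-100": (46.4, 62.6),
        "HMDB51": (60.9, 75.8),
        "UCF101": (85.3, 95.2),
    },
}


def margins(baseline):
    """PAL minus baseline, in accuracy points, for benchmarks the baseline reports."""
    out = {}
    for bench, (b1, b5) in results[baseline].items():
        p1, p5 = results["PAL"][bench]
        out[bench] = (round(p1 - b1, 1), round(p5 - b5, 1))
    return out


margins_vs_otam = margins("OTAM")
margins_vs_feat = margins("FEAT")

for name, table in (("OTAM", margins_vs_otam), ("FEAT", margins_vs_feat)):
    for bench, (m1, m5) in table.items():
        print(f"PAL vs {name} on {bench}: 1-shot {m1:+.1f}, 5-shot {m5:+.1f}")
```

Read this way, the roughly 10-point gain on Sth-Sth-100 5-shot is the margin over OTAM, while the margins over the FEAT row are smaller on every benchmark listed.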

This review was generated by AI and checked by a human editor.