QUICK REVIEW

[Paper Review] Weakly Supervised Action Localization by Sparse Temporal Pooling Network

Phuc Nguyen, Ting Liu | arXiv (Cornell University) | 2017-12-14

Human Pose and Action Recognition | References: 46 | Citations: 53

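One-Line Summary

This paper proposes the Sparse Temporal Pooling Network (STPN), a weakly supervised method that learns from video-level labels on untrimmed videos with a sparsity-regularized attention mechanism and generates temporal proposals from Temporal Class Activation Maps (T-CAMs).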

ABSTRACT

We propose a weakly supervised temporal action localization algorithm on untrimmed videos using convolutional neural networks. Our algorithm learns from video-level class labels and predicts temporal intervals of human actions with no requirement of temporal localization annotations. We design our network to identify a sparse subset of key segments associated with target actions in a video using an attention module and fuse the key segments through adaptive temporal pooling. Our loss function is comprised of two terms that minimize the video-level action classification error and enforce the sparsity of the segment selection. At inference time, we extract and score temporal proposals using temporal class activations and class-agnostic attentions to estimate the time intervals that correspond to target actions. The proposed algorithm attains state-of-the-art results on the THUMOS14 dataset and outstanding performance on ActivityNet1.3 even with its weak supervision.

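Motivation and Objectives

Motivate learning to localize actions in untrimmed videos from video-level labels alone, with no temporal localization annotations.
Develop a network that selects a sparse subset of key video segments for action recognition.
Fuse class-agnostic attention with temporal class activations to propose action intervals.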

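Proposed Method

Video segments are represented with a two-stream I3D feature extractor over RGB and optical flow (both streams pretrained on Kinetics).
An attention module produces segment-level weights, and a sparsity loss enforces a sparse selection of segments.
Video-level classification is performed on the attention-weighted temporal pooling of the segment features.
Temporal Class Activation Maps (T-CAMs) are computed for each class to form one-dimensional temporal proposals.
The weighted T-CAMs of the RGB and flow streams are combined through a fusion parameter alpha to score the proposals.
Non-maximum suppression (NMS) is applied to the temporal proposals of each class.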
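
The steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the attention module is reduced to a single linear layer with a sigmoid, the sparsity term is taken to be an L1 penalty on the attention weights, and the weight `beta`, the default `alpha=0.5`, the thresholding rule for forming proposals, and all function names are hypothetical. I3D feature extraction is assumed to have happened upstream, so `features` is a `(T, D)` array of per-segment features.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention_weights(features, w_att, b_att):
    """Class-agnostic attention: one weight in (0, 1) per segment.
    Assumption: a single linear layer stands in for the attention module."""
    return sigmoid(features @ w_att + b_att)            # shape (T,)

def video_logits(features, attn, w_cls):
    """Attention-weighted temporal pooling, then a linear classifier."""
    pooled = (attn[:, None] * features).sum(axis=0)     # shape (D,)
    return pooled @ w_cls                               # shape (C,)

def stpn_style_loss(logits, labels, attn, beta=1e-4):
    """Video-level multi-label cross-entropy plus a sparsity term.
    Assumption: sparsity is an L1 penalty on attention; beta is hypothetical."""
    probs = sigmoid(logits)
    eps = 1e-8
    bce = -np.mean(labels * np.log(probs + eps)
                   + (1.0 - labels) * np.log(1.0 - probs + eps))
    return bce + beta * np.abs(attn).sum()

def weighted_tcam(features, attn, w_cls):
    """Per-segment class activations scaled by attention, shape (T, C)."""
    return attn[:, None] * sigmoid(features @ w_cls)

def fuse(tcam_rgb, tcam_flow, alpha=0.5):
    """Convex combination of the two streams with fusion parameter alpha."""
    return alpha * tcam_rgb + (1.0 - alpha) * tcam_flow

def proposals_for_class(scores, threshold):
    """Group consecutive above-threshold segments into
    (start_index, end_index, mean_score) tuples for one class."""
    scores = np.asarray(scores, dtype=float)
    out, start = [], None
    for t, s in enumerate(scores):
        if s > threshold and start is None:
            start = t
        elif s <= threshold and start is not None:
            out.append((start, t - 1, float(scores[start:t].mean())))
            start = None
    if start is not None:
        out.append((start, len(scores) - 1, float(scores[start:].mean())))
    return out
```

In this sketch each stream would get its own `weighted_tcam`, the two maps would be merged with `fuse`, and one column of the fused map would be passed to `proposals_for_class` to obtain candidate intervals for that class before NMS.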

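Experimental Results

Research Questions

RQ1: Can actions be accurately localized in untrimmed videos from video-level labels alone?
RQ2: Does enforcing sparsity in segment selection improve weakly supervised action localization?
RQ3: How effective are T-CAMs combined with class-agnostic attention at proposing action intervals?
RQ4: How do RGB, flow, or the combination of the two affect proposal scoring?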

Key Results

Method	AP@IoU=0.1	AP@IoU=0.2	AP@IoU=0.3	AP@IoU=0.4	AP@IoU=0.5	AP@IoU=0.6	AP@IoU=0.7	AP@IoU=0.8	AP@IoU=0.9
STPN	52.0	44.7	35.5	25.8	16.9	9.9	4.3	1.2	0.1
STPN with UntrimmedNet features	45.3	38.8	31.1	23.5	16.2	9.8	5.1	2.0	0.3
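
The column headers refer to temporal intersection over union (IoU) between a predicted interval and a ground-truth interval; under the usual evaluation convention, a prediction counts as correct at a given threshold when its IoU with a ground-truth instance of the same class reaches that threshold. The sketch below shows the overlap measure and, as an assumption about how the per-class NMS step could be implemented, a greedy suppression that reuses it. The function names, the `(start, end, score)` tuple layout, and the default `iou_threshold` are hypothetical.

```python
def temporal_iou(a, b):
    """IoU of two 1D intervals given as (start, end) with start <= end."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def temporal_nms(proposals, iou_threshold=0.5):
    """Greedy NMS over (start, end, score) tuples of a single class.
    Keeps the highest-scoring proposal, then drops any later proposal
    whose IoU with an already kept one exceeds iou_threshold."""
    kept = []
    for p in sorted(proposals, key=lambda q: q[2], reverse=True):
        if all(temporal_iou(p[:2], k[:2]) <= iou_threshold for k in kept):
            kept.append(p)
    return kept
```

For instance, intervals `(0, 10)` and `(5, 15)` overlap for 5 units out of a union of 15, so a prediction with that overlap would be counted at the 0.1 through 0.3 thresholds but not at 0.4 and above.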

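STPN achieves state-of-the-art performance on THUMOS14 among weakly supervised methods.
On THUMOS14, STPN with UntrimmedNet features outperforms prior weakly supervised approaches.
On ActivityNet1.3, STPN is competitive despite its weak supervision and surpasses some fully supervised baselines in certain settings.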
The ablation study shows that both the attention mechanism and the sparsity loss substantially improve performance.
Two-stream (RGB + flow) features outperform either single modality, with flow providing the stronger cue for localization in particular.

This review was generated by AI and reviewed by a human editor.