QUICK REVIEW

[Paper Review] TARN: Temporal Attentive Relation Network for Few-Shot and Zero-Shot Action Recognition

Mina Bishay, Georgios Zoumpourlis | arXiv (Cornell University) | July 21, 2019

Human Pose and Action Recognition | Citations: 84

One-Line Summary

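TARN introduces a temporal attentive relation network for few-shot and zero-shot action recognition. It uses segment-level attention to align video segments and learns a deep metric for video matching, achieving state-of-the-art results in few-shot learning (FSL) and competitive results in zero-shot learning (ZSL) without fine-tuning or additional memory modules.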

ABSTRACT

In this paper we propose a novel Temporal Attentive Relation Network (TARN) for the problems of few-shot and zero-shot action recognition. At the heart of our network is a meta-learning approach that learns to compare representations of variable temporal length, that is, either two videos of different length (in the case of few-shot action recognition) or a video and a semantic representation such as word vector (in the case of zero-shot action recognition). By contrast to other works in few-shot and zero-shot action recognition, we a) utilise attention mechanisms so as to perform temporal alignment, and b) learn a deep-distance measure on the aligned representations at video segment level. We adopt an episode-based training scheme and train our network in an end-to-end manner. The proposed method does not require any fine-tuning in the target domain or maintaining additional representations as is the case of memory networks. Experimental results show that the proposed architecture outperforms the state of the art in few-shot action recognition, and achieves competitive results in zero-shot action recognition.

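Motivation and Objectives

Address few-shot action recognition by comparing video segments rather than whole videos.
Extend the approach to zero-shot action recognition by relating video segments to semantic class representations.
Develop an end-to-end trainable architecture that needs neither a memory network nor fine-tuning on the target domain.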

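Proposed Method

The embedding module extracts C3D features from the video segments and passes them through a bidirectional GRU to produce segment embeddings.
The relation module applies segment-by-segment attention to align the sample and query segments, bringing both representations to the same number of segments.
The segment-wise comparisons are fed into a deep metric learning network that outputs a relation score for each video pair.
A softmax over the relation scores yields class probabilities; in the K-shot case, the scores are averaged per class.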
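
The alignment and scoring steps listed above can be sketched in plain Python. This is only an illustrative sketch under several assumptions, not the authors' implementation: the segment embeddings are toy 2-D vectors rather than C3D and GRU outputs, the attention is an unparameterised dot-product softmax, the sample is aligned to the query (the direction is assumed here), and the learned deep metric is replaced by a fixed mean cosine similarity. All function names and the class labels "wave" and "jump" are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def align_to_query(sample, query):
    """For each query segment, build an attention-weighted mix of sample segments.

    The result has len(query) vectors, so both videos end up with the same
    number of segments regardless of the sample's original length.
    """
    dim = len(sample[0])
    aligned = []
    for q in query:
        weights = softmax([dot(q, s) for s in sample])
        aligned.append([sum(w * s[d] for w, s in zip(weights, sample))
                        for d in range(dim)])
    return aligned

def cosine(u, v, eps=1e-12):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)) + eps)

def relation_score(sample, query):
    """Assumed stand-in for the learned deep metric: mean segment-wise cosine."""
    aligned = align_to_query(sample, query)
    return sum(cosine(h, q) for h, q in zip(aligned, query)) / len(query)

def classify(support, query):
    """Average relation scores per class (K-shot), then softmax over classes."""
    per_class = {}
    for label, video in support:
        per_class.setdefault(label, []).append(relation_score(video, query))
    labels = sorted(per_class)
    means = [sum(per_class[c]) / len(per_class[c]) for c in labels]
    return dict(zip(labels, softmax(means)))

# Toy 2-way 2-shot episode; videos have different numbers of segments.
support = [
    ("wave", [[1.0, 0.1], [0.8, 0.0], [1.0, 0.0]]),
    ("wave", [[0.9, 0.0], [1.0, 0.2]]),
    ("jump", [[0.0, 1.0], [0.1, 0.9]]),
    ("jump", [[0.2, 1.0], [0.0, 0.8], [0.1, 1.0], [0.0, 0.9]]),
]
query = [[1.0, 0.0], [0.9, 0.1]]
probs = classify(support, query)
print(probs)
```

In this toy episode the query segments point in roughly the same direction as the "wave" videos, so that class receives the higher probability. In the described method the comparison and scoring stages are trained end to end, which a fixed cosine score cannot reproduce.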

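Experimental Results

Research Questions

RQ1: Can segment-level attention improve temporal alignment and matching in few-shot action recognition?
RQ2: Does segment-wise comparison with a learned deep distance metric outperform whole-video or fixed-distance approaches in FSL?
RQ3: Can the framework be extended to zero-shot action recognition, with semantic vectors serving as the alignment target for video segments?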

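Key Results

TARN, which combines segment-by-segment attention with deep metric learning, outperforms the state of the art in few-shot action recognition across the 1-shot to 5-shot settings.
Using EucCos as the similarity measure in the comparison layer gives the best results among the options tested.
Attention-based multi-segment comparison outperforms the single-vector baseline (TARN-single) across datasets and feature types.
In the zero-shot setting, TARN achieves competitive results, particularly on the UCF-101 splits, and the multi-segment-to-attribute comparison gives the best performance.
Within this framework, C3D-based features generally outperform ResNet-50 features for few-shot action recognition.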
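
The review does not define EucCos. A minimal sketch follows, assuming it means pairing the Euclidean distance and the cosine similarity between each aligned sample segment and the corresponding query segment. The function names `euc_cos` and `compare_segments` are hypothetical, and the exact form used in the paper may differ.

```python
import math

def euc_cos(u, v, eps=1e-12):
    """Return [Euclidean distance, cosine similarity] for two equal-length vectors."""
    euc = math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v + eps)
    return [euc, cos]

def compare_segments(aligned_sample, query):
    """Apply the comparison segment by segment; both inputs have equal length."""
    return [euc_cos(h, q) for h, q in zip(aligned_sample, query)]

# Two aligned segment pairs: one identical, one orthogonal.
features = compare_segments([[1.0, 0.0], [3.0, 0.0]], [[1.0, 0.0], [0.0, 4.0]])
print(features)
```

Each segment pair yields a two-element feature, so the sequence passed to the metric network would have one such pair per query segment under this assumption.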


This review was generated by AI and reviewed by a human editor.