QUICK REVIEW

[Paper Review] Self-supervised Spatiotemporal Feature Learning by Video Geometric Transformations

Longlong Jing, Yingli Tian | arXiv (Cornell University) | November 28, 2018

Human Pose and Action Recognition | References: 35 | Citations: 76

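One-Line Summary

This paper proposes a self-supervised learning framework that requires no human labels and learns spatiotemporal video features by using recognition of geometric transformations, such as rotations of 0°, 90°, 180°, and 270°, as the pretext task. Pre-training on unlabeled data improves action recognition over training from scratch by 20.4% on UCF101 and by 16.7% on HMDB51, and the framework reaches 62.9% and 33.7% top-1 accuracy respectively, outperforming prior fully self-supervised methods.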

ABSTRACT

To alleviate the expensive cost of data collection and annotation, many self-supervised learning methods were proposed to learn image representations without human-labeled annotations. However, self-supervised learning for video representations is not yet well-addressed. In this paper, we propose a novel 3DConvNet-based fully self-supervised framework to learn spatiotemporal video features without using any human-labeled annotations. First, a set of pre-designed geometric transformations (e.g. rotating 0 degree, 90 degrees, 180 degrees, and 270 degrees) are applied to each video. Then a pretext task can be defined as recognizing the pre-designed geometric transformations. Therefore, the spatiotemporal video features can be learned in the process of accomplishing this pretext task without using human-labeled annotations. The learned spatiotemporal video representations can further be employed as pretrained features for different video-related applications. The proposed geometric transformations (e.g. rotations) are proved to be effective to learn representative spatiotemporal features in our 3DConvNet-based fully self-supervised framework. With the pre-trained spatiotemporal features from two large video datasets, the performance of action recognition is significantly boosted up by 20.4% on UCF101 dataset and 16.7% on HMDB51 dataset respectively compared to that from the model trained from scratch. Furthermore, our framework outperforms the state-of-the-arts of fully self-supervised methods on both UCF101 and HMDB51 datasets and achieves 62.9% and 33.7% accuracy respectively.

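Motivation and Objectives

To address the high cost of labeling video data by learning spatiotemporal video features in a self-supervised manner.
To develop a fully self-supervised framework for video representation learning that does not rely on human labels.
To improve action recognition performance using pre-trained features obtained from a geometric-transformation pretext task.
To show that geometric transformations can serve as an effective supervision signal for learning meaningful spatiotemporal features from video.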

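Proposed Method

A set of pre-defined geometric transformations (rotations of 0°, 90°, 180°, and 270°) is applied to each input video clip.
A 3DConvNet is trained to predict which geometric transformation was applied, which is the pretext task, and learns spatiotemporal features in the process.
The framework is trained end to end with no human labels at all, relying only on the transformation-prediction task.
The learned features are then fine-tuned for downstream video classification tasks such as action recognition.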
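Telling the transformations apart requires the network to attend to the spatial and temporal structure of each clip, rather than be invariant to the rotation, and this is intended to yield robust video representations.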
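The network is pre-trained on two large video datasets, and the learned features are then evaluated on action recognition to measure generalization and performance.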
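
The data side of the pretext task described above can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the names `ROTATIONS`, `rotate_frame_90`, `rotate_clip`, and `make_pretext_samples` are hypothetical, a clip is modeled as a plain nested list (frames, rows, pixels), and the sketch assumes every clip is paired with all four rotations, which this review does not confirm.

```python
# Rotation angles used as pretext classes; the list index is the class label.
ROTATIONS = (0, 90, 180, 270)


def rotate_frame_90(frame):
    """Rotate one frame (a list of rows) by 90 degrees counter-clockwise."""
    # Transposing and then reversing the row order gives a counter-clockwise turn.
    return [list(row) for row in zip(*frame)][::-1]


def rotate_clip(clip, label):
    """Rotate every frame of a clip by 90 degrees, `label` times in a row."""
    # Copy first so the caller's clip is never modified.
    rotated = [[list(row) for row in frame] for frame in clip]
    for _ in range(label):
        rotated = [rotate_frame_90(frame) for frame in rotated]
    return rotated


def make_pretext_samples(clip):
    """Return (rotated_clip, label) pairs, one per rotation class.

    Assumption: all four rotations are generated for each clip. Sampling a
    single random rotation per clip would be an equally plausible design.
    """
    return [(rotate_clip(clip, label), label) for label in range(len(ROTATIONS))]


if __name__ == "__main__":
    # A toy clip: 2 frames of 2x2 "pixels".
    toy_clip = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
    for rotated, label in make_pretext_samples(toy_clip):
        print(ROTATIONS[label], rotated)
```

The labels come for free from the transformation itself, which is why no human annotation is needed. A 3D convolutional classifier with four output classes would then be trained on these pairs; in practice an array library would replace the nested lists.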

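Experimental Results

Research Questions

RQ1: Can geometric transformations serve as an effective supervision signal for self-supervised video representation learning?
RQ2: How well can a 3DConvNet learn spatiotemporal features through the geometric-transformation prediction pretext task without any human labels?
RQ3: How much does pre-training with this method improve downstream action recognition compared with a model trained from scratch?
RQ4: How does the framework perform against state-of-the-art fully self-supervised video learning methods on standard benchmarks?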

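Key Results

The proposed method achieves 62.9% top-1 accuracy on UCF101, outperforming state-of-the-art fully self-supervised methods.
On HMDB51 it reaches 33.7% top-1 accuracy, setting a new state of the art among fully self-supervised video learning methods.
On UCF101, pre-training with the geometric-transformation pretext task improves action recognition accuracy by 20.4% over training from scratch.
On HMDB51, the method improves action recognition accuracy by 16.7% over a model trained without pre-training.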
Geometric transformations such as rotations are effective for learning representative spatiotemporal features without human labels.
The gains hold on both UCF101 and HMDB51, which suggests that the self-supervised signal transfers across datasets.
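
The relationship between the reported accuracies and the reported gains can be made explicit with a little arithmetic. The sketch below rests on two assumptions that this review does not state: that the gains are absolute differences in percentage points, and that they are measured against the same model that reaches the reported accuracy. The function name `implied_scratch_accuracy` is hypothetical, and the values it returns are implied by those assumptions, not figures reported here.

```python
# Figures quoted in this review (accuracy in percent).
REPORTED = {
    "UCF101": {"accuracy": 62.9, "gain_over_scratch": 20.4},
    "HMDB51": {"accuracy": 33.7, "gain_over_scratch": 16.7},
}


def implied_scratch_accuracy(accuracy, gain):
    """Back out the from-scratch accuracy implied by a reported gain.

    Assumption: `gain` is an absolute difference in percentage points,
    not a relative improvement.
    """
    # Round to one decimal to avoid floating-point noise in the subtraction.
    return round(accuracy - gain, 1)


implied = {
    name: implied_scratch_accuracy(row["accuracy"], row["gain_over_scratch"])
    for name, row in REPORTED.items()
}

if __name__ == "__main__":
    for name, value in implied.items():
        print(f"{name}: implied from-scratch accuracy {value}%")
```

If the gains were relative improvements instead, the implied baselines would be different, so the paper's own tables should be checked before quoting these values.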

This review was generated by AI and checked by a human editor.