QUICK REVIEW

[Paper Review] Self-supervised Co-training for Video Representation Learning

Tengda Han, Weidi Xie|arXiv (Cornell University)|2020. 10. 19.

Human Pose and Action Recognition|References 72|Citations 273

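One-Line Summary

This paper presents CoCLR, a self-supervised co-training framework that exchanges positive samples between the RGB and optical flow views. This improves contrastive learning and approaches the performance of the label-based UberNCE on video action recognition and retrieval, while training more efficiently.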

ABSTRACT

The objective of this paper is visual-only self-supervised video representation learning. We make the following contributions: (i) we investigate the benefit of adding semantic-class positives to instance-based Info Noise Contrastive Estimation (InfoNCE) training, showing that this form of supervised contrastive learning leads to a clear improvement in performance; (ii) we propose a novel self-supervised co-training scheme to improve the popular infoNCE loss, exploiting the complementary information from different views, RGB streams and optical flow, of the same data source by using one view to obtain positive class samples for the other; (iii) we thoroughly evaluate the quality of the learnt representation on two different downstream tasks: action recognition and video retrieval. In both cases, the proposed approach demonstrates state-of-the-art or comparable performance with other self-supervised approaches, whilst being significantly more efficient to train, i.e. requiring far less training data to achieve similar performance.

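Motivation and Objectives

Investigate whether instance discrimination alone makes the best use of video data for self-supervised learning.
Evaluate whether hard positives from the same semantic class can improve contrastive video representations.
Propose CoCLR, a self-supervised co-training scheme that mines positives across complementary views (RGB and optical flow).
Evaluate the learned representations on downstream action recognition and video retrieval using UCF101, HMDB51, and Kinetics-400.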

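Proposed Method

Compare the InfoNCE baseline (instance discrimination) with UberNCE, an oracle that uses semantic labels to define positives.
Introduce CoCLR to mine cross-view positives: the top-K most similar clips in the flow view are added as positives for RGB training, and vice versa.
Alternate optimization between the RGB and flow networks so that each representation improves progressively.
Train in two stages: (i) independent InfoNCE pretraining for RGB and for flow, (ii) alternating co-training with cross-view positives.
Evaluate the transferability of the learned representations with a linear probe and with retrieval.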
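
The cross-view mining step and the multi-positive contrastive loss described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function names (`mine_topk_positives`, `build_positive_mask`, `multi_positive_nce`), the in-batch similarity matrix, the temperature value, and the random toy features are all hypothetical. Flow features pick the top-K neighbours of each clip, and those neighbours are then treated as extra positives when computing the loss on the RGB features.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def mine_topk_positives(other_view_feats, k):
    """For each clip, return the indices of its k nearest clips in the OTHER view.

    The clip itself is excluded because it is already the instance-level positive.
    """
    z = l2_normalize(other_view_feats)
    sim = z @ z.T
    np.fill_diagonal(sim, -np.inf)           # never select the clip itself
    return np.argsort(-sim, axis=1)[:, :k]   # shape (N, k)

def build_positive_mask(n, mined_idx):
    """Boolean (N, N) mask: the diagonal plus the mined cross-view neighbours."""
    mask = np.eye(n, dtype=bool)
    rows = np.repeat(np.arange(n), mined_idx.shape[1])
    mask[rows, mined_idx.ravel()] = True
    return mask

def multi_positive_nce(query, keys, positive_mask, temperature=0.07):
    """Mean of -log(sum over positives of exp(s/t) / sum over all keys of exp(s/t))."""
    logits = l2_normalize(query) @ l2_normalize(keys).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    exp = np.exp(logits)
    pos = (exp * positive_mask).sum(axis=1)
    return float(np.mean(-np.log(pos / exp.sum(axis=1))))

# Toy data (assumption): 8 clips, random features standing in for encoder outputs.
rng = np.random.default_rng(0)
n_clips = 8
flow_feats = rng.normal(size=(n_clips, 16))
rgb_query = rng.normal(size=(n_clips, 32))
rgb_keys = rgb_query + 0.1 * rng.normal(size=(n_clips, 32))  # second augmented view

# Mine positives in the flow view, then use them to train the RGB view.
mined = mine_topk_positives(flow_feats, k=2)
coclr_mask = build_positive_mask(n_clips, mined)
instance_mask = np.eye(n_clips, dtype=bool)

loss_instance = multi_positive_nce(rgb_query, rgb_keys, instance_mask)
loss_coclr = multi_positive_nce(rgb_query, rgb_keys, coclr_mask)
```

With `instance_mask` the loss reduces to instance-level InfoNCE. Swapping the roles of the two views (mining in RGB, training flow) gives the other half of the alternation; in a real setup the features would come from the networks being trained rather than from random arrays.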

Figure 1: Two video clips of a golf-swing action and their corresponding optical flows. In this example, the flow patterns are very similar across different video instances despite significant variations in RGB space. This observation motivates the idea of co-training, which aims to gradually enhanc

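Experimental Results

Research Questions

RQ1: Does introducing semantic-class positives (UberNCE) improve video representation learning over instance-only InfoNCE?
RQ2: Can co-training between the RGB and optical flow views collect harder positives and improve downstream performance?
RQ3: How does CoCLR compare with single-view self-supervised methods and with UberNCE on action recognition and retrieval?
RQ4: How do hyperparameters such as the number of mined positives (K) and the number of alternation cycles affect CoCLR performance?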

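Key Results

UberNCE outperforms InfoNCE, indicating that instance discrimination alone can use the training data inefficiently.
CoCLR substantially outperforms InfoNCE and CMC, and approaches UberNCE on linear-probe action recognition (RGB) and on retrieval.
Two-stream CoCLR (RGB+Flow) improves results further, with the RGB and Flow models providing complementary gains.
End-to-end fine-tuning narrows the performance gap between training schemes, but CoCLR remains ahead in the pretrain-then-transfer setting.
CoCLR achieves state-of-the-art or comparable results against other self-supervised methods on UCF101 and Kinetics-400, while training more efficiently and requiring less data.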

Figure 3: Nearest neighbour retrieval results with CoCLR representations. The left side is the query video from the UCF101 testing set, and the right side are the top 3 nearest neighbours from the UCF101 training set. CoCLR is trained only on UCF101. The action label for each video is shown in the u
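
Nearest-neighbour retrieval of the kind shown in the figure can be scored with recall at k: a query counts as a hit if any of its k nearest gallery clips shares its action label. The sketch below is a hedged illustration of that idea using cosine similarity; the function name `recall_at_k` and the tiny hand-made features are assumptions, and the exact evaluation protocol in the paper may differ.

```python
import numpy as np

def recall_at_k(query_feats, query_labels, gallery_feats, gallery_labels, k):
    """Fraction of queries whose top-k gallery neighbours include the same label."""
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    sim = q @ g.T                               # cosine similarity, shape (Q, G)
    topk = np.argsort(-sim, axis=1)[:, :k]      # indices of the k nearest gallery clips
    hits = (gallery_labels[topk] == query_labels[:, None]).any(axis=1)
    return float(hits.mean())

# Toy data (assumption): a "training set" gallery with two classes and
# two "test set" queries, one near each class.
gallery_feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
gallery_labels = np.array([0, 0, 1, 1])
query_feats = np.array([[1.0, 0.05], [0.05, 1.0]])
query_labels = np.array([0, 1])

r_at_1 = recall_at_k(query_feats, query_labels, gallery_feats, gallery_labels, k=1)
```

In this toy case each query lies next to gallery clips of its own class, so the nearest neighbour already carries the matching label.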

This review was generated by AI and reviewed by a human editor.