QUICK REVIEW

[Paper Review] Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet?

Kensho Hara, Hirokatsu Kataoka|arXiv (Cornell University)|2017. 11. 27.

Human Pose and Action Recognition|References 8|Citations 114

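One-Line Summary

This paper investigates whether a large-scale video dataset (Kinetics) is sufficient to train very deep 3D CNNs from scratch, and whether such models outperform 2D CNNs pretrained on ImageNet on action recognition benchmarks. It finds that Kinetics supports training 3D ResNets up to 152 layers deep, and that ResNeXt-101 in particular outperforms several 2D baselines on UCF-101 and HMDB-51.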

ABSTRACT

The purpose of this study is to determine whether current video datasets have sufficient data for training very deep convolutional neural networks (CNNs) with spatio-temporal three-dimensional (3D) kernels. Recently, the performance levels of 3D CNNs in the field of action recognition have improved significantly. However, to date, conventional research has only explored relatively shallow 3D architectures. We examine the architectures of various 3D CNNs from relatively shallow to very deep ones on current video datasets. Based on the results of those experiments, the following conclusions could be obtained: (i) ResNet-18 training resulted in significant overfitting for UCF-101, HMDB-51, and ActivityNet but not for Kinetics. (ii) The Kinetics dataset has sufficient data for training of deep 3D CNNs, and enables training of up to 152 ResNets layers, interestingly similar to 2D ResNets on ImageNet. ResNeXt-101 achieved 78.4% average accuracy on the Kinetics test set. (iii) Kinetics pretrained simple 3D architectures outperforms complex 2D architectures, and the pretrained ResNeXt-101 achieved 94.5% and 70.2% on UCF-101 and HMDB-51, respectively. The use of 2D CNNs trained on ImageNet has produced significant progress in various tasks in image. We believe that using deep 3D CNNs together with Kinetics will retrace the successful history of 2D CNNs and ImageNet, and stimulate advances in computer vision for videos. The codes and pretrained models used in this study are publicly available. https://github.com/kenshohara/3D-ResNets-PyTorch

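Motivation and Objectives

Assess whether current video datasets are large enough to train deep 3D CNNs from scratch.
Determine the depth at which the performance of 3D CNNs trained on Kinetics saturates.
Evaluate transfer learning by fine-tuning Kinetics-pretrained 3D CNNs on UCF-101 and HMDB-51.
Compare deep 3D architectures (ResNet variants, WRN, ResNeXt, DenseNet) on Kinetics and on downstream datasets.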

Proposed Method

Design and train a range of ResNet-based 3D convolutional architectures (ResNet-18, -34, -50, -101, -152, -200, plus pre-activation ResNet, WRN, ResNeXt, and DenseNet).
Train from scratch on UCF-101, HMDB-51, ActivityNet, and Kinetics, and analyze overfitting through the training and validation losses.
Vary network depth on Kinetics (up to 200 layers) to identify the best-performing depth.
Fine-tune Kinetics-pretrained 3D CNNs on UCF-101 and HMDB-51, updating only conv5_x and the FC layer.
Compare against state-of-the-art methods (C3D, P3D, two-stream I3D, ST Multiplier Net, TSN).
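
The fine-tuning step above (updating only conv5_x and the FC layer of a Kinetics-pretrained network) amounts to a parameter-selection rule. The following is a minimal, framework-free sketch of that rule, not the authors' code. The function `split_trainable`, the example parameter names, and the `layer4.` / `fc.` prefixes are assumptions for illustration; in torchvision-style ResNets, conv5_x is commonly exposed as `layer4`.

```python
from typing import Dict, Iterable, List, Tuple


def split_trainable(
    param_names: Iterable[str],
    trainable_prefixes: Tuple[str, ...] = ("layer4.", "fc."),
) -> Dict[str, List[str]]:
    """Partition parameter names into fine-tuned and frozen groups.

    A parameter is fine-tuned only if its name starts with one of the
    given prefixes; everything else keeps its pretrained value.
    """
    groups: Dict[str, List[str]] = {"finetune": [], "frozen": []}
    for name in param_names:
        # str.startswith accepts a tuple and matches any of its entries.
        key = "finetune" if name.startswith(trainable_prefixes) else "frozen"
        groups[key].append(name)
    return groups


# Hypothetical parameter names for a small 3D ResNet (illustrative only).
example_names = [
    "conv1.weight",
    "layer1.0.conv1.weight",
    "layer2.0.conv1.weight",
    "layer3.0.conv1.weight",
    "layer4.0.conv1.weight",
    "layer4.0.bn1.weight",
    "fc.weight",
    "fc.bias",
]

groups = split_trainable(example_names)
print("fine-tuned:", groups["finetune"])
print("frozen:", groups["frozen"])
```

In a PyTorch model, the same rule would typically be applied by setting `requires_grad = False` on the frozen parameters, or by passing only the fine-tuned parameters to the optimizer.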

Experimental Results

Research Questions

RQ1: Can 3D CNNs trained from scratch on current video datasets reach high accuracy?
RQ2: Does Kinetics support training 3D CNNs as deep as the 2D CNNs trained on ImageNet?
RQ3: Do Kinetics-pretrained 3D CNNs transfer effectively to smaller action datasets such as UCF-101 and HMDB-51?
RQ4: Which 3D architecture (ResNet variants, WRN, ResNeXt, DenseNet) performs best on Kinetics and on downstream tasks?
RQ5: How do deep 3D CNNs compare with ImageNet-pretrained 2D architectures and other baselines on action recognition benchmarks?

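Key Findings

ResNet-18 overfits on UCF-101, HMDB-51, and ActivityNet, but not on Kinetics.
Kinetics supports training 3D CNNs up to 152 layers deep; ResNet-200 shows diminishing returns relative to ResNet-152, suggesting that overfitting sets in at that depth.
3D architectures trained from scratch on Kinetics achieve competitive performance, with ResNeXt-101 (64f) reaching 78.4% average accuracy on the Kinetics test set.
When pretrained on Kinetics and fine-tuned, ResNeXt-101 (64f) reaches 94.5% on UCF-101 and 70.2% on HMDB-51, outperforming several 2D-based and shallower 3D baselines.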
RGB-I3D and two-stream I3D pretrained on Kinetics remain strong baselines; in the cited comparison, RGB-I3D reaches 78.2% average accuracy on the Kinetics test set.
Simple 3D architectures pretrained on Kinetics outperform complex 2D architectures on UCF-101 and HMDB-51, and deeper 3D networks transfer better to the smaller datasets.
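
The "average" accuracy figures quoted for the Kinetics test set are the mean of top-1 and top-5 accuracy, as the paper defines the metric. A minimal sketch of that computation follows. The scores and labels are made-up toy values, the helper names are hypothetical, and the example call uses k=2 instead of 5 only because the toy data has four classes.

```python
from typing import Sequence


def top_k_accuracy(
    scores: Sequence[Sequence[float]], labels: Sequence[int], k: int
) -> float:
    """Fraction of samples whose true label is among the k highest scores."""
    hits = 0
    for row, label in zip(scores, labels):
        # Indices of the k largest scores for this sample.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += int(label in top_k)
    return hits / len(labels)


def averaged_accuracy(
    scores: Sequence[Sequence[float]], labels: Sequence[int], k: int = 5
) -> float:
    """Mean of top-1 and top-k accuracy (k=5 corresponds to the paper's metric)."""
    top1 = top_k_accuracy(scores, labels, 1)
    topk = top_k_accuracy(scores, labels, k)
    return (top1 + topk) / 2


# Toy class scores for 4 clips over 4 classes (made-up values).
scores = [
    [0.70, 0.20, 0.06, 0.04],  # label 0: correct at top-1
    [0.30, 0.50, 0.15, 0.05],  # label 0: missed at top-1, found at top-2
    [0.10, 0.20, 0.30, 0.40],  # label 0: missed at top-1 and top-2
    [0.05, 0.15, 0.60, 0.20],  # label 2: correct at top-1
]
labels = [0, 0, 0, 2]

print(top_k_accuracy(scores, labels, 1))       # 0.5
print(top_k_accuracy(scores, labels, 2))       # 0.75
print(averaged_accuracy(scores, labels, k=2))  # 0.625
```

With top-1 and top-5 values expressed in percent, the same formula yields an averaged accuracy in percent.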

This review was generated by AI and checked by a human editor.