QUICK REVIEW

[Paper Review] Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet?

Kensho Hara, Hirokatsu Kataoka|arXiv (Cornell University)|Nov 27, 2017

Human Pose and Action Recognition | 8 references | 114 citations

TL;DR

The paper investigates whether large-scale video data (Kinetics) enables training very deep 3D CNNs from scratch and whether such models outperform 2D CNNs pretrained on ImageNet on action recognition benchmarks. It finds that Kinetics supports deep 3D ResNets up to 152 layers, and that Kinetics-pretrained 3D models, especially ResNeXt-101, outperform several 2D baselines on UCF-101 and HMDB-51.

ABSTRACT

The purpose of this study is to determine whether current video datasets have sufficient data for training very deep convolutional neural networks (CNNs) with spatio-temporal three-dimensional (3D) kernels. Recently, the performance levels of 3D CNNs in the field of action recognition have improved significantly. However, to date, conventional research has only explored relatively shallow 3D architectures. We examine the architectures of various 3D CNNs from relatively shallow to very deep ones on current video datasets. Based on the results of those experiments, the following conclusions could be obtained: (i) ResNet-18 training resulted in significant overfitting for UCF-101, HMDB-51, and ActivityNet but not for Kinetics. (ii) The Kinetics dataset has sufficient data for training of deep 3D CNNs, and enables training of up to 152 ResNets layers, interestingly similar to 2D ResNets on ImageNet. ResNeXt-101 achieved 78.4% average accuracy on the Kinetics test set. (iii) Kinetics pretrained simple 3D architectures outperforms complex 2D architectures, and the pretrained ResNeXt-101 achieved 94.5% and 70.2% on UCF-101 and HMDB-51, respectively. The use of 2D CNNs trained on ImageNet has produced significant progress in various tasks in image. We believe that using deep 3D CNNs together with Kinetics will retrace the successful history of 2D CNNs and ImageNet, and stimulate advances in computer vision for videos. The codes and pretrained models used in this study are publicly available. https://github.com/kenshohara/3D-ResNets-PyTorch

Motivation & Objective

Assess whether current video datasets suffice for training deep 3D CNNs from scratch.
Determine the depth at which the accuracy of 3D CNNs trained on Kinetics saturates.
Evaluate transfer learning: Kinetics-pretrained 3D CNNs fine-tuned on UCF-101 and HMDB-51.
Compare deep 3D architectures (ResNet variants, WRN, ResNeXt, DenseNet) on Kinetics and downstream datasets.
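One reason data sufficiency is a concern for 3D CNNs is that a 3D kernel carries more weights than its 2D counterpart. The sketch below counts convolution weights for a 2D and a 3D layer of the same width. The 3x3 and 3x3x3 kernel sizes and the 64-channel width are assumptions chosen for illustration, not values taken from this review.

```python
def conv_weight_count(in_channels, out_channels, kernel_size):
    """Number of weights in one convolution layer (bias terms excluded)."""
    count = in_channels * out_channels
    for k in kernel_size:
        count *= k
    return count


# Hypothetical layer width, for illustration only.
channels = 64

weights_2d = conv_weight_count(channels, channels, (3, 3))     # spatial kernel
weights_3d = conv_weight_count(channels, channels, (3, 3, 3))  # spatio-temporal kernel

print("2D conv weights:", weights_2d)
print("3D conv weights:", weights_3d)
print("ratio 3D / 2D:", weights_3d / weights_2d)
```

Under these assumptions the 3D layer has three times as many weights as the 2D layer, which is consistent with the expectation that deep 3D models overfit small video datasets more easily.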

Proposed method

Design and train a range of 3D ResNet-based architectures (ResNet-18, -34, -50, -101, -152, -200; including pre-activation, WRN, ResNeXt, DenseNet) with 3D convolutions.
Train from scratch on UCF-101, HMDB-51, ActivityNet, and Kinetics; analyze overfitting via train/validation losses.
Vary network depth on Kinetics to identify optimal depth (up to 200 layers).
Fine-tune Kinetics-pretrained 3D CNNs on UCF-101 and HMDB-51, updating only conv5_x and the FC layer.
Compare with state-of-the-art methods (C3D, P3D, two-stream I3D, ST Multiplier Net, TSN).
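The fine-tuning step above (conv5_x and the FC layer) can be sketched as a split of parameter names into a trainable group and a frozen group. This is a minimal illustration in plain Python, not the authors' code. The prefixes "layer4." and "fc." are an assumed naming scheme in which "layer4" stands for conv5_x, and the parameter names in the example are hypothetical.

```python
def select_trainable(param_names, trainable_prefixes=("layer4.", "fc.")):
    """Split parameter names into those to update and those to freeze.

    The default prefixes are an assumption: "layer4." stands in for
    conv5_x and "fc." for the final fully connected layer.
    """
    prefixes = tuple(trainable_prefixes)
    trainable, frozen = [], []
    for name in param_names:
        if name.startswith(prefixes):
            trainable.append(name)
        else:
            frozen.append(name)
    return trainable, frozen


# Hypothetical parameter names, for illustration only.
example_names = [
    "conv1.weight",
    "layer1.0.conv1.weight",
    "layer3.1.bn2.bias",
    "layer4.0.conv1.weight",
    "layer4.2.bn3.weight",
    "fc.weight",
    "fc.bias",
]

trainable, frozen = select_trainable(example_names)
print("update:", trainable)
print("freeze:", frozen)
```

In a deep learning framework, the frozen group would typically be excluded from the optimizer or have gradient computation switched off, so that only the last residual stage and the classifier adapt to the smaller dataset.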

Experimental results

Research questions

RQ1: Can 3D CNNs be trained from scratch to high accuracy on current video datasets?
RQ2: Does Kinetics support training of very deep 3D CNNs comparable to depth in 2D CNNs on ImageNet?
RQ3: Do Kinetics-pretrained 3D CNNs transfer effectively to smaller action datasets like UCF-101 and HMDB-51?
RQ4: Which 3D architectures (ResNet variants, WRN, ResNeXt, DenseNet) yield the best performance for 3D CNNs on Kinetics and downstream tasks?
RQ5: How do deep 3D CNNs compare to 2D architectures pre-trained on ImageNet or other baselines on action recognition benchmarks?

Key findings

ResNet-18 overfits on UCF-101, HMDB-51, and ActivityNet but not on Kinetics.
Kinetics supports training 3D ResNets up to 152 layers deep; ResNet-200 brings little or no gain over ResNet-152, which suggests that accuracy saturates, and overfitting may begin, beyond that depth.
On Kinetics, 3D architectures trained from scratch achieve competitive performance, with ResNeXt-101 (64f) reaching 78.4% average accuracy on the Kinetics test set.
ResNeXt-101 (64f) achieves 94.5% on UCF-101 and 70.2% on HMDB-51 when pretrained on Kinetics and fine-tuned, outperforming several 2D-based or shallower 3D baselines.
RGB-I3D and two-stream I3D pretrained on Kinetics remain strong baselines. The 78.2% average on the Kinetics test set cited for comparison appears to belong to RGB-I3D rather than two-stream I3D; ResNeXt-101 (64f) narrowly exceeds it at 78.4%.
Kinetics-pretrained simple 3D architectures outperform complex 2D architectures on UCF-101 and HMDB-51; deeper 3D networks benefit transfer learning on smaller datasets.
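The "average" figures quoted for Kinetics are most likely the mean of top-1 and top-5 accuracy, which is the usual convention for that benchmark; this is an assumption, since the review does not define the metric. The sketch below computes that averaged accuracy on toy scores. All scores and labels are made up for illustration and have no relation to the paper's results.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest scores."""
    hits = 0
    for sample_scores, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(
            range(len(sample_scores)),
            key=lambda c: sample_scores[c],
            reverse=True,
        )
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


def averaged_accuracy(scores, labels):
    """Mean of top-1 and top-5 accuracy (assumed meaning of 'average')."""
    top1 = top_k_accuracy(scores, labels, 1)
    top5 = top_k_accuracy(scores, labels, 5)
    return (top1 + top5) / 2


# Toy data: 4 samples, 6 classes. Values are invented for illustration.
toy_scores = [
    [0.90, 0.02, 0.02, 0.02, 0.02, 0.02],
    [0.10, 0.50, 0.20, 0.10, 0.05, 0.05],
    [0.30, 0.25, 0.20, 0.15, 0.07, 0.03],
    [0.05, 0.05, 0.10, 0.10, 0.20, 0.50],
]
toy_labels = [0, 2, 5, 5]

print("top-1:", top_k_accuracy(toy_scores, toy_labels, 1))
print("top-5:", top_k_accuracy(toy_scores, toy_labels, 5))
print("average:", averaged_accuracy(toy_scores, toy_labels))
```

Averaging the two metrics rewards models that rank the correct class near the top even when the single best guess is wrong, which matters on a dataset with hundreds of visually similar action classes.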


This review was created by AI and reviewed by human editors.