QUICK REVIEW

[Paper Review] Temporal 3D ConvNets: New Architecture and Transfer Learning for Video Classification

Ali Diba, Mohsen Fayyaz | arXiv (Cornell University) | Nov 22, 2017

Human Pose and Action Recognition | References: 30 | Citations: 187

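One-Sentence Summary

This paper introduces Temporal 3D ConvNets (T3D) and the Temporal Transition Layer (TTL) to capture multi-scale temporal dynamics, extends DenseNet to DenseNet3D, and uses 2D-to-3D supervision transfer to obtain a stable weight initialization and better performance when data is limited. It achieves state-of-the-art results on HMDB51 and UCF101 and competitive results on Kinetics.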

ABSTRACT

The work in this paper is driven by the question how to exploit the temporal cues available in videos for their accurate classification, and for human action recognition in particular? Thus far, the vision community has focused on spatio-temporal approaches with fixed temporal convolution kernel depths. We introduce a new temporal layer that models variable temporal convolution kernel depths. We embed this new temporal layer in our proposed 3D CNN. We extend the DenseNet architecture - which normally is 2D - with 3D filters and pooling kernels. We name our proposed video convolutional network 'Temporal 3D ConvNet' (T3D) and its new temporal layer 'Temporal Transition Layer' (TTL). Our experiments show that T3D outperforms the current state-of-the-art methods on the HMDB51, UCF101 and Kinetics datasets. The other issue in training 3D ConvNets is about training them from scratch with a huge labeled dataset to get a reasonable performance. So the knowledge learned in 2D ConvNets is completely ignored. Another contribution in this work is a simple and effective technique to transfer knowledge from a pre-trained 2D CNN to a randomly initialized 3D CNN for a stable weight initialization. This allows us to significantly reduce the number of training samples for 3D CNNs. Thus, by finetuning this network, we beat the performance of generic and recent methods in 3D CNNs, which were trained on large video datasets, e.g. Sports-1M, and finetuned on the target datasets, e.g. HMDB51/UCF101. The T3D codes will be released.

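Motivation and Objectives

Exploit the temporal information in videos to improve action recognition.
Develop an architecture that models variable temporal depths within a 3D CNN.
Extend DenseNet to DenseNet3D using a new TTL that captures short-, mid-, and long-term dynamics.
Introduce a cross-architecture transfer learning method from a pre-trained 2D CNN to a randomly initialized 3D CNN to ease the training burden.
Demonstrate performance and transferability on HMDB51, UCF101, and Kinetics.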

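Proposed Method

Introduce the Temporal Transition Layer (TTL), which concatenates features from multiple temporal depths within a 3D convolutional framework.
Extend DenseNet to DenseNet3D by using 3D filters and pooling kernels throughout the densely connected blocks.
Embed TTL in DenseNet3D to form Temporal 3D ConvNets (T3D), which learn short-, mid-, and long-term temporal dynamics.
Propose supervision transfer from a 2D CNN pre-trained on images (ImageNet) to a randomly initialized 3D CNN via an image-video correspondence task.
Train T3D from scratch on Kinetics and fine-tune it on the target datasets (UCF101, HMDB51); compare against other 3D CNNs that use RGB input only.
Show that the 2D→3D transfer strategy provides a stable weight initialization and data-efficient training on small datasets.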
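
The TTL idea listed above (parallel temporal convolutions of different depths whose outputs are concatenated) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: spatial dimensions are collapsed so that each frame is a plain feature vector, the learned 3D kernels are replaced by uniform averaging weights, and the depths `(1, 3, 5)` as well as the names `temporal_conv` and `temporal_transition` are assumptions chosen for illustration.

```python
def temporal_conv(frames, depth):
    """Temporal box filter of the given odd depth with zero padding.

    frames: list of T feature vectors (each a list of C floats).
    A learned kernel would replace the uniform 1/depth weights used here.
    """
    if depth % 2 == 0:
        raise ValueError("this sketch only supports odd temporal depths")
    T, C = len(frames), len(frames[0])
    half = depth // 2
    out = []
    for t in range(T):
        vec = [0.0] * C
        for k in range(-half, half + 1):
            if 0 <= t + k < T:  # frames outside the clip count as zeros
                for c in range(C):
                    vec[c] += frames[t + k][c] / depth
        out.append(vec)
    return out


def temporal_transition(frames, depths=(1, 3, 5)):
    """Run one branch per temporal depth and concatenate along channels."""
    branches = [temporal_conv(frames, d) for d in depths]
    return [sum((b[t] for b in branches), []) for t in range(len(frames))]


# Usage: 4 time steps with 2 channels each, three branches.
clip = [[1.0, 0.0], [2.0, 1.0], [3.0, 0.0], [4.0, 1.0]]
features = temporal_transition(clip)
print(len(features), len(features[0]))  # 4 time steps, 6 channels
```

Each branch keeps the temporal length unchanged, so the branch outputs can be concatenated per time step; the channel count grows with the number of depths. A real layer would typically reduce the resolution afterwards, for example with pooling.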

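Experimental Results

Research Questions

RQ1: Can a 3D CNN capture long-range temporal information without a fixed kernel depth?
RQ2: Does TTL, with kernels of variable temporal depth, improve action recognition over fixed-depth 3D convolutions?
RQ3: Can the knowledge learned by 2D CNNs be transferred to 3D CNNs to reduce the need for large labeled video datasets?
RQ4: How does T3D compare with state-of-the-art 3D ConvNets on HMDB51, UCF101, and Kinetics?
RQ5: Which input settings (frame rate, resolution) best support 3D video architectures?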
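
RQ5 concerns input settings such as the temporal sampling rate. As a minimal sketch of what a sampling stride means for a clip, the hypothetical helper `sample_clip_indices` below picks every `stride`-th frame. The clip length and video length in the usage lines are illustrative values, not taken from the paper.

```python
def sample_clip_indices(num_frames, clip_len, stride, start=0):
    """Return the indices of `clip_len` frames taken every `stride` frames."""
    if clip_len < 1 or stride < 1 or start < 0:
        raise ValueError("clip_len and stride must be >= 1, start must be >= 0")
    last = start + (clip_len - 1) * stride
    if last >= num_frames:
        raise ValueError("video too short for this clip length and stride")
    return list(range(start, last + 1, stride))


# Usage: the same number of frames covers twice the time span at stride 2.
dense = sample_clip_indices(num_frames=100, clip_len=8, stride=1)
sparse = sample_clip_indices(num_frames=100, clip_len=8, stride=2)
print(dense[-1] - dense[0], sparse[-1] - sparse[0])  # 7 14
```

A larger stride widens the temporal window seen by the network at a fixed input size, at the cost of skipping intermediate frames.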

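Key Findings

T3D with TTL outperforms state-of-the-art 3D ConvNets on HMDB51 and UCF101 and is competitive on Kinetics.
A 2D pre-trained CNN acts as a teacher, giving a randomly initialized 3D CNN a stable initialization; this enables effective transfer learning and allows training without a large video dataset.
When trained from scratch on UCF101, T3D with TTL achieves higher accuracy than DenseNet3D and other 3D architectures.
Frame resolution and sampling rate strongly affect performance; 224x224 frames with a stride of 2 give better results than smaller frames or larger strides.
Transfer learning (2D→3D) improves performance on UCF101 and HMDB51, matching or exceeding models trained on large video datasets and fine-tuned on the target.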
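
The 2D→3D transfer relies on an image-video correspondence task. The sketch below shows one way the training pairs for such a task could be formed; it is an illustration under stated assumptions, not the authors' pipeline. The function name `make_correspondence_pairs`, the 50/50 positive/negative ratio, and the rule that a pair is positive when frame and clip come from the same video are all assumptions made here.

```python
import random


def make_correspondence_pairs(video_ids, n_pairs, seed=0):
    """Build (frame_video, clip_video, label) triples.

    label is 1 when the 2D frame and the 3D clip come from the same video
    and 0 otherwise. Positives and negatives are drawn with equal
    probability (an assumption of this sketch).
    """
    if len(set(video_ids)) < 2:
        raise ValueError("need at least two distinct videos to form negatives")
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        clip_video = rng.choice(video_ids)
        if rng.random() < 0.5:
            frame_video = clip_video  # matching pair
        else:
            others = [v for v in video_ids if v != clip_video]
            frame_video = rng.choice(others)  # mismatched pair
        label = int(frame_video == clip_video)
        pairs.append((frame_video, clip_video, label))
    return pairs


# Usage with hypothetical video identifiers.
pairs = make_correspondence_pairs(["v1", "v2", "v3"], n_pairs=6)
for frame_video, clip_video, label in pairs:
    print(frame_video, clip_video, label)
```

One plausible training loop would embed the frames with the pre-trained 2D network, embed the clips with the 3D network, and train a binary classifier on the combined features to predict the label while updating only the 3D network. The review does not spell out these details, so treat them as assumptions. No action labels are needed to build the pairs, which is consistent with the reduced need for labeled video data noted above.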


This review was generated by AI and checked by a human editor.