QUICK REVIEW

[Paper Review] Learning Spatio-Temporal Features with Two-Stream Deep 3D CNNs for Lipreading

Xinshuo Weng, Kris Kitani|arXiv (Cornell University)|May 4, 2019

Video Surveillance and Tracking Methods|References: 48|Cited by: 45

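One-Sentence Summary

This paper proposes a two-stream deep 3D CNN lipreading framework that takes grayscale video and optical flow as inputs (an I3D front-end pre-trained on ImageNet and Kinetics, paired with a Bi-LSTM back-end) and achieves state-of-the-art word-level lipreading on LRW, an absolute improvement of 5.3 percentage points over the previous best result.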

ABSTRACT

We focus on the word-level visual lipreading, which requires recognizing the word being spoken, given only the video but not the audio. State-of-the-art methods explore the use of end-to-end neural networks, including a shallow (up to three layers) 3D convolutional neural network (CNN) + a deep 2D CNN (e.g., ResNet) as the front-end to extract visual features, and a recurrent neural network (e.g., bidirectional LSTM) as the back-end for classification. In this work, we propose to replace the shallow 3D CNNs + deep 2D CNNs front-end with recent successful deep 3D CNNs --- two-stream (i.e., grayscale video and optical flow streams) I3D. We evaluate different combinations of front-end and back-end modules with the grayscale video and optical flow inputs on the LRW dataset. The experiments show that, compared to the shallow 3D CNNs + deep 2D CNNs front-end, the deep 3D CNNs front-end with pre-training on the large-scale image and video datasets (e.g., ImageNet and Kinetics) can improve the classification accuracy. Also, we demonstrate that using the optical flow input alone can achieve comparable performance as using the grayscale video as input. Moreover, the two-stream network using both the grayscale video and optical flow inputs can further improve the performance. Overall, our two-stream I3D front-end with a Bi-LSTM back-end results in an absolute improvement of 5.3% over the previous art on the LRW dataset.

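Motivation and Objectives

Advance word-level visual lipreading by using a deep 3D CNN front-end.
Examine the benefit of pre-training deep 3D CNNs on large-scale datasets (ImageNet and Kinetics).
Evaluate optical flow as an input and the usefulness of a two-stream architecture.
Demonstrate end-to-end trainability and the performance gain of the end-to-end model over earlier two-stage and shallow-front-end methods.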

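Proposed Method

Learn spatio-temporal features with a two-stream I3D front-end (grayscale video and optical flow).
Inflate 2D ImageNet weights to 3D and pre-train in two stages: ImageNet inflation followed by fine-tuning on Kinetics.
The back-end is a two-layer Bi-LSTM that models temporal dependencies and outputs word scores.
Train end to end with a softmax layer over word probabilities.
Isolate the contribution of the deep 3D front-end by comparing it against single-stream I3D and shallow 3D CNN front-ends.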
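The weight-inflation step can be illustrated with a small sketch. The review only states that 2D ImageNet weights are inflated to 3D; the repeat-and-rescale rule used here (copy the 2D kernel along a new time axis and divide by the temporal depth) is the one commonly associated with I3D and is an assumption about this paper's exact procedure. The function names and the toy 3x3 filter are hypothetical and chosen for illustration only.

```python
def inflate_kernel(kernel_2d, time_depth):
    """Copy a 2D kernel `time_depth` times along a new time axis, scaled by 1/time_depth.

    Assumption: this repeat-and-rescale rule stands in for the paper's inflation step.
    """
    if time_depth < 1:
        raise ValueError("time_depth must be at least 1")
    return [
        [[weight / time_depth for weight in row] for row in kernel_2d]
        for _ in range(time_depth)
    ]


def response_2d(kernel_2d, patch):
    """Dot product of a 2D kernel with an image patch of the same size."""
    return sum(
        w * p
        for k_row, p_row in zip(kernel_2d, patch)
        for w, p in zip(k_row, p_row)
    )


def response_3d(kernel_3d, clip):
    """Dot product of a 3D kernel with a clip (one patch per frame)."""
    return sum(response_2d(k_slice, frame) for k_slice, frame in zip(kernel_3d, clip))


# Toy data (hypothetical): a 3x3 edge-like filter and a 3x3 grayscale patch.
kernel = [[1.0, 0.0, -1.0], [2.0, 0.0, -2.0], [1.0, 0.0, -1.0]]
patch = [[0.2, 0.5, 0.9], [0.1, 0.4, 0.8], [0.0, 0.3, 0.7]]

inflated = inflate_kernel(kernel, time_depth=4)
static_clip = [patch] * 4  # the same frame repeated four times

print(response_2d(kernel, patch))
print(response_3d(inflated, static_clip))
```

Under this rule, a clip made of one repeated frame produces the same filter response as the original 2D filter on that frame (up to floating-point rounding), which is why the inflated network starts from behaviour close to the ImageNet model before it is fine-tuned on video.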

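Experimental Results

Research Questions

RQ1: Does a deep 3D CNN front-end outperform a shallow 3D + deep 2D front-end for lipreading?
RQ2: Does two-stage pre-training (ImageNet inflation + Kinetics fine-tuning) improve lipreading accuracy?
RQ3: Is optical flow a viable or complementary input for lipreading, and does a two-stream setup bring an improvement?
RQ4: How does the choice of back-end (Bi-LSTM versus a 1D temporal convolutional network) affect word classification performance?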

Key Findings

Method	Val (%)	Test (%)
Joon Son Chung 2016		61.10
Chung and Zisserman 2018		66.00
Chung et al. 2017		76.20
Themos Stafylakis 2017	78.95	78.77
Ours	84.11	84.07

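The two-stream I3D front-end with a Bi-LSTM back-end reaches 84.07% test accuracy on LRW, 5.3 percentage points above the previous state of the art.
Two-round pre-training (ImageNet-inflated 3D weights followed by Kinetics fine-tuning) is essential for the deep 3D front-end to perform well.
Optical flow alone performs comparably to grayscale video, and combining the two streams improves accuracy further.
The deep 3D CNN front-end outperforms the shallow 3D + deep 2D front-end on LRW.
Single-stream input (grayscale or flow) is also effective, but two-stream input consistently improves results.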
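As a quick check of the headline number, the absolute gain can be recomputed from the test column of the results table. The snippet below is only an arithmetic sketch: the dictionary copies the table's test accuracies, and the variable names are illustrative.

```python
# Test accuracies (%) on LRW, copied from the results table above.
test_accuracy = {
    "Joon Son Chung 2016": 61.10,
    "Chung and Zisserman 2018": 66.00,
    "Chung et al. 2017": 76.20,
    "Themos Stafylakis 2017": 78.77,
    "Ours": 84.07,
}

ours = test_accuracy["Ours"]

# Best previously reported result, excluding the proposed method.
best_prior_name, best_prior = max(
    ((name, acc) for name, acc in test_accuracy.items() if name != "Ours"),
    key=lambda item: item[1],
)

# Absolute gain in percentage points, rounded to the table's precision.
absolute_gain = round(ours - best_prior, 2)

print(best_prior_name, best_prior)
print(absolute_gain)  # 5.3
```

The gain is measured against the strongest prior entry in the table (78.77%), so the 5.3 figure is a difference in percentage points, not a relative improvement.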


This review was generated by AI and checked by a human editor.