QUICK REVIEW

[Paper Review] Learning Spatio-Temporal Features with Two-Stream Deep 3D CNNs for Lipreading

Xinshuo Weng, Kris Kitani|arXiv (Cornell University)|May 4, 2019

Topic: Video Surveillance and Tracking Methods | References: 48 | Citations: 45

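One-Sentence Summary

The paper proposes a two-stream deep 3D CNN lipreading framework (an I3D front-end over grayscale video and optical flow, pre-trained on ImageNet and Kinetics, followed by a Bi-LSTM back-end) that achieves state-of-the-art word-level lipreading on LRW, an absolute improvement of 5.3% over the previous best result.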

ABSTRACT

We focus on the word-level visual lipreading, which requires recognizing the word being spoken, given only the video but not the audio. State-of-the-art methods explore the use of end-to-end neural networks, including a shallow (up to three layers) 3D convolutional neural network (CNN) + a deep 2D CNN (e.g., ResNet) as the front-end to extract visual features, and a recurrent neural network (e.g., bidirectional LSTM) as the back-end for classification. In this work, we propose to replace the shallow 3D CNNs + deep 2D CNNs front-end with recent successful deep 3D CNNs --- two-stream (i.e., grayscale video and optical flow streams) I3D. We evaluate different combinations of front-end and back-end modules with the grayscale video and optical flow inputs on the LRW dataset. The experiments show that, compared to the shallow 3D CNNs + deep 2D CNNs front-end, the deep 3D CNNs front-end with pre-training on the large-scale image and video datasets (e.g., ImageNet and Kinetics) can improve the classification accuracy. Also, we demonstrate that using the optical flow input alone can achieve comparable performance as using the grayscale video as input. Moreover, the two-stream network using both the grayscale video and optical flow inputs can further improve the performance. Overall, our two-stream I3D front-end with a Bi-LSTM back-end results in an absolute improvement of 5.3% over the previous art on the LRW dataset.

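Research Motivation and Objectives

Advance word-level visual lipreading by using a deep 3D CNN front-end with more than three layers.
Study how much pre-training the deep 3D CNN on large-scale datasets (ImageNet and Kinetics) helps lipreading.
Evaluate optical flow as an input, and the two-stream architecture as a whole, for lipreading.
Demonstrate end-to-end trainability and the resulting accuracy gain over earlier two-stage and shallow-front-end methods.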

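Proposed Method

A two-stream I3D front-end (grayscale video and optical flow) learns spatio-temporal features.
2D ImageNet weights are inflated to 3D, and pre-training runs in two stages: initialization from the inflated ImageNet weights, then fine-tuning on Kinetics.
The back-end is a two-layer Bi-LSTM that models temporal dependencies and produces word scores.
The network is trained end to end, with a softmax layer that outputs word probabilities.
Comparisons against a single-stream I3D and a shallow 3D CNN front-end isolate the contributions of depth, pre-training, and two-stream input.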
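
The weight inflation step mentioned above can be illustrated with a small sketch. It assumes the usual I3D recipe, in which a 2D kernel is repeated along the time axis and rescaled by 1/T so that a clip made of one repeated frame produces the same response as the original 2D filter. The review itself does not spell out this detail, so treat the rescaling as an assumption. The names `inflate_kernel`, `response_2d`, and `response_3d` are hypothetical, and the kernel values are made up rather than real ImageNet weights.

```python
from typing import List

Kernel2D = List[List[float]]
Kernel3D = List[Kernel2D]


def inflate_kernel(kernel_2d: Kernel2D, time_dim: int) -> Kernel3D:
    """Repeat a 2D kernel `time_dim` times along time and rescale by 1/time_dim."""
    if time_dim < 1:
        raise ValueError("time_dim must be >= 1")
    scale = 1.0 / time_dim
    return [[[w * scale for w in row] for row in kernel_2d] for _ in range(time_dim)]


def response_2d(patch: Kernel2D, kernel_2d: Kernel2D) -> float:
    """Response of a 2D filter on a single image patch of the same size."""
    return sum(
        p * w
        for p_row, w_row in zip(patch, kernel_2d)
        for p, w in zip(p_row, w_row)
    )


def response_3d(clip: Kernel3D, kernel_3d: Kernel3D) -> float:
    """Response of a 3D filter on a clip patch of the same size (sum over frames)."""
    return sum(response_2d(frame, k) for frame, k in zip(clip, kernel_3d))


# Hypothetical 3x3 filter standing in for a pretrained 2D weight (values invented).
kernel_2d = [[1.0, 0.0, -1.0], [2.0, 0.0, -2.0], [1.0, 0.0, -1.0]]
patch = [[0.5, 0.25, 0.0], [1.0, 0.5, 0.25], [0.75, 0.5, 0.0]]

kernel_3d = inflate_kernel(kernel_2d, time_dim=4)
static_clip = [patch] * 4  # the same frame repeated four times

out_2d = response_2d(patch, kernel_2d)
out_3d = response_3d(static_clip, kernel_3d)
print(out_2d, out_3d)
```

On the static clip the two responses agree, which is the property that lets a 3D network start from what the 2D network already learned about single frames before it is fine-tuned on video.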

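Experimental Results

Research Questions

RQ1: Does a deep 3D CNN front-end outperform a shallow 3D + deep 2D front-end for lipreading?
RQ2: Does two-stage pre-training (ImageNet inflation + Kinetics fine-tuning) improve lipreading accuracy?
RQ3: Is optical flow a viable or complementary input for lipreading, and does the two-stream setup bring an improvement?
RQ4: How does the choice of back-end (Bi-LSTM vs. 1D temporal convolutional network) affect word classification performance?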

Key Findings

Method	Val (%)	Test (%)
Joon Son Chung 2016		61.10
Chung and Zisserman 2018		66.00
Chung et al. 2017		76.20
Themos Stafylakis 2017	78.95	78.77
Ours	84.11	84.07

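The two-stream I3D front-end with a Bi-LSTM back-end reaches 84.07% test accuracy on LRW, 5.3 percentage points above the previous best result.
Two-stage pre-training (ImageNet-inflated 3D weights + Kinetics fine-tuning) is essential for the deep 3D front-end to perform well.
Optical flow alone performs comparably to grayscale video, and combining the two yields a further gain.
On LRW, the deep 3D CNN front-end outperforms the shallow 3D + deep 2D front-end.
A single-stream input (grayscale or optical flow) is already effective, but in the reported experiments the two-stream input improves on both.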
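
The headline margin can be checked directly from the test column of the table above. The snippet below is only arithmetic on those reported numbers. The helper name `absolute_gain` is hypothetical, and the method labels are copied from the table as-is.

```python
# Test accuracies (%) on LRW, copied from the results table in this review.
test_accuracy = {
    "Joon Son Chung 2016": 61.10,
    "Chung and Zisserman 2018": 66.00,
    "Chung et al. 2017": 76.20,
    "Themos Stafylakis 2017": 78.77,
    "Ours": 84.07,
}


def absolute_gain(results, ours="Ours"):
    """Return the strongest baseline and the gap to it in percentage points."""
    baselines = {name: acc for name, acc in results.items() if name != ours}
    best_name = max(baselines, key=baselines.get)
    return best_name, round(results[ours] - baselines[best_name], 2)


best_prior, gain = absolute_gain(test_accuracy)
print(best_prior, gain)
```

The gap is measured in percentage points against the strongest prior test result, which is how the abstract's "absolute improvement of 5.3%" should be read.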

This review was generated by AI and reviewed by a human editor.