QUICK REVIEW

[Paper Review] LipNet: Sentence-level Lipreading.

Yannis Assael, Brendan Shillingford|arXiv (Cornell University)|Nov 5, 2016

Speech and Audio Processing | References: 39 | Cited by: 112

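One-sentence summary

LipNet is the first end-to-end, speaker-independent deep learning model for sentence-level lipreading. It combines spatiotemporal convolutions, a long short-term memory (LSTM) network, and the connectionist temporal classification (CTC) loss to map video sequences directly to text. On the GRID corpus it achieves 93.4% accuracy, outperforming experienced human lipreaders and the previous state of the art.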

ABSTRACT

Lipreading is the task of decoding text from the movement of a speaker's mouth. Traditional approaches separated the problem into two stages: designing or learning visual features, and prediction. More recent deep lipreading approaches are end-to-end trainable (Wand et al., 2016; Chung & Zisserman, 2016a). All existing works, however, perform only word classification, not sentence-level sequence prediction. Studies have shown that human lipreading performance increases for longer words (Easton & Basala, 1982), indicating the importance of features capturing temporal context in an ambiguous communication channel. Motivated by this observation, we present LipNet, a model that maps a variable-length sequence of video frames to text, making use of spatiotemporal convolutions, an LSTM recurrent network, and the connectionist temporal classification loss, trained entirely end-to-end. To the best of our knowledge, LipNet is the first lipreading model to operate at sentence-level, using a single end-to-end speaker-independent deep model to simultaneously learn spatiotemporal visual features and a sequence model. On the GRID corpus, LipNet achieves 93.4% accuracy, outperforming experienced human lipreaders and the previous 79.6% state-of-the-art accuracy.

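Research motivation and objectives

Develop an end-to-end deep learning model for sentence-level lipreading, moving beyond earlier word-classification approaches.
Exploit temporal context through recurrent modeling to improve performance on ambiguous visual speech.
Remove the dependence on hand-engineered visual features by learning spatiotemporal representations directly from video frames.
Achieve speaker-independent performance with a single unified architecture trained end-to-end.
Outperform existing methods and experienced human lipreaders on benchmarks such as GRID.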

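Proposed method

LipNet uses a 3D convolutional neural network (3D-CNN) to extract spatiotemporal features from video frames, capturing both the spatial shape of the mouth and its temporal dynamics across frames.
The extracted features are processed by a bidirectional long short-term memory (LSTM) network, which models long-range dependencies in the sequence of visual features.
The model is trained with the connectionist temporal classification (CTC) loss, which aligns variable-length video input with the transcribed text sequence without explicit frame-level annotation.
The entire architecture is trained end-to-end on raw video frames, jointly learning visual representations and sequence prediction.
Optimization uses stochastic gradient descent with backpropagation through time, so spatial and temporal patterns are learned together.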
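The CTC step above is easier to follow with a small example. The sketch below shows only the decoding side: per-frame predictions are collapsed into text by merging repeated symbols and then dropping a blank symbol, which is why no frame-level alignment is needed. It assumes greedy (best-path) decoding and a hypothetical blank symbol `_`; the function names and the toy scores are illustrative and are not taken from the paper, which may decode differently.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol, chosen for this illustration


def ctc_collapse(frame_labels, blank=BLANK):
    """Collapse per-frame labels: merge consecutive repeats, then drop blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in merged if label != blank)


def greedy_decode(frame_scores, alphabet, blank=BLANK):
    """Pick the highest-scoring symbol in each frame, then collapse the result."""
    best = [
        alphabet[max(range(len(row)), key=row.__getitem__)]
        for row in frame_scores
    ]
    return ctc_collapse(best, blank)


if __name__ == "__main__":
    # Eight frames can map to a three-letter word.
    print(ctc_collapse("bb_ii_nn"))  # bin
    # A blank between identical letters keeps both copies.
    print(ctc_collapse("s_o_o_n"))   # soon
    print(ctc_collapse("soo_n"))     # son

    # Toy per-frame scores over a three-symbol alphabet (made-up numbers).
    alphabet = ["_", "a", "b"]
    scores = [
        [0.1, 0.8, 0.1],
        [0.2, 0.7, 0.1],
        [0.9, 0.05, 0.05],
        [0.1, 0.1, 0.8],
    ]
    print(greedy_decode(scores, alphabet))  # ab
```

The blank symbol is what lets the output contain doubled letters: without it, two adjacent frames predicting `o` would be merged into a single `o`.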

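Experimental results

Research questions

RQ1: Can an end-to-end deep learning model outperform earlier word-level models on sentence-level lipreading?
RQ2: Does modeling long-range temporal context with an RNN improve recognition of ambiguous visual speech?
RQ3: Can a single speaker-independent model outperform experienced human lipreaders on a standard benchmark such as the GRID corpus?
RQ4: Is it possible to drop hand-engineered visual features and learn spatiotemporal representations directly from video frames?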

Key findings

LipNet achieves 93.4% word-level accuracy on the GRID corpus, well above the previous state-of-the-art accuracy of 79.6%.
The model also outperforms experienced human lipreaders evaluated on the same benchmark.
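Spatiotemporal convolutions let the model learn both the spatial shape of the mouth and how it evolves across frames.
The bidirectional LSTM lets the model capture long-range dependencies in the visual sequence, improving its use of context.
End-to-end training with the CTC loss yields a robust alignment between video input and text output without forced alignment.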
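To make the accuracy figures above concrete, the sketch below computes a word-level accuracy as one minus the word error rate, where errors are the minimum number of word substitutions, insertions, and deletions between the reference and the predicted sentence. Treating accuracy as `1 - WER` is an assumption made for this illustration, and the function names and example sentences are hypothetical; they are not data from the paper.

```python
def edit_distance(ref, hyp):
    """Minimum substitutions, insertions, and deletions turning ref into hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference word
                            curr[j - 1] + 1,      # insert a hypothesis word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def word_accuracy(pairs):
    """Return 1 - WER over (reference, hypothesis) sentence pairs."""
    errors = sum(edit_distance(ref.split(), hyp.split()) for ref, hyp in pairs)
    total = sum(len(ref.split()) for ref, _ in pairs)
    return 1.0 - errors / total


if __name__ == "__main__":
    # Made-up sentences in the style of a fixed-grammar command corpus.
    pairs = [
        ("bin blue at f two now", "bin blue at f two now"),
        ("place red in a five soon", "place red in e five soon"),
    ]
    # One wrong word out of twelve reference words.
    print(round(word_accuracy(pairs), 4))  # 0.9167
```

Counting errors at the word level means a single confused letter, such as `a` versus `e`, costs one full word.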


This review was generated by AI and checked by human editors.