QUICK REVIEW

[Paper Review] Temporal Convolution for Real-time Keyword Spotting on Mobile Devices

Seungwoo Choi, Seokjun Seo|arXiv (Cornell University)|Apr 8, 2019

Speech Recognition and Synthesis|References: 20|Cited by: 42

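One-Sentence Summary

The paper proposes TC-ResNet, a convolutional neural network built on temporal convolutions for real-time keyword spotting on mobile devices. On the Google Speech Commands dataset it reaches high accuracy while running up to 385x faster than a prior baseline, and the authors release the full code for training and benchmarking.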

ABSTRACT

Keyword spotting (KWS) plays a critical role in enabling speech-based user interactions on smart devices. Recent developments in the field of deep learning have led to wide adoption of convolutional neural networks (CNNs) in KWS systems due to their exceptional accuracy and robustness. The main challenge faced by KWS systems is the trade-off between high accuracy and low latency. Unfortunately, there has been little quantitative analysis of the actual latency of KWS models on mobile devices. This is especially concerning since conventional convolution-based KWS approaches are known to require a large number of operations to attain an adequate level of performance. In this paper, we propose a temporal convolution for real-time KWS on mobile devices. Unlike most of the 2D convolution-based KWS approaches that require a deep architecture to fully capture both low- and high-frequency domains, we exploit temporal convolutions with a compact ResNet architecture. In Google Speech Command Dataset, we achieve more than 385x speedup on Google Pixel 1 and surpass the accuracy compared to the state-of-the-art model. In addition, we release the implementation of the proposed and the baseline models including an end-to-end pipeline for training models and evaluating them on mobile devices.

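Motivation and Objectives

Enable real-time keyword spotting on mobile devices with models that are both accurate and low-latency.
Propose a temporal convolution architecture (TC-ResNet) that reduces computation while maintaining or improving accuracy.
Demonstrate a substantial real-world speedup over 2D convolution baselines on mobile hardware.
Provide a complete end-to-end pipeline, with public code, for training, evaluation, and on-device benchmarking.
Quantify how temporal convolution affects latency and accuracy relative to conventional 2D convolution.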

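Proposed Method

Treat the MFCC features as a one-dimensional time series by reshaping the input to t x 1 x f (time steps x 1 x MFCC coefficients, with the coefficients acting as channels), then apply temporal convolutions.
Adopt a ResNet-based backbone (TC-ResNet) with m x 1 kernels (m=3 in the first layer, m=9 in all other layers); the convolutions have no bias, and batch normalization uses trainable scale and shift parameters.
Add residual connections, with dimension-matching shortcut connections; apply a width multiplier to produce variants of TC-ResNet8 and TC-ResNet14 (for example, TC-ResNet8-1.5).
Train and evaluate on the Google Speech Commands dataset with standard augmentation (noise, random shift) and MFCC features (40 MFCCs, 30 ms window, 10 ms stride).
Benchmark on a Google Pixel 1 to measure actual inference time, and report FLOPs, parameter count, and latency alongside accuracy.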
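To see why the t x 1 x f layout reduces computation, the sketch below counts multiply-accumulates (MACs) and output activations for a single first-layer convolution under both layouts. It is only a back-of-the-envelope illustration: the 1-second clip length, the 160 output channels, and the helper names `num_frames` and `conv_macs` are assumptions made for this example and do not come from the summary.

```python
# Illustrative assumptions (not stated in the summary): a 1-second clip
# and 160 output channels for both layouts.
CLIP_MS, WINDOW_MS, STRIDE_MS = 1000, 30, 10
N_MFCC = 40
OUT_CHANNELS = 160  # hypothetical


def num_frames(clip_ms, window_ms, stride_ms):
    """Number of frames from a sliding window with no padding."""
    return (clip_ms - window_ms) // stride_ms + 1


def conv_macs(out_h, out_w, k_h, k_w, c_in, c_out):
    """MACs of a stride-1 convolution whose output is out_h x out_w."""
    return out_h * out_w * k_h * k_w * c_in * c_out


t = num_frames(CLIP_MS, WINDOW_MS, STRIDE_MS)  # time steps
f = N_MFCC                                     # MFCC coefficients

# 2D layout: a t x f "image" with 1 channel, 3 x 3 kernel, 'same' padding.
macs_2d = conv_macs(t, f, 3, 3, 1, OUT_CHANNELS)
activations_2d = t * f * OUT_CHANNELS

# Temporal layout: t x 1 with f channels, 3 x 1 kernel, 'same' padding.
macs_tc = conv_macs(t, 1, 3, 1, f, OUT_CHANNELS)
activations_tc = t * 1 * OUT_CHANNELS

print(f"frames t                 = {t}")
print(f"2D conv MACs             = {macs_2d:,}")
print(f"temporal conv MACs       = {macs_tc:,}")
print(f"MAC ratio (2D/temporal)  = {macs_2d / macs_tc:.1f}")
print(f"output size ratio        = {activations_2d // activations_tc}")
```

Under these assumptions the temporal layer needs about a third of the MACs, and its output is f = 40 times smaller, so every later layer also processes far less data. The temporal kernel also spans all 40 coefficients from the first layer onward, whereas a 3 x 3 kernel covers only three adjacent coefficients at a time.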

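Experimental Results

Research Questions

RQ1: Can temporal convolution reduce the computation and latency of on-device keyword spotting without sacrificing accuracy?
RQ2: How does TC-ResNet compare with 2D convolution baselines in accuracy, FLOPs, parameter count, and actual on-device inference time?
RQ3: How do the width multiplier and network depth affect the accuracy-latency trade-off for keyword spotting on mobile devices?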

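Key Findings

TC-ResNet8 reaches 96.1% accuracy with 1.1 ms inference time on the Pixel 1, 3.0M FLOPs, and 66K parameters.
TC-ResNet8-1.5 reaches 96.2% accuracy with 2.8 ms, 6.6M FLOPs, and 145K parameters.
TC-ResNet14 reaches 96.2% accuracy with 2.5 ms, 6.1M FLOPs, and 137K parameters.
TC-ResNet14-1.5 reaches 96.6% accuracy with 5.7 ms, 13.4M FLOPs, and 305K parameters.
Compared to CNN-1, TC-ResNet8 is 29x faster and 5.4 percentage points more accurate.
Compared to DS-CNN-S/M/L, TC-ResNet8 is 1.5x/4.7x/15.3x faster and 1.7/1.2/0.7 percentage points more accurate, respectively.
TC-ResNet8 is 385x faster than the Res15 baseline and 0.3 percentage points more accurate, which underscores the effectiveness of temporal convolution.
A parameter-matched 2D convolution variant (2D-ResNet8) is 9.2x slower than TC-ResNet8; a pooling variant (2D-ResNet8-Pool) is faster but loses 1.2 percentage points of accuracy and is still 3.2x slower than TC-ResNet8.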
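The figures above can be cross-checked with a little arithmetic. The sketch below does two things using only numbers quoted in this list: it multiplies each reported speedup by the 1.1 ms TC-ResNet8 latency to estimate the baseline latencies, and it compares the FLOP and parameter growth of the 1.5x-wide variants against k^2 = 2.25. The k^2 reference rests on an assumption, namely that most of the cost sits in convolutions whose input and output channel counts both scale with the width multiplier k. All results are approximate, because the inputs are already rounded.

```python
# Figures copied from the key findings list.
TC_RESNET8_MS = 1.1  # TC-ResNet8 latency on the Pixel 1, in milliseconds

# Reported speedup of TC-ResNet8 over each baseline.
speedup_over = {
    "CNN-1": 29,
    "DS-CNN-S": 1.5,
    "DS-CNN-M": 4.7,
    "DS-CNN-L": 15.3,
    "Res15": 385,
    "2D-ResNet8": 9.2,
    "2D-ResNet8-Pool": 3.2,
}

# Implied baseline latency = speedup x TC-ResNet8 latency (an estimate only).
implied_ms = {name: s * TC_RESNET8_MS for name, s in speedup_over.items()}

# (FLOPs in millions, parameters in thousands) per TC-ResNet variant.
reported = {
    "TC-ResNet8": (3.0, 66),
    "TC-ResNet8-1.5": (6.6, 145),
    "TC-ResNet14": (6.1, 137),
    "TC-ResNet14-1.5": (13.4, 305),
}

WIDTH = 1.5  # width multiplier k of the "-1.5" variants


def growth(base, wide):
    """Return (FLOPs ratio, parameter ratio) of the wide model over the base."""
    base_flops, base_params = reported[base]
    wide_flops, wide_params = reported[wide]
    return wide_flops / base_flops, wide_params / base_params


for name, ms in implied_ms.items():
    print(f"{name:16s} ~{ms:6.1f} ms")

print(f"k^2 reference = {WIDTH ** 2:.2f}")
for base in ("TC-ResNet8", "TC-ResNet14"):
    flops_ratio, params_ratio = growth(base, f"{base}-1.5")
    print(f"{base:12s} FLOPs x{flops_ratio:.2f}, params x{params_ratio:.2f}")
```

By this estimate Res15 takes roughly 420 ms per inference, far beyond any real-time budget. The 1.5x-wide models grow by about 2.2x in both FLOPs and parameters, slightly under 2.25. A plausible reason is that some layers scale only linearly with k, such as a first layer whose input is fixed at 40 MFCC channels.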


This review was generated by AI and reviewed by a human editor.