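[Paper Summary] TinySpeech: Attention Condensers for Deep Speech Recognition Neural Networks on Edge Devices
This paper proposes attention condensers: self-contained, stand-alone self-attention modules that learn a condensed embedding of local and cross-channel activation relationships, enabling efficient on-device speech recognition. TinySpeech networks, built from these modules and optimized through machine-driven design exploration, maintain high accuracy on the Google Speech Commands dataset while using as much as 507× fewer parameters, 48× fewer multiply-add operations, and 2028× less weight memory than previous work.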
Advances in deep learning have led to state-of-the-art performance across a multitude of speech recognition tasks. Nevertheless, the widespread deployment of deep neural networks for on-device speech recognition remains a challenge, particularly in edge scenarios where the memory and computing resources are highly constrained (e.g., low-power embedded devices) or where the memory and computing budget dedicated to speech recognition is low (e.g., mobile devices performing numerous tasks besides speech recognition). In this study, we introduce the concept of attention condensers for building low-footprint, highly-efficient deep neural networks for on-device speech recognition on the edge. An attention condenser is a self-attention mechanism that learns and produces a condensed embedding characterizing joint local and cross-channel activation relationships, and performs selective attention accordingly. To illustrate its efficacy, we introduce TinySpeech, low-precision deep neural networks comprising largely of attention condensers tailored for on-device speech recognition using a machine-driven design exploration strategy, with one tailored specifically with microcontroller operation constraints. Experimental results on the Google Speech Commands benchmark dataset for limited-vocabulary speech recognition showed that TinySpeech networks achieved significantly lower architectural complexity (as much as $507\times$ fewer parameters), lower computational complexity (as much as $48\times$ fewer multiply-add operations), and lower storage requirements (as much as $2028\times$ lower weight memory requirements) when compared to previous work. These results not only demonstrate the efficacy of attention condensers for building highly efficient networks for on-device speech recognition, but also illuminate its potential for accelerating deep learning on the edge and empowering TinyML applications.
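Research Motivation and Objectives
- Address the challenges of deploying deep neural networks for on-device speech recognition in resource-constrained edge environments, such as low-power embedded systems and mobile devices with limited memory and compute budgets.
- Overcome the complexity limits of existing convolutional neural network (CNN) architectures by introducing a new attention-based design pattern that reduces reliance on large convolution modules.
- Use a machine-driven design exploration strategy to develop highly efficient, low-precision deep neural networks specialized for limited-vocabulary speech recognition.
- Minimize architectural and computational complexity to enable real-time, privacy-preserving speech recognition on edge devices, with no cloud dependency and without sacrificing accuracy.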
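Proposed Method
- Propose attention condensers as self-contained, stand-alone modules that learn a condensed embedding of joint local and cross-channel activation relationships.
- Design attention condensers to perform selective attention that focuses on activations in the vicinity of strong activations, improving both efficiency and representation quality.
- Integrate attention condensers into deep neural network architectures, using large convolution modules sparsely and attention condensers heavily to lower overall complexity.
- Apply a machine-driven design exploration strategy to optimize the network architecture, hyperparameters, and precision (e.g., quantization) for minimal model size at high accuracy.
- Train and evaluate the TinySpeech networks on the Google Speech Commands benchmark, with a focus on low-precision inference and edge deployment constraints.
- Tailor one variant, TinySpeech-M, specifically for microcontroller operation by imposing strict memory and compute limits during the design stage.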
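The module described in the list above can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the max-pooling condensation, the single pointwise channel mix (`w_embed`) standing in for the learned embedding, the nearest-neighbour expansion, and the `pool` and `scale` parameters are all hypothetical choices made only to show how a condensed embedding can gate the input activations.

```python
import numpy as np


def sigmoid(x):
    """Logistic function, used here to turn the embedding into gates in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def attention_condenser(v, w_embed, pool=2, scale=1.0):
    """Illustrative attention-condenser forward pass (assumed design).

    v        : activations with shape (channels, height, width)
    w_embed  : (channels, channels) matrix; a stand-in for the learned embedding
    pool     : condensation factor (assumed; must divide height and width)
    scale    : scalar applied to the gated output (assumed)
    """
    c, h, w = v.shape
    if h % pool or w % pool:
        raise ValueError("this sketch assumes height and width divisible by pool")

    # Condensation: max-pool each channel so only the strongest local
    # activation in every pool x pool window survives.
    a = v.reshape(c, h // pool, pool, w // pool, pool).max(axis=(2, 4))

    # Embedding: mix channels at each condensed position, so the result
    # depends on both local (pooled) and cross-channel relationships.
    q = np.einsum("oc,chw->ohw", w_embed, a)

    # Expansion: nearest-neighbour upsampling back to the input resolution.
    k = q.repeat(pool, axis=1).repeat(pool, axis=2)

    # Selective attention: gate the original activations element-wise.
    s = sigmoid(k)
    return v * s * scale


rng = np.random.default_rng(0)
v = rng.standard_normal((4, 6, 8))           # toy feature map
w_embed = 0.1 * rng.standard_normal((4, 4))  # toy embedding weights
out = attention_condenser(v, w_embed)
print(out.shape)
```

Because the embedding operates on the pooled map rather than the full-resolution input, its cost shrinks by roughly the square of `pool` in this sketch, which is the intuition behind using such modules in place of larger convolution blocks.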
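Experimental Results
Research Questions
- RQ1: Can attention condensers significantly reduce the architectural and computational complexity of deep neural networks for on-device speech recognition without hurting accuracy?
- RQ2: Compared with conventional CNN-based architectures, how do attention condensers affect parameter count, multiply-add operations, and memory use on limited-vocabulary speech recognition tasks?
- RQ3: To what extent can a machine-driven design exploration strategy optimize low-precision neural networks for deployment on microcontroller-class edge devices?
- RQ4: Can attention condensers deliver high-accuracy speech recognition in extremely resource-constrained environments such as microcontrollers?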
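Key Findings
- TinySpeech networks use as much as 507× fewer parameters than prior state-of-the-art models such as trad-fpool13.
- The proposed networks require as much as 48× fewer multiply-add operations, substantially lowering computational cost.
- Weight memory requirements drop by as much as 2028×, making deployment on ultra-low-memory devices feasible.
- TinySpeech-M achieves 1.4% higher accuracy than trad-fpool13 with roughly 291× fewer parameters and roughly 1164× lower weight memory.
- The same model requires more than 28.4× fewer multiply-add operations than trad-fpool13, demonstrating its substantial computational efficiency.
- The results confirm that attention condensers offer a strong trade-off among accuracy, model size, and inference cost, making them well suited to TinyML applications.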
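The parameter and weight-memory factors above are linked by simple arithmetic, worked through in the short sketch below. It assumes the baseline stores 32-bit weights (`BASELINE_BITS`, an assumption not stated in this summary) and that each pair of "as much as" figures describes the same network; under those assumptions the ratio of the two factors gives the implied weight precision. The function and dictionary names are hypothetical.

```python
# Reduction factors quoted in the findings (relative to trad-fpool13).
PARAM_REDUCTION = {"largest reduction": 507, "TinySpeech-M": 291}
MEMORY_REDUCTION = {"largest reduction": 2028, "TinySpeech-M": 1164}

BASELINE_BITS = 32  # assumption: the baseline uses 32-bit floating-point weights


def implied_weight_bits(param_factor, memory_factor, baseline_bits=BASELINE_BITS):
    """Bits per weight implied by a parameter factor and a memory factor.

    memory_factor = param_factor * (baseline_bits / compact_bits), so
    compact_bits  = baseline_bits * param_factor / memory_factor.
    """
    return baseline_bits * param_factor / memory_factor


for name in PARAM_REDUCTION:
    bits = implied_weight_bits(PARAM_REDUCTION[name], MEMORY_REDUCTION[name])
    print(f"{name}: about {bits:.0f} bits per weight")
```

In both cases the memory factor is exactly four times the parameter factor, which is what an 8-bit weight representation would produce against a 32-bit baseline and fits the summary's description of the networks as low-precision.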
This summary was generated by AI and reviewed by a human editor.