QUICK REVIEW

[Paper Review] DeCoAR 2.0: Deep Contextualized Acoustic Representations with Vector Quantization

Shaoshi Ling, Yuzong Liu|arXiv (Cornell University)|Dec 11, 2020

Speech Recognition and Synthesis | References: 34 | Cited by: 58

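One-Sentence Summary

DeCoAR 2.0 uses a Transformer encoder and a vector quantization layer with a diversity objective to learn deep contextualized acoustic representations for semi-supervised speech recognition, achieving competitive WER relative to baselines when labeled data is limited.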

ABSTRACT

Recent success in speech representation learning enables a new way to leverage unlabeled data to train speech recognition model. In speech representation learning, a large amount of unlabeled data is used in a self-supervised manner to learn a feature representation. Then a smaller amount of labeled data is used to train a downstream ASR system using the new feature representations. Based on our previous work DeCoAR and inspirations from other speech representation learning, we propose DeCoAR 2.0, a Deep Contextualized Acoustic Representation with vector quantization. We introduce several modifications over the DeCoAR: first, we use Transformers in encoding module instead of LSTMs; second, we introduce a vector quantization layer between encoder and reconstruction modules; third, we propose an objective that combines the reconstructive loss with vector quantization diversity loss to train speech representations. Our experiments show consistent improvements over other speech representations in different data-sparse scenarios. Without fine-tuning, a light-weight ASR model trained on 10 hours of LibriSpeech labeled data with DeCoAR 2.0 features outperforms the model trained on the full 960-hour dataset with filterbank features.

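Research Motivation and Objectives

Use unlabeled speech data to learn robust acoustic representations for automatic speech recognition.
Improve representation quality by replacing LSTMs with Transformers and adding vector quantization.
Combine a reconstruction loss with a diversity objective to train discrete speech representations.
Demonstrate effectiveness in data-scarce, semi-supervised ASR scenarios.
Analyze the effect of the VQ module on downstream ASR performance.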

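Proposed Method

Encoder: 1D convolutional layers followed by Transformer blocks produce the latent representation z (with a masked-frame strategy).
Quantization: multiple codebooks with Gumbel-Softmax and a straight-through estimator map z to a quantized representation v built from discrete codewords.
Reconstruction: a feed-forward network reconstructs the original frames from the quantized representation, trained with an L1 loss.
Diversity loss: encourages uniform use of the codebook entries, promoting informative linguistic units.
Joint objective: L = L_recon + alpha * L_div, used to train the model.
Semi-supervised downstream: after pre-training, the encoder is frozen and attached to a downstream ASR model without fine-tuning; the ASR model is trained with a CTC loss.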
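The quantization, diversity, and joint-objective steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the summary does not give the exact form of the diversity loss, so the sketch assumes an entropy-based formulation over batch-averaged codeword probabilities, as used in wav2vec 2.0-style quantizers. All function names, the temperature `tau`, and the weight `alpha` are hypothetical. Only the forward pass of Gumbel-Softmax is shown; the straight-through estimator concerns gradients and would need an autodiff framework.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_softmax_hard(logits, tau, rng):
    """Forward pass of hard Gumbel-Softmax: returns a one-hot codeword choice.

    With autodiff, the straight-through estimator would emit this one-hot
    vector forward and use the soft probabilities for the gradient.
    """
    noisy = []
    for x in logits:
        u = min(max(rng.random(), 1e-10), 1.0 - 1e-10)  # keep log() finite
        gumbel = -math.log(-math.log(u))
        noisy.append((x + gumbel) / tau)
    soft = softmax(noisy)
    best = max(range(len(soft)), key=soft.__getitem__)
    return [1.0 if i == best else 0.0 for i in range(len(soft))]


def diversity_loss(batch_logits):
    """Assumed entropy-based diversity loss (hypothetical form).

    batch_logits[t][g] holds V logits for frame t and codebook g.
    Returns (1 / (G * V)) * sum_g sum_v p_gv * log(p_gv), where p_g is the
    softmax averaged over frames. The value is most negative when codewords
    are used uniformly, so minimizing it encourages uniform usage.
    """
    num_frames = len(batch_logits)
    num_books = len(batch_logits[0])
    num_codes = len(batch_logits[0][0])
    total = 0.0
    for g in range(num_books):
        avg = [0.0] * num_codes
        for t in range(num_frames):
            probs = softmax(batch_logits[t][g])
            for v in range(num_codes):
                avg[v] += probs[v] / num_frames
        total += sum(p * math.log(p + 1e-12) for p in avg)
    return total / (num_books * num_codes)


def l1_loss(predicted, target):
    """Mean absolute error between a reconstructed and an original frame."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)


def joint_objective(recon, div, alpha):
    """L = L_recon + alpha * L_div, as stated in the summary."""
    return recon + alpha * div


if __name__ == "__main__":
    rng = random.Random(0)
    # 3 frames, 2 codebooks, 4 codewords each (toy sizes, not from the paper).
    logits = [[[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(2)]
              for _ in range(3)]
    codes = [[gumbel_softmax_hard(book, 1.0, rng) for book in frame]
             for frame in logits]
    recon = l1_loss([0.2, 0.4, 0.1], [0.0, 0.5, 0.1])
    print(codes[0], joint_objective(recon, diversity_loss(logits), alpha=0.1))
```

The toy sizes and the value of `alpha` are placeholders chosen only to make the sketch run; the summary does not report the settings used in the paper.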

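Experimental Results

Research Questions

RQ1: Can a Transformer-based encoder combined with vector quantization produce robust, contextualized acoustic representations from unlabeled data?
RQ2: Does combining the reconstruction loss with the diversity loss improve ASR performance when labeled data is scarce?
RQ3: In the semi-supervised LibriSpeech setting, how does DeCoAR 2.0 compare with other representation learning methods such as wav2vec 2.0 and VQ-APC?
RQ4: In data-scarce scenarios, what effect does the VQ layer have on downstream ASR accuracy?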

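Key Findings

Under some conditions, DeCoAR 2.0 with 10 hours of labeled data matches or even outperforms a system trained on the full 960 hours with filterbank features.
In a highly data-scarce setting with 10 hours of labeled data, DeCoAR 2.0 achieves WERs of 5.43% (test-clean) and 13.27% (test-other).
With 1 hour of labeled data, DeCoAR 2.0 achieves WERs of 13.75% (test-clean) and 29.13% (test-other).
An ablation study shows that the VQ layer benefits ASR performance in the LibriSpeech 10-hour semi-supervised setting (without VQ: 6.29/18.54; with VQ: 5.43/13.27).
In the semi-supervised setting, DeCoAR 2.0 performs comparably to wav2vec 2.0.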
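To put the ablation numbers in perspective, the absolute WERs can be converted into relative reductions. The short sketch below does that arithmetic on the figures quoted above; the helper name `relative_wer_reduction` is hypothetical, and the relative figures are derived here rather than reported in the summary.

```python
def relative_wer_reduction(baseline_wer, improved_wer):
    """Relative WER reduction in percent, measured against the baseline."""
    return 100.0 * (baseline_wer - improved_wer) / baseline_wer


# (without VQ, with VQ) WERs from the 10-hour LibriSpeech ablation quoted above.
ablation = {"test-clean": (6.29, 5.43), "test-other": (18.54, 13.27)}

for split, (without_vq, with_vq) in ablation.items():
    reduction = relative_wer_reduction(without_vq, with_vq)
    print(f"{split}: {reduction:.1f}% relative WER reduction")
# test-clean: 13.7% relative WER reduction
# test-other: 28.4% relative WER reduction
```

On these figures the VQ layer helps more on the noisier test-other split than on test-clean.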

This review was generated by AI and reviewed by human editors.