QUICK REVIEW

[Paper Review] Iterative Pseudo-Labeling for Speech Recognition

Qiantong Xu, Tatiana Likhomanenko|arXiv (Cornell University)|May 19, 2020

Speech Recognition and Synthesis|References: 36|Cited by: 27

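One-Sentence Summary

This paper proposes Iterative Pseudo-Labeling (IPL), a semi-supervised speech recognition method that fine-tunes a pre-trained acoustic model while iteratively generating and refining pseudo-labels on unlabeled data, yielding substantial accuracy gains. IPL achieves state-of-the-art (SOTA) word error rate (WER) on the LibriSpeech dataset, reducing test-other WER to 1.85% with 960 hours of labeled data and to 3.19% with only 100 hours, while staying computationally efficient through model fine-tuning and dataset subsampling.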

ABSTRACT

Pseudo-labeling has recently shown promise in end-to-end automatic speech recognition (ASR). We study Iterative Pseudo-Labeling (IPL), a semi-supervised algorithm which efficiently performs multiple iterations of pseudo-labeling on unlabeled data as the acoustic model evolves. In particular, IPL fine-tunes an existing model at each iteration using both labeled data and a subset of unlabeled data. We study the main components of IPL: decoding with a language model and data augmentation. We then demonstrate the effectiveness of IPL by achieving state-of-the-art word-error rate on the Librispeech test sets in both standard and low-resource setting. We also study the effect of language models trained on different corpora to show IPL can effectively utilize additional text. Finally, we release a new large in-domain text corpus which does not overlap with the Librispeech training transcriptions to foster research in low-resource, semi-supervised ASR.

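Motivation and Objectives

Close the performance gap in low-resource automatic speech recognition (ASR) by exploiting large amounts of unlabeled speech.
Remove the computational inefficiency of training from scratch at every pseudo-labeling iteration.
Transfer knowledge from diverse text corpora through the language model to improve generalization.
Show that iteratively refining pseudo-labels through fine-tuning yields sustained gains over single-pass pseudo-labeling.
Provide a scalable, efficient semi-supervised training framework suited to large datasets such as LibriLight.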

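Proposed Method

At each iteration, generate pseudo-labels by running beam-search decoding with the current acoustic model and a language model on a subsampled subset of the unlabeled data.
Fine-tune the existing acoustic model on the labeled data plus the newly pseudo-labeled data instead of training from scratch.
Apply data augmentation at every fine-tuning step to improve robustness and generalization.
Subsample the unlabeled dataset to cut inference time and compute cost while preserving performance.
Use an acoustic model trained with the connectionist temporal classification (CTC) loss (a training criterion rather than a decoding step) to keep pseudo-label generation stable.
Follow a multi-stage training strategy: the model is first pre-trained on labeled data, then iteratively refined with pseudo-labeled data.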
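
The steps above can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `decode` and `fine_tune` are hypothetical callables standing in for beam-search decoding with a language model and for augmented fine-tuning, and the toy versions below only count rounds so that the control flow can be run end to end.

```python
import random


def iterative_pseudo_labeling(model, labeled, unlabeled, decode, fine_tune,
                              num_iterations=3, subset_fraction=0.2, seed=0):
    """Sketch of the IPL loop: subsample, pseudo-label, fine-tune, repeat."""
    rng = random.Random(seed)
    subset_size = max(1, int(len(unlabeled) * subset_fraction))
    for _ in range(num_iterations):
        # Draw a fresh subset of the unlabeled audio for this iteration.
        subset = rng.sample(unlabeled, subset_size)
        # Label the subset with the current model (hypothetical `decode`).
        pseudo_labeled = [(utt, decode(model, utt)) for utt in subset]
        # Keep training the same model; it is never re-initialised.
        model = fine_tune(model, labeled + pseudo_labeled)
    return model


# Toy stand-ins (assumptions): the "model" is just a count of fine-tuning rounds.
def toy_decode(model, utterance):
    return f"hypothesis-for-{utterance}-round-{model}"


training_set_sizes = []


def toy_fine_tune(model, examples):
    training_set_sizes.append(len(examples))
    return model + 1


labeled = [("utt-a", "hello world"), ("utt-b", "good morning")]
unlabeled = [f"audio-{i}" for i in range(10)]
final_model = iterative_pseudo_labeling(0, labeled, unlabeled,
                                        toy_decode, toy_fine_tune)
```

Because the pseudo-labels are regenerated from the latest model in every round, later rounds train on transcripts produced by a model that has already been fine-tuned; only a fraction of the unlabeled pool is decoded each time.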

Figure 7: WER with different LM weights used in beam-search decoding and rescoring.
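
To make the role of the LM weight concrete, here is a small rescoring sketch. The summary does not state the paper's exact decoding objective, so the log-linear combination below (acoustic score plus weighted LM score plus a per-word bonus) is an assumed, commonly used formulation; the function name, the `word_score` parameter, and all scores are made up for illustration.

```python
def rescore(nbest, lm_weight, word_score=0.0):
    """Return the hypothesis text with the highest combined score.

    Each n-best entry is (text, acoustic_log_prob, lm_log_prob).
    """
    def combined(entry):
        text, am_logprob, lm_logprob = entry
        return am_logprob + lm_weight * lm_logprob + word_score * len(text.split())

    return max(nbest, key=combined)[0]


# Illustrative scores only: the acoustic model slightly prefers the wrong word,
# while the language model prefers the fluent sentence.
nbest = [
    ("the cat sat on the mat", -12.0, -20.0),
    ("the cat sad on the mat", -11.5, -26.0),
]
best_without_lm = rescore(nbest, lm_weight=0.0)
best_with_lm = rescore(nbest, lm_weight=0.5)
```

With a weight of zero the acoustic score alone decides; raising the weight lets the language model overturn an acoustically plausible but unlikely word sequence, which is why the weight is tuned on held-out data.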

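Experimental Results

Research Questions

RQ1: Can iterative refinement of pseudo-labels through fine-tuning outperform single-pass pseudo-labeling and further improve ASR performance?
RQ2: How does the choice of language model, in particular in-domain versus out-of-domain text, affect pseudo-label quality and final model performance?
RQ3: To what extent can data subsampling and fine-tuning reduce training time while maintaining or improving accuracy?
RQ4: In the iterative setting, does beam-search decoding with a language model produce higher-quality pseudo-labels than greedy decoding?
RQ5: Can IPL make effective use of large unpaired text corpora to improve performance in low-resource ASR?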

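Key Findings

With 960 hours of labeled data, IPL achieves a state-of-the-art (SOTA) WER of 1.85% on LibriSpeech test-other, outperforming prior methods.
With only 100 hours of labeled data, IPL reduces test-other WER to 3.19%, showing strong effectiveness in the low-resource setting.
When rescoring with a 4-gram language model and a Transformer language model, IPL reaches 3.26% WER on test-other using 960 hours of labeled data and the 54K in-domain text.
IPL reduces training time by up to 80% relative to training from scratch; in the reported example it reaches 4.12% WER in 8 days, whereas full retraining takes 17 days (roughly a 53% reduction in that case).
Subsampling the unlabeled data at a 20% rate makes pseudo-label generation 5x faster with negligible loss in performance.
Even with a higher-perplexity language model, IPL achieves better WER when using in-domain text (e.g., LV-54K), indicating effective knowledge transfer even in the presence of potential label leakage.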
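
Every figure above is a word error rate, so a short reference computation may help. The sketch below uses the standard definition (word-level edit distance divided by the number of reference words); it is not the paper's scoring script, and the example sentences are invented.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# One substitution (sat -> sad) and one deletion (the) over six reference words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sad on mat")
```

Corpus-level WER is normally computed by summing errors and reference words over all utterances before dividing, rather than averaging per-utterance rates.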

Figure 8: Comparison of the WER heatmap of rescoring LM weight on development and training set. Left: decoding a checkpoint of an AM that is not fully converged, with (a) WER on dev-other and (b) WER on train-clean-100 and train-other-500 . Right: decoding a fully converged AM, with (c) WER on dev-o


This review was generated by AI and reviewed by human editors.