QUICK REVIEW

[Paper Review] Unsupervised Speech Recognition via Segmental Empirical Output Distribution Matching

Chih‐Kuan Yeh, Jianshu Chen|arXiv (Cornell University)|Dec 22, 2018

Speech Recognition and Synthesis | Cited by 26

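Summary at a Glance

The paper proposes a fully unsupervised speech recognition system that alternates between two steps: training a phoneme classifier with a new Segmental Empirical Output Distribution Matching (SE-ODM) loss, and refining the phoneme boundaries with an approximate maximum a posteriori (MAP) method. Without any labeled data, the system reaches a phone error rate (PER) of 41.6% on TIMIT. With oracle boundaries and a matching language model, the PER falls to 32.5%, which approaches a supervised model of the same architecture.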

ABSTRACT

We consider the problem of training speech recognition systems without using any labeled data, under the assumption that the learner can only access to the input utterances and a phoneme language model estimated from a non-overlapping corpus. We propose a fully unsupervised learning algorithm that alternates between solving two sub-problems: (i) learn a phoneme classifier for a given set of phoneme segmentation boundaries, and (ii) refining the phoneme boundaries based on a given classifier. To solve the first sub-problem, we introduce a novel unsupervised cost function named Segmental Empirical Output Distribution Matching, which generalizes the work in (Liu et al., 2017) to segmental structures. For the second sub-problem, we develop an approximate MAP approach to refining the boundaries obtained from Wang et al. (2017). Experimental results on TIMIT dataset demonstrate the success of this fully unsupervised phoneme recognition system, which achieves a phone error rate (PER) of 41.6%. Although it is still far away from the state-of-the-art supervised systems, we show that with oracle boundaries and matching language model, the PER could be improved to 32.5%. This performance approaches the supervised system of the same model architecture, demonstrating the great potential of the proposed method.

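Motivation and Goals

Build a fully unsupervised speech recognition system that needs no labeled data and no forced alignment.
Handle the segmental structure of speech: each phoneme spans a variable number of frames, and the boundaries between phonemes are unknown.
Improve unsupervised phoneme recognition by jointly optimizing the classifier and the estimated segmentation boundaries.
Show that an unsupervised model can approach supervised performance when accurate boundaries are available.
Extend the Empirical Output Distribution Matching (Empirical-ODM) framework of Liu et al. (2017) to segmental structures in sequence modeling.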

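Proposed Method

Introduce Segmental Empirical-ODM (SE-ODM), a new unsupervised loss that encourages consistent predictions within each segment and matches the segment-level output distribution to a pretrained phoneme language model.
Use a neural network to map acoustic features directly to phoneme sequences, avoiding clustering or embedding methods.
Refine the phoneme boundaries with approximate maximum a posteriori (MAP) inference under the current classifier, starting from boundaries produced by the GRU-based autoencoder of Wang et al. (2017).
Alternate between SE-ODM classifier training and boundary refinement so that the two components improve each other iteratively.
Adapt semi-supervised HMM training techniques to the unsupervised setting to improve performance further.
Estimate the phoneme language model from a non-overlapping text corpus, so that language modeling requires no transcribed speech.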
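
The two ingredients of the SE-ODM idea described above (matching segment-level output statistics to a language model, and keeping predictions consistent inside a segment) can be illustrated with a toy loss. This is a minimal sketch under stated assumptions, not the paper's exact cost function: it assumes a bigram language model, takes the middle frame of each segment as its representative, and uses a squared-difference penalty between consecutive frames. The function name `se_odm_style_loss` and the parameter `smooth_weight` are hypothetical.

```python
import math


def se_odm_style_loss(frame_probs, boundaries, bigram_lm, smooth_weight=1.0):
    """Toy segmental output-distribution-matching loss (illustrative only).

    frame_probs: per-frame phoneme posteriors, one list of probabilities per frame.
    boundaries:  segment start indices plus the final end index, e.g. [0, 2, 4, 6].
    bigram_lm:   dict mapping (phoneme_i, phoneme_j) to its prior probability.
    """
    segments = [(boundaries[k], boundaries[k + 1])
                for k in range(len(boundaries) - 1)]
    if len(segments) < 2:
        raise ValueError("need at least two segments to form a bigram")

    # Assumption: the middle frame stands in for the whole segment.
    mids = [frame_probs[(start + end) // 2] for start, end in segments]
    n_pairs = len(mids) - 1

    # Matching term: cross-entropy between the LM bigram prior and the
    # expected bigram frequency the classifier predicts over adjacent segments.
    match = 0.0
    for (i, j), p_lm in bigram_lm.items():
        expected = sum(mids[k][i] * mids[k + 1][j] for k in range(n_pairs)) / n_pairs
        match -= p_lm * math.log(expected + 1e-12)

    # Consistency term: consecutive frames inside one segment should agree.
    smooth = 0.0
    for start, end in segments:
        for t in range(start, end - 1):
            smooth += sum((a - b) ** 2
                          for a, b in zip(frame_probs[t], frame_probs[t + 1]))

    return match + smooth_weight * smooth


# Two phonemes that alternate according to the (toy) language model.
lm = {(0, 1): 0.5, (1, 0): 0.5}
alternating = [[0.9, 0.1]] * 2 + [[0.1, 0.9]] * 2 + [[0.9, 0.1]] * 2
constant = [[0.9, 0.1]] * 6

print(se_odm_style_loss(alternating, [0, 2, 4, 6], lm))  # outputs agree with the LM
print(se_odm_style_loss(constant, [0, 2, 4, 6], lm))     # outputs ignore the LM
print(se_odm_style_loss(alternating, [0, 1, 4, 6], lm))  # misplaced boundary
```

In this toy setting the loss is lowest when the classifier outputs follow the language model statistics and the boundaries line up with the changes in the posteriors, which is the intuition behind alternating between classifier training and boundary refinement.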

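Experimental Results

Research Questions

RQ1: Can a fully unsupervised speech recognition system be trained without any labeled frames or transcripts?
RQ2: Without boundary annotations, how can the segmental structure of speech, in which each phoneme spans a variable number of frames, be modeled effectively?
RQ3: When only acoustic features and a language model are available, can a new unsupervised loss such as SE-ODM improve classifier performance?
RQ4: How far can boundary estimates be improved without supervision, using only the current classifier?
RQ5: When accurate phoneme boundaries are provided, how closely can the unsupervised system approach a supervised one?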

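Key Findings

The fully unsupervised system reaches a phone error rate (PER) of 41.6% on the TIMIT phoneme recognition benchmark, reported as the first empirical success of a fully unsupervised ASR system that does not rely on oracle boundaries.
With oracle phoneme boundaries and a matching language model, the PER falls to 32.5%, which approaches a supervised system of the same model architecture.
The SE-ODM loss aligns the predicted output distribution with the language model distribution and enforces consistency within segments, which makes training without labels possible.
Iterative boundary refinement with approximate MAP improves segmentation accuracy and, with it, overall recognition performance.
The core technique may extend to other sequence-to-sequence tasks that lack labels.
The gap between 41.6% and 32.5% indicates that boundary quality is the main limiting factor, so better boundary learning could narrow the remaining gap to supervised systems.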
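
The figures above are phone error rates. As a reminder of what that metric measures, here is a minimal sketch of the standard definition: the Levenshtein edit distance between the reference and hypothesized phone sequences, divided by the reference length. The function name `phone_error_rate` and the example phone sequences are illustrative assumptions; the paper's own scoring setup (for example, its phone set folding) is not reproduced here.

```python
def phone_error_rate(reference, hypothesis):
    """PER = (substitutions + deletions + insertions) / len(reference)."""
    if not reference:
        raise ValueError("reference must contain at least one phone")

    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j phones of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_phone in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_phone in enumerate(hypothesis, start=1):
            cost = 0 if ref_phone == hyp_phone else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(reference)


# Made-up example: one substitution (dh -> d) and one deletion (t).
ref = ["sil", "dh", "ax", "k", "ae", "t", "sil"]
hyp = ["sil", "d", "ax", "k", "ae", "sil"]
print(f"PER = {phone_error_rate(ref, hyp):.1%}")  # 2 errors over 7 reference phones
```

Because insertions count as errors, PER can exceed 100% when the hypothesis is much longer than the reference.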


This review was generated by AI and checked by a human editor.