QUICK REVIEW

[Paper Review] Multilingual End-to-End Speech Recognition with A Single Transformer on Low-Resource Languages

Shiyu Zhou, Shuang Xu|arXiv (Cornell University)|Jun 12, 2018

Speech Recognition and Synthesis | References: 22 | Cited by: 62

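One-Sentence Summary

The paper shows that a single multilingual ASR Transformer using sub-words (obtained via BPE) can recognize six low-resource languages, and that injecting language information, either as a symbol at the end of the sub-word sequence or as the sentence start token, lowers WER; the start-token variant (B2), which requires the language to be known at test time, gives the best results.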

ABSTRACT

Sequence-to-sequence attention-based models integrate an acoustic, pronunciation and language model into a single neural network, which make them very suitable for multilingual automatic speech recognition (ASR). In this paper, we are concerned with multilingual speech recognition on low-resource languages by a single Transformer, one of sequence-to-sequence attention-based models. Sub-words are employed as the multilingual modeling unit without using any pronunciation lexicon. First, we show that a single multilingual ASR Transformer performs well on low-resource languages despite of some language confusion. We then look at incorporating language information into the model by inserting the language symbol at the beginning or at the end of the original sub-words sequence under the condition of language information being known during training. Experiments on CALLHOME datasets demonstrate that the multilingual ASR Transformer with the language symbol at the end performs better and can obtain relatively 10.5% average word error rate (WER) reduction compared to SHL-MLSTM with residual learning. We go on to show that, assuming the language information being known during training and testing, about relatively 12.4% average WER reduction can be observed compared to SHL-MLSTM with residual learning through giving the language symbol as the sentence start token.

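Motivation and Objectives

Study multilingual end-to-end ASR on low-resource languages using a single Transformer.
Evaluate whether sub-word units obtained via BPE can remove the need for a pronunciation lexicon.
Explore ways of injecting language information on the decoder side (in the target sub-word sequence) to reduce language confusion.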
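
To make the sub-word idea concrete, here is a minimal, toy sketch of how BPE merges can be learned from raw text and then applied to segment a word, with no pronunciation lexicon involved. It is not the authors' pipeline: the function names (`learn_bpe`, `merge_pair`, `segment`) and the tiny corpus are illustrative assumptions, and the sketch omits the end-of-word marker that production BPE tools normally use.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge operations from a list of words (toy version)."""
    # Each word starts as a tuple of single characters.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Segment a word into sub-words by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus, not data from the paper.
corpus = ["low", "low", "lower", "lowest"]
merges = learn_bpe(corpus, num_merges=2)
print(merges)                     # [('o', 'w'), ('l', 'ow')]
print(segment("lowest", merges))  # ['low', 'e', 's', 't']
```

In a multilingual setting the same procedure would be run on the pooled transcripts of all languages, so that the resulting sub-word inventory is shared across them.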

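Proposed Method

Use a single ASR Transformer with multi-head attention and position-wise feed-forward layers.
Adopt BPE-derived sub-words as the multilingual modeling unit shared across languages.
Extend the symbol vocabulary with language symbols, and compare insertion points (beginning vs. end) as well as test-time use (when the language is known).
Initialize multilingual training from a model trained on a high-resource language to cope with limited data, replacing the softmax layer with a language-specific output layer.
Try different numbers of BPE merges (α) to balance sub-word vocabulary size against the amount of data per sub-word.
Average the last 20 model checkpoints for stability.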
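
The language-symbol variants described above can be sketched as simple transformations of the target sub-word sequence. This is only an illustration under stated assumptions: the token spellings (`<S>`, `</S>`, `<JA>`), the function names, and the variant label "B" for the beginning-of-sequence case are hypothetical; "E" and "B2" follow the Transformer-E and Transformer-B2 names used in this review.

```python
def extend_vocab(subword_vocab, languages):
    """Return a new vocabulary list with one language symbol per language appended."""
    return list(subword_vocab) + [f"<{lang}>" for lang in languages]


def build_target(subwords, lang, variant, sos="<S>", eos="</S>"):
    """Build the decoder target sequence for one utterance.

    variant "B":  language symbol inserted at the beginning of the sub-word sequence
    variant "E":  language symbol inserted at the end of the sub-word sequence
    variant "B2": language symbol used as the sentence start token itself
                  (requires the language to be known at test time as well)
    """
    lang_symbol = f"<{lang}>"
    if variant == "B":
        return [sos, lang_symbol] + list(subwords) + [eos]
    if variant == "E":
        return [sos] + list(subwords) + [lang_symbol, eos]
    if variant == "B2":
        return [lang_symbol] + list(subwords) + [eos]
    raise ValueError(f"unknown variant: {variant!r}")


# Hypothetical sub-word sequence and language code, for illustration only.
subwords = ["kon", "nichi", "wa"]
print(build_target(subwords, "JA", "B"))   # ['<S>', '<JA>', 'kon', 'nichi', 'wa', '</S>']
print(build_target(subwords, "JA", "E"))   # ['<S>', 'kon', 'nichi', 'wa', '<JA>', '</S>']
print(build_target(subwords, "JA", "B2"))  # ['<JA>', 'kon', 'nichi', 'wa', '</S>']
```

With "B" and "E" the model has to predict the language symbol itself, so no language label is needed at test time; with "B2" the symbol is supplied as the first decoder input, which is why that variant assumes the language is known during both training and testing.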

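Experimental Results

Research Questions

RQ1: Can a single multilingual Transformer achieve competitive WER on low-resource languages without a pronunciation lexicon?
RQ2: How does embedding language information as a symbol at the beginning or end of the sub-word sequence, or as the sentence start token when the language is known, affect WER across languages?
RQ3: How does the number of BPE merges (α) affect performance in monolingual and multilingual settings?
RQ4: Can using language information during training (and testing) reduce language confusion in multilingual end-to-end ASR?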
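
All four questions are measured in terms of WER. As a reminder of what that metric computes, here is a minimal sketch of the standard definition: the word-level edit distance (substitutions, insertions, deletions) between hypothesis and reference, divided by the number of reference words. The function name `wer` and the whitespace tokenization are assumptions for illustration; the paper's exact scoring setup is not described in this review.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the first i-1 reference words
    # and the first j hypothesis words; it is updated row by row.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical sentences: one substitution and one insertion over 3 reference words.
print(wer("the cat sat", "the bat sat down"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.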

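Key Findings

A single multilingual ASR Transformer with the language symbol at the end (Transformer-E) achieves a 10.5% relative reduction in average WER compared to SHL-MLSTM-RESIDUAL.
When language information is known during both training and testing, using the language symbol as the sentence start token (Transformer-B2) yields about a 12.4% relative reduction in average WER compared to SHL-MLSTM-RESIDUAL.
Multilingual training with shared sub-words generally outperforms monolingual models on average; however, language confusion remains a problem when no language conditioning is used.
The best multilingual configuration (B2) substantially reduces WER on several languages and shows that, given a language hint, the model decodes into the correct language.
When language information is available only during training, placing the language symbol at the end of the sequence performs better than placing it at the beginning; the best overall results come from B2, which also uses the language information at test time.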
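
The 10.5% and 12.4% figures are relative, not absolute, reductions in average WER. A minimal sketch of that arithmetic follows; the function name and the input numbers are hypothetical and are not results from the paper.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Relative WER reduction in percent: how much of the baseline error was removed."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer * 100.0


# Hypothetical values: a baseline at 40.0% WER and a new system at 35.0% WER.
# The absolute reduction is 5.0 points; the relative reduction is 12.5%.
print(relative_wer_reduction(40.0, 35.0))  # 12.5
```

The distinction matters when comparing systems: the same relative reduction corresponds to a smaller absolute gain when the baseline WER is lower.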

This review was generated by AI and reviewed by a human editor.