QUICK REVIEW

[Paper review] You don't understand me!: Comparing ASR results for L1 and L2 speakers of Swedish

Ronald Cumbal, Birger Moëll|arXiv (Cornell University)|May 22, 2024

Speech and dialogue systems | Cited by 7

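One-sentence summary

This paper compares automatic speech recognition (ASR) performance for first-language (L1) and second-language (L2) speakers of Swedish on read and spontaneous speech, using three ASR services (Google, Microsoft, Huggingface), and analyzes how error types and utterance length affect the results.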

ABSTRACT

The performance of Automatic Speech Recognition (ASR) systems has constantly increased in state-of-the-art development. However, performance tends to decrease considerably in more challenging conditions (e.g., background noise, multiple speaker social conversations) and with more atypical speakers (e.g., children, non-native speakers or people with speech disorders), which signifies that general improvements do not necessarily transfer to applications that rely on ASR, e.g., educational software for younger students or language learners. In this study, we focus on the gap in performance between recognition results for native and non-native, read and spontaneous, Swedish utterances transcribed by different ASR services. We compare the recognition results using Word Error Rate and analyze the linguistic factors that may generate the observed transcription errors.

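Motivation and goals

Measure the Word Error Rate (WER) gap between native (L1) and non-native (L2) Swedish in read and spontaneous speech.
Evaluate several off-the-shelf Swedish ASR systems under non-ideal conditions.
Identify common transcription errors and the linguistic factors that lead to misrecognition.
Examine how utterance length affects ASR performance on L1 and L2 Swedish speech.
Discuss the implications for educational and language-learning applications that rely on ASR.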

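Proposed method

Use two Swedish L2 datasets (Ville: read sentences; CORALL: social conversations), both containing native and non-native speakers.
Test three ASR systems: Google Cloud Speech-to-Text, Microsoft Azure Speech-to-Text, and a wav2vec2-based model from Huggingface.
Measure performance with Word Error Rate (WER) and the number of unrecognized samples (NFR).
Split results by utterance length (short, medium, long) to analyze length effects.
Analyze transcription errors to identify frequently misrecognized words and error categories (deletion vs. substitution).
Run statistical tests (Welch's t-test) to assess whether native vs. non-native differences are significant.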
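
The two measurements named in the method list can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `wer_counts`, `wer`, and `welch_t` are hypothetical, and the text normalization (lowercasing, whitespace tokenization) is an assumption, since the summary does not say how transcripts were normalized. WER is computed with a standard word-level edit-distance alignment, and Welch's test is shown as the t statistic plus the Welch-Satterthwaite degrees of freedom.

```python
from statistics import mean, variance


def wer_counts(reference: str, hypothesis: str):
    """Return (substitutions, deletions, insertions, reference_length)."""
    # Assumed normalization: lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # dp[i][j] = (cost, subs, dels, ins) for aligning ref[:i] with hyp[:j]
    dp = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, i, 0)  # delete every reference word
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, 0, j)  # insert every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            c, s, d, n = dp[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (c, s, d, n)          # exact match, no cost
            else:
                diag = (c + 1, s + 1, d, n)  # substitution
            c, s, d, n = dp[i - 1][j]
            dele = (c + 1, s, d + 1, n)      # deletion
            c, s, d, n = dp[i][j - 1]
            ins = (c + 1, s, d, n + 1)       # insertion
            dp[i][j] = min(diag, dele, ins, key=lambda t: t[0])
    _, s, d, n = dp[len(ref)][len(hyp)]
    return s, d, n, len(ref)


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (S + D + I) / number of reference words."""
    s, d, n, length = wer_counts(reference, hypothesis)
    if length == 0:
        raise ValueError("reference must contain at least one word")
    return (s + d + n) / length


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / (va + vb) ** 0.5
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df
```

For example, `wer("jag förstår inte", "jag inte")` counts one deletion over three reference words. When several alignments have the same total cost, the split between substitutions, deletions, and insertions can vary, although the WER itself does not. To get a p-value in practice, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the same Welch test.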

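Experimental results

Research questions

RQ1: Does the ASR performance gap between native and non-native speech persist across read and spontaneous Swedish?
RQ2: How do different ASR services perform on L1 and L2 Swedish?
RQ3: What are the common error patterns in non-native Swedish, and do they differ from those in native speech?
RQ4: How does utterance length affect ASR performance on L1 and L2 speech?
RQ5: What do ASR weaknesses imply for educational or language-learning applications?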

Main findings

Dataset	Speaker Type	Google WER	Microsoft WER	Huggingface WER
Ville (Read sentences)	Native	0.162	0.111	0.522
Ville (Read sentences)	Non-native	0.325	0.410	0.593
CORALL (Social conv.)	Native	0.412	0.356	0.641
CORALL (Social conv.)	Non-native	0.421	0.507	0.663
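
To make the size of the native vs. non-native gap easier to read off, the table above can be reduced to an absolute difference and a ratio per dataset and service. This is only a convenience sketch: the WER values are copied from the table, while the dictionary layout and the name `native_gaps` are hypothetical.

```python
# WER values copied from the table above, stored as (native, non_native).
WER = {
    ("Ville", "Google"): (0.162, 0.325),
    ("Ville", "Microsoft"): (0.111, 0.410),
    ("Ville", "Huggingface"): (0.522, 0.593),
    ("CORALL", "Google"): (0.412, 0.421),
    ("CORALL", "Microsoft"): (0.356, 0.507),
    ("CORALL", "Huggingface"): (0.641, 0.663),
}


def native_gaps(table):
    """Map each (dataset, service) to (absolute gap, non-native/native ratio)."""
    return {
        key: (round(non_native - native, 3), round(non_native / native, 2))
        for key, (native, non_native) in table.items()
    }


if __name__ == "__main__":
    for (dataset, service), (gap, ratio) in native_gaps(WER).items():
        print(f"{dataset:7s} {service:12s} gap={gap:.3f} ratio={ratio:.2f}")
```

The absolute gaps for Google and Microsoft are clearly larger on the read sentences (Ville) than on the social conversations (CORALL), where native WER is already high.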

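Native speakers generally obtain lower WER than non-native speakers. For some ASR services the gap is more pronounced on read sentences and smaller on spontaneous speech.
Microsoft Azure shows a significant native vs. non-native difference on spontaneous speech (N: 0.36 vs. NN: 0.51, p<0.05).
Google Cloud and Huggingface show no statistically significant native vs. non-native difference on the spontaneous speech in this study's datasets.
On read sentences, longer utterances generally yield better WER for native speakers, while the effect for non-native speakers is mixed and varies by ASR service.
Spontaneous speech often produces many unrecognized utterances (NFR), especially with Google and Microsoft, which limits usability in interactive educational settings.
Common misrecognitions include short function words (e.g., ja, och, du, jag) and learner-specific vocabulary (e.g., förstår, repetera), which shows that the words signaling a learner's difficulties are themselves error-prone.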

This review was generated by AI and reviewed by a human editor.