QUICK REVIEW

[Paper Review] A Comparison of Online Automatic Speech Recognition Systems and the Nonverbal Responses to Unintelligible Speech

Joshua Y. Kim, Chunfeng Liu|arXiv (Cornell University)|Apr 28, 2019

Speech and dialogue systems | Cited by 23

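ONE-SENTENCE SUMMARY

Using videoconferencing conversations between student doctors and simulated patients, this study compares five online automatic speech recognition (ASR) systems (Google Cloud, IBM Watson, Microsoft Azure, Trint, and YouTube) against manual transcriptions; YouTube was the most accurate of the five, although manual transcription remained superior, and the listener's smile intensity varied less when the speaker was unintelligible (high word error rate), suggesting that nonverbal cues can reflect unintelligible speech.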

ABSTRACT

Automatic Speech Recognition (ASR) systems have proliferated over the recent years to the point that free platforms such as YouTube now provide speech recognition services. Given the wide selection of ASR systems, we contribute to the field of automatic speech recognition by comparing the relative performance of two sets of manual transcriptions and five sets of automatic transcriptions (Google Cloud, IBM Watson, Microsoft Azure, Trint, and YouTube) to help researchers to select accurate transcription services. In addition, we identify nonverbal behaviors that are associated with unintelligible speech, as indicated by high word error rates. We show that manual transcriptions remain superior to current automatic transcriptions. Amongst the automatic transcription services, YouTube offers the most accurate transcription service. For non-verbal behavioral involvement, we provide evidence that the variability of smile intensities from the listener is high (low) when the speaker is clear (unintelligible). These findings are derived from videoconferencing interactions between student doctors and simulated patients; therefore, we contribute towards both the ASR literature and the healthcare communication skills teaching community.

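MOTIVATION AND OBJECTIVES

Evaluate and compare the transcription accuracy of five widely used online ASR systems against manual transcriptions.
Identify nonverbal behaviors associated with unintelligible speech, particularly in healthcare communication settings.
Understand how listeners' nonverbal cues, such as facial expressions, vary with speech intelligibility as measured by word error rate.
Help researchers and practitioners in healthcare communication training select accurate ASR tools for research and clinical applications.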

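PROPOSED METHOD

Collect spoken dialogue data from videoconferencing interactions between student doctors and simulated patients.
Obtain manual transcriptions to serve as the gold standard against which the automatic transcriptions are compared.
Transcribe the same audio with five online ASR systems: Google Cloud, IBM Watson, Microsoft Azure, Trint, and YouTube.
Compute the word error rate (WER) to compare each ASR system quantitatively against the manual transcriptions.
Analyze the listener's facial expressions using facial landmark detection and smile intensity measures to assess nonverbal responses to speech intelligibility.
Relate WER values to the variability of smile intensity to identify behavioral patterns associated with speech intelligibility.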
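
The WER step above can be illustrated with a minimal sketch. It uses the standard definition, WER = (S + D + I) / N, where S, D, and I are the word-level substitutions, deletions, and insertions needed to turn the reference into the hypothesis, and N is the number of words in the reference. The function name `word_error_rate`, the lowercasing and whitespace tokenization, and the example sentences are assumptions for illustration; the summary does not say how the paper normalized its transcripts.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Assumption: text is lowercased and split on whitespace; the paper's
    actual normalization is not described in this summary.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Invented example sentences, not taken from the study's data.
manual = "the cat sat on the mat"
automatic = "the cat sit on mat"
print(f"WER = {word_error_rate(manual, automatic):.3f}")
```

Because the denominator is the reference length, WER can exceed 1.0 when the hypothesis contains many inserted words.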

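RESULTS

Research Questions

RQ1: Which online ASR system produces the most accurate transcriptions relative to manual transcription?
RQ2: How do nonverbal behaviors, especially smile intensity, change when speech is unintelligible?
RQ3: Is there a measurable relationship between word error rate and the variability of the listener's facial expressions?
RQ4: Can nonverbal cues serve as a reliable indicator of speech intelligibility in real-time communication?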

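Key Findings

YouTube's ASR service had the lowest word error rate of the five systems evaluated, making it the most accurate on this dataset.
Manual transcriptions remained more accurate than every automatic transcription system tested.
The variability of the listener's smile intensity was low when the speaker was unintelligible (high word error rate) and high when the speaker was clear.
The reduced fluctuation in smile intensity during unintelligible speech suggests that listeners may respond emotionally or cognitively to the communication difficulty.
The association between word error rate and nonverbal responses suggests that facial behavior could serve as a proxy for speech intelligibility in real-time settings.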
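
The abstract reports that the variability of the listener's smile intensity is higher when the speaker is clear and lower when the speaker is unintelligible. The sketch below shows one way such a relationship could be checked: compute the standard deviation of smile intensity per segment and correlate it with the segment's WER. The per-segment grouping, the use of Pearson correlation, the helper names, and all the numbers are assumptions made up for illustration; this summary does not describe the paper's actual statistical procedure.

```python
from math import sqrt
from statistics import mean, pstdev


def smile_variability(intensities):
    """Population standard deviation of smile intensities within one segment."""
    return pstdev(intensities)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)


# Invented toy data (hypothetical): one WER per speaker segment and the
# listener's smile intensities sampled during that segment.
segments = [
    {"wer": 0.10, "smile": [0.1, 0.8, 0.2, 0.9]},
    {"wer": 0.30, "smile": [0.3, 0.6, 0.4, 0.7]},
    {"wer": 0.60, "smile": [0.5, 0.5, 0.5, 0.6]},
]

wers = [s["wer"] for s in segments]
variabilities = [smile_variability(s["smile"]) for s in segments]
r = pearson_r(wers, variabilities)
print(f"Pearson r between WER and smile variability: {r:.2f}")
```

On this toy data the coefficient is negative by construction, which matches the direction stated in the abstract; it says nothing about the effect size in the actual study.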

This review was generated by AI and checked by a human editor.