QUICK REVIEW

[Paper Review] A Comparison of Online Automatic Speech Recognition Systems and the Nonverbal Responses to Unintelligible Speech

Joshua Y. Kim, Chunfeng Liu | arXiv (Cornell University) | Apr 28, 2019

Speech and dialogue systems | Cited by 23

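One-Sentence Summary

Using videoconferencing interactions between student doctors and simulated patients, this study evaluates five online automatic speech recognition (ASR) systems (Google Cloud, IBM Watson, Microsoft Azure, Trint, and YouTube) against manual transcriptions. YouTube was the most accurate of the automatic services. Word error rate was also associated with the listener's smiling: smile-intensity variability was high when the speaker was clear and low when the speaker was unintelligible, which suggests that nonverbal cues carry information about speech intelligibility.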

ABSTRACT

Automatic Speech Recognition (ASR) systems have proliferated over the recent years to the point that free platforms such as YouTube now provide speech recognition services. Given the wide selection of ASR systems, we contribute to the field of automatic speech recognition by comparing the relative performance of two sets of manual transcriptions and five sets of automatic transcriptions (Google Cloud, IBM Watson, Microsoft Azure, Trint, and YouTube) to help researchers to select accurate transcription services. In addition, we identify nonverbal behaviors that are associated with unintelligible speech, as indicated by high word error rates. We show that manual transcriptions remain superior to current automatic transcriptions. Amongst the automatic transcription services, YouTube offers the most accurate transcription service. For non-verbal behavioral involvement, we provide evidence that the variability of smile intensities from the listener is high (low) when the speaker is clear (unintelligible). These findings are derived from videoconferencing interactions between student doctors and simulated patients; therefore, we contribute towards both the ASR literature and the healthcare communication skills teaching community.

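Motivation and Objectives

Evaluate and compare the transcription accuracy of five major online ASR systems against manual transcriptions.
Identify the nonverbal behaviors that accompany unintelligible speech, particularly in a healthcare communication context.
Understand how listener nonverbal cues, such as facial expressions, change with speech clarity as measured by word error rate.
Help researchers and healthcare communication trainers select an accurate ASR tool.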

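Proposed Method

Collect conversation data from videoconferencing interactions between student doctors and simulated patients.
Collect manual transcriptions as the reference against which the automatic transcriptions are compared.
Process the same audio with five online ASR systems: Google Cloud, IBM Watson, Microsoft Azure, Trint, and YouTube.
Compute the word error rate (WER) of each ASR system against the manual transcriptions to quantify its performance.
Analyze the listener's facial expressions using facial landmark detection and a smile-intensity measure to assess nonverbal responses to speech clarity.
Relate WER to the variability of smile intensity to identify behavioral patterns associated with speech clarity.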
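
As a rough illustration of the WER step above, the following sketch assumes the standard definition: the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference transcript. The function name `word_error_rate`, the lowercase whitespace tokenization, and the example sentences are illustrative assumptions, not the authors' actual pipeline.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    # Assumption: simple lowercase + whitespace tokenization (no punctuation handling).
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion (reference word missing)
                curr[j - 1] + 1,                  # insertion (extra hypothesis word)
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    manual = "the patient feels dizzy"       # made-up reference transcript
    automatic = "the patient fields dizzy"   # made-up ASR output
    print(word_error_rate(manual, automatic))  # one substitution in four words: 0.25
```

Because insertions count as errors, WER can exceed 1.0 when the ASR output is much longer than the reference.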

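Experimental Results

Research Questions

RQ1: Which online ASR system produces the most accurate transcriptions when compared with manual transcriptions?
RQ2: How do nonverbal behaviors, smile intensity in particular, change in response to speech perceived as unintelligible?
RQ3: Is there a measurable relationship between word error rate and the variability of the listener's facial expressions?
RQ4: Can nonverbal cues serve as a reliable indicator of speech clarity in real-time communication?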

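Key Findings

YouTube's ASR service had the lowest word error rate of the five systems evaluated, making it the most accurate on this dataset.
Manual transcriptions remained more accurate than every automatic transcription system tested.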
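When the speaker was unintelligible (high word error rate), the variability of the listener's smile intensity was low; when the speaker was clear, it was high.
The reduced smile variability during unintelligible speech may reflect the listener's emotional or cognitive response to the communication breakdown.
The association between word error rate and nonverbal response suggests that listener facial behavior might serve as a proxy for speech intelligibility in real-time settings; the evidence reported is an association, not a validated indicator.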
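
One way to examine the relationship between WER and smile variability is sketched below. It assumes that each speech segment has a WER value and a series of listener smile-intensity readings, summarizes variability as the population standard deviation, and computes a Pearson correlation. The segment structure, the function names, and the choice of Pearson correlation are assumptions made for illustration; the numbers are synthetic placeholders, not data from the paper.

```python
from math import sqrt
from statistics import mean, pstdev


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def smile_variability_vs_wer(segments: list[dict]) -> float:
    """Correlate per-segment WER with the spread of listener smile intensity."""
    wers = [seg["wer"] for seg in segments]
    # Variability here is the standard deviation of smile intensity in the segment.
    variability = [pstdev(seg["smile"]) for seg in segments]
    return pearson(wers, variability)


if __name__ == "__main__":
    # Synthetic placeholder segments (hypothetical values, NOT from the paper).
    segments = [
        {"wer": 0.1, "smile": [0.1, 0.9, 0.2, 0.8]},
        {"wer": 0.5, "smile": [0.4, 0.6, 0.5, 0.5]},
        {"wer": 0.9, "smile": [0.5, 0.5, 0.5, 0.5]},
    ]
    r = smile_variability_vs_wer(segments)
    print(f"Pearson r between WER and smile variability: {r:.3f}")
```

A negative coefficient would match the direction stated in the abstract, where smile variability is high for clear speech and low for unintelligible speech.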

This review was generated by AI and checked by a human editor.