QUICK REVIEW

[论文解读] The Sound of Healthcare: Improving Medical Transcription ASR Accuracy with Large Language Models

Ayo Adedeji, Sarita Joshi|arXiv (Cornell University)|Feb 12, 2024

Phonocardiography and Auscultation Techniques被引用 11

一句话总结

该论文研究使用大语言模型（LLMs）对医疗转写的ASR转录进行后处理，通过对 PriMock57 数据集进行零-shot 与链式思维提示，在 WER、MC-WER 与说话人分段方面取得提升。

ABSTRACT

In the rapidly evolving landscape of medical documentation, transcribing clinical dialogues accurately is increasingly paramount. This study explores the potential of Large Language Models (LLMs) to enhance the accuracy of Automatic Speech Recognition (ASR) systems in medical transcription. Utilizing the PriMock57 dataset, which encompasses a diverse range of primary care consultations, we apply advanced LLMs to refine ASR-generated transcripts. Our research is multifaceted, focusing on improvements in general Word Error Rate (WER), Medical Concept WER (MC-WER) for the accurate transcription of essential medical terms, and speaker diarization accuracy. Additionally, we assess the role of LLM post-processing in improving semantic textual similarity, thereby preserving the contextual integrity of clinical dialogues. Through a series of experiments, we compare the efficacy of zero-shot and Chain-of-Thought (CoT) prompting techniques in enhancing diarization and correction accuracy. Our findings demonstrate that LLMs, particularly through CoT prompting, not only improve the diarization accuracy of existing ASR systems but also achieve state-of-the-art performance in this domain. This improvement extends to more accurately capturing medical concepts and enhancing the overall semantic coherence of the transcribed dialogues. These findings illustrate the dual role of LLMs in augmenting ASR outputs and independently excelling in transcription tasks, holding significant promise for transforming medical ASR systems and leading to more accurate and reliable patient records in healthcare settings.

研究动机与目标

评估 LLM 是否能够超越基线 ASR 性能来提升医疗转录的 ASR 输出。
评估在 LLM 后处理后的一般 WER、Medical Concept WER（MC-WER）与说话人分段准确性。
比较零-shot 提示与链式思维提示在分段与纠错方面的效果。
分析标点质量对分段与纠错性能的影响。
证明 LLM 能否在医学转录任务中达到最先进的结果。

提出的方法

使用 PriMock57 数据集（57 次模拟会诊，约 9 小时）及其真实文本转录与分段。
将六种 ASR 系统（GCMC、Chirp、Whisper 1、Amazon Transcribe Medical、Soniox、Deepgram Nova 2）作为基线进行评估。
应用多种 LLM（Gemini Pro/Ultra、Text Bison 32k、Claude V2、GPT-4、PaLM Gecko/2、Ada embeddings、LLaMA 2）对 ASR 输出进行后处理。
以零-shot 模板对 LLM 进行提示，以分段长度（5、10 行，或完整转录）来分段并纠错。
实现链式思维提示，将任务分解为标点、分段与纠错步骤，并提供少量示例推理。
应用正则解析与 Smith-Waterman 对齐来处理输出格式并缓解文本降解；通过标准化预处理计算 WER/MC-WER。

Figure 1: Excerpt from a mock consultation showing the ground truth transcript (top) and the corresponding ASR output (bottom). Errors are annotated: substitutions in red, deletions in blue, and insertions in green.

实验结果

研究问题

RQ1LLMs 能否通过后处理提高医疗 ASR 输出的词错误率（WER）？
RQ2LLMs 是否通过更好地识别和规范化医学术语来提高 Medical Concept WER（MC-WER）？
RQ3在 LLM 后处理的 ASR 输出中，分段准确率如何，并且 Chain-of-Thought 提示是否优于零-shot 提示？
RQ4标点质量对分段与纠错性能有何影响？
RQ5LLM 后处理方法是否能在医生与患者分段的多输入窗口大小下实现最先进的结果？

主要发现

在多种配对中，使用 Chain-of-Thought 提示的 LLM 后处理在分段的 diarization 方面优于基线 ASR，且在 Doctor-Specific Diarizaton（D-WER）方面有显著提升。
某些 LLM/ASR 配对（如 GPT-4 或 Gemini Pro/Ultra 与 Whisper 1）在 10-line 块实验中达到低于所有 ASR 基准的 D-WER。
在患者特定的分段中，LLMs 具有竞争力，且在某些 ASR 基准之上，尤其是在一次性处理完整转录时，一些 LLM/ASR 配对超过基线。
Whisper 1 一般在评估系统中产生最低的 MC-WER，若与 LLMs（如 GPT-4、Gemini Ultra）搭配，进一步降低医学概念错误。
标点质量对分段影响显著；某些 ASR 模型的对抗性标点可通过 CoT 工作流中的初始标点增强步骤来缓解。
该研究展示了双重收益：LLMs 既能对 ASR 输出进行后处理，又在转录任务中表现出色，凸显了改良医疗记录的潜力。

Figure 2: Visual of medical concepts recognized by the Healthcare Natural Language API

更好的研究，从现在开始

从论文设计到论文写作，大幅缩短您的研究时间。

无需绑定信用卡

本解读由 AI 生成，并经人工编辑审核。