QUICK REVIEW

[논문 리뷰] Can Generative Large Language Models Perform ASR Error Correction?

Rao Ma, Mengjie Qian|arXiv (Cornell University)|2023. 07. 09.

Speech Recognition and Synthesis인용 수 7

한 줄 요약

이 논문은 Conformer-Transducer 및 Whisper AED의 두 ASR 아키텍처에서 ChatGPT를 이용한 제로샷 및 소수샷 ASR 오류 수정의 성능을, 제약 없는( unconstrained ), 선택적( selective ), 가장 근접한 매핑( closest-mapping ) 접근법과 0샷 및 1샷 프롬프트를 비교합니다.

ABSTRACT

ASR error correction is an interesting option for post processing speech recognition system outputs. These error correction models are usually trained in a supervised fashion using the decoding results of a target ASR system. This approach can be computationally intensive and the model is tuned to a specific ASR system. Recently generative large language models (LLMs) have been applied to a wide range of natural language processing tasks, as they can operate in a zero-shot or few shot fashion. In this paper we investigate using ChatGPT, a generative LLM, for ASR error correction. Based on the ASR N-best output, we propose both unconstrained and constrained, where a member of the N-best list is selected, approaches. Additionally, zero and 1-shot settings are evaluated. Experiments show that this generative LLM approach can yield performance gains for two different state-of-the-art ASR architectures, transducer and attention-encoder-decoder based, and multiple test sets.

연구 동기 및 목표

ASR 출력에 대한 재학습 없이 사후처리로서의 오류 수정의 동기 부여 및 평가.
다양한 ASR 아키텍처와 데이터셋에서 제로샷 및 소수샷 생성형 LLM 접근법의 평가.
ChatGPT를 사용한 오류 수정에서 무제한(prompts without constraints)와 제약된(N-best) 프롬팅 전략의 비교.

제안 방법

오디오 인식의 N-best 목록(기본 상위 5개)을 오류 수정기 없이 ChatGPT에 입력합니다.
제로샷 무제한(prompts unconstrained) 및 제로샷 선택적(prompts selective), 1샷 무제한(prompts unconstrained) 프롬프트를 테스트합니다.
출력 범위를 N-best 후보로 제한하기 위해 제약 디코딩 변형을 적용합니다: 선택적 접근법과 가장 근접 매핑(closest mapping).
두 ASR 시스템(Conformer-Transducer 및 Whisper small.en)에서 LibriSpeech 테스트 세트, TED-LIUM3, Artie Bias에 대해 평가합니다.
ChatGPT 결과를 미세 조정된 N-best T5 오류 수정 모델과 비교합니다.

Fig. 1 : N-best T5 error correction model structure.

실험 결과

연구 질문

RQ1생성형 LLM(ChatGPT)이 추가 학습 없이 제로샷 및 1샷 설정에서 ASR 오류 수정 성능을 개선할 수 있는가?
RQ2N-best 제약 디코딩 변형(선택적, 가장 근접)들이 ChatGPT를 사용할 때 무제한 생성에 비해 이익을 가져오는가?
RQ3ChatGPT 기반 수정이 서로 다른 ASR 아키텍처와 도메인에 걸쳐 미세 조정된 T5 N-best 오류 수정 모델과 어떻게 비교되는가?

주요 결과

ChatGPT는 Conformer-Transducer와 Whisper AED 시스템 모두에서 ASR 기준선 대비 성능 이점을 제공합니다.
제로샷 1샷 무제한 프롬프트를 사용한 ChatGPT가 WER을 개선하며, 1샷에서 가장 근접한 방식은 일부 설정에서 T5의 성능에 근접하는 경우가 많습니다.
제약된 접근법(closest, selective)은 제약되지 않은 제로샷보다 우수한 성능을 낼 수 있으며, closest가 종종 강력한 결과를 제공합니다.
ChatGPT 기반 수정은 LibriSpeech 관련 도메인 내 데이터에서 더 큰 이득을 보이고 일부 도메인 외부 세트에서도 이득이 있지만, N-best 다양성 이슈로 Whisper 기반 케이스에서 악화될 수 있습니다.
T5 기반 오류 수정에 비해 ChatGPT는 여러 시나리오에서 경쟁력이 있거나 우수할 수 있으며, 사전 모델 학습이 필요하지 않습니다.

Fig. 2 : Prompt design for (a) zero-shot unconstrained error correction, (b) zero-shot selective approach, and (c) 1-shot unconstrained error correction. Here we use a 3-best list generated by the ASR system as input to ChatGPT for illustration.

더 나은 연구,지금 바로 시작하세요

연구 설계부터 논문 작성까지, 연구 시간을 획기적으로 줄여보세요.

카드 등록 없음 · 무료 플랜 제공

이 리뷰는 AI가 만들고, 인간 에디터가 검토했습니다.