QUICK REVIEW

[Paper Review] CHiME-6 Challenge: Tackling Multispeaker Speech Recognition for Unsegmented Recordings

Shinji Watanabe, Michael Mandel | arXiv (Cornell University) | 2020-04-20

Speech Recognition and Synthesis | References: 49 | Citations: 97

One-Line Summary

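The paper introduces CHiME-6 with two tracks: Track 1 covers segmented multispeaker ASR, and Track 2 covers unsegmented multispeaker ASR, including diarization. It also provides open-source Kaldi baselines for end-to-end multispeaker processing.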

ABSTRACT

Following the success of the 1st, 2nd, 3rd, 4th and 5th CHiME challenges we organize the 6th CHiME Speech Separation and Recognition Challenge (CHiME-6). The new challenge revisits the previous CHiME-5 challenge and further considers the problem of distant multi-microphone conversational speech diarization and recognition in everyday home environments. Speech material is the same as the previous CHiME-5 recordings except for accurate array synchronization. The material was elicited using a dinner party scenario with efforts taken to capture data that is representative of natural conversational speech. This paper provides a baseline description of the CHiME-6 challenge for both segmented multispeaker speech recognition (Track 1) and unsegmented multispeaker speech recognition (Track 2). Of note, Track 2 is the first challenge activity in the community to tackle an unsegmented multispeaker speech recognition scenario with a complete set of reproducible open source baselines providing speech enhancement, speaker diarization, and speech recognition modules.

Research Motivation and Goals

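Advance distant-microphone multispeaker ASR in real home environments through two tracks (segmented and unsegmented).
Provide reproducible baselines that integrate speech enhancement, diarization, and ASR components in Kaldi.
Quantify how diarization errors affect recognition performance when diarization is run under realistic conditions.
Provide open-source recipes to lower the barrier to entry for researchers taking on unsegmented multispeaker ASR.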

Proposed Method

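Two challenge tracks: Track 1 (ASR using oracle diarization) and Track 2 (diarization + ASR).
An array synchronization baseline for aligning multiple commercial 4-channel microphone arrays.
A speech enhancement front end based on guided source separation (GSS) and BeamformIt, with optional WPE dereverberation.
A Kaldi-based ASR pipeline with MFCC features, GMM-HMM, and chain TDNN-F acoustic models.
Data preparation, data augmentation, and two-stage decoding that combines i-vectors with beamforming-based enhancement.
The Track 2 diarization pipeline uses x-vectors (TDNN) with PLDA scoring and agglomerative hierarchical clustering (AHC), and adds RTTM-based evaluation.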
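
The clustering step in the diarization pipeline above can be illustrated with a minimal sketch. It is not the Kaldi recipe: cosine similarity stands in for PLDA scoring, toy 2-D vectors stand in for x-vectors, and the function names and the `threshold` value are hypothetical choices made only for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (a stand-in for PLDA scoring)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def ahc(embeddings, threshold):
    """Average-linkage AHC: repeatedly merge the most similar pair of clusters
    until the best remaining similarity falls below `threshold`."""
    clusters = [[i] for i in range(len(embeddings))]

    def avg_sim(a, b):
        sims = [cosine(embeddings[i], embeddings[j]) for i in a for j in b]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        best = None  # (similarity, index of first cluster, index of second)
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = avg_sim(clusters[x], clusters[y])
                if best is None or s > best[0]:
                    best = (s, x, y)
        if best[0] < threshold:
            break  # no pair is similar enough to merge
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]

    # Convert the cluster membership lists into one label per segment.
    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Toy "segment embeddings": two segments per hypothetical speaker.
toy = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.9)]
labels = ahc(toy, threshold=0.8)
print(labels)
```

In a real system the stopping threshold (or a target speaker count) would be tuned on development data, and each labeled segment would then be written out as an RTTM entry for scoring.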

Experimental Results

Research Questions

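RQ1: How does diarization affect ASR performance on unsegmented multispeaker recordings?
RQ2: Can reproducible open-source baselines for synchronization, enhancement, diarization, and ASR lower the barrier to entry for CHiME-6-style tasks?
RQ3: What is the baseline performance gap between segmented and unsegmented multispeaker ASR in real home environments?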

Key Results

Track 1 baseline ASR WER: DEV 51.8%, EVAL 51.3%.
Track 2 baseline SAD results (annotation RTTM), as diarization error rate (DER) and Jaccard error rate (JER): DEV DER 61.6%, JER 69.8%; EVAL DER 62.0%, JER 71.4%.
Track 2 baseline SAD results (alignment RTTM): DEV DER 63.4%, JER 70.8%; EVAL DER 68.2%, JER 72.5%.
Track 1 WER with BeamformIt enhancement: DEV 69.8%, EVAL 61.2%.
Track 1 WER with GSS enhancement: DEV 51.8%, EVAL 51.3%.
Track 2 WER with BeamformIt enhancement: DEV 84.3%, EVAL 77.9%.
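
To make the comparisons in these figures explicit, the following sketch computes the relative improvement of GSS over BeamformIt on Track 1 and the absolute gap between Track 2 and Track 1 under BeamformIt. It assumes the enhancement figures are WER percentages, as with the Track 1 baseline; the dictionary layout and the `relative_reduction` helper are illustrative names, not part of the challenge tooling.

```python
# Figures (%) copied from the key results listed above; assumed to be WER.
wer = {
    ("track1", "beamformit"): {"dev": 69.8, "eval": 61.2},
    ("track1", "gss"): {"dev": 51.8, "eval": 51.3},
    ("track2", "beamformit"): {"dev": 84.3, "eval": 77.9},
}


def relative_reduction(baseline, improved):
    """Relative error reduction in percent, measured against `baseline`."""
    return 100.0 * (baseline - improved) / baseline


gains = {}  # relative WER reduction of GSS over BeamformIt (Track 1)
gaps = {}   # absolute WER gap, Track 2 minus Track 1 (both BeamformIt)
for split in ("dev", "eval"):
    bf1 = wer[("track1", "beamformit")][split]
    gss1 = wer[("track1", "gss")][split]
    bf2 = wer[("track2", "beamformit")][split]
    gains[split] = relative_reduction(bf1, gss1)
    gaps[split] = bf2 - bf1
    print(f"{split}: GSS vs BeamformIt {gains[split]:.1f}% relative, "
          f"Track 2 vs Track 1 gap {gaps[split]:.1f} points")
```

On these numbers, GSS lowers Track 1 error by roughly a quarter on DEV and about a sixth on EVAL relative to BeamformIt, while moving from segmented to unsegmented input costs about 15 to 17 absolute points.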

This review was generated by AI and reviewed by a human editor.