QUICK REVIEW

[Paper Review] RIR-Mega-Speech: A Reverberant Speech Corpus with Comprehensive Acoustic Metadata and Reproducible Evaluation

Mandip Goswami | arXiv (Cornell University) | 2026-01-25

Speech and Audio Processing · Citations: 0

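One-Line Summary

This paper introduces RIR-Mega-Speech, a large reverberant speech corpus with per-file RT60, DRR, and C50 annotations, together with reproducible generation and evaluation scripts built on LibriSpeech and simulated RIRs from RIR-Mega. It reports the WER increase of Whisper small under reverberation and provides tooling for exact reproduction.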

ABSTRACT

Despite decades of research on reverberant speech, comparing methods remains difficult because most corpora lack per-file acoustic annotations or provide limited documentation for reproduction. We present RIR-Mega-Speech, a corpus of approximately 117.5 hours created by convolving LibriSpeech utterances with roughly 5,000 simulated room impulse responses from the RIR-Mega collection. Every file includes RT60, direct-to-reverberant ratio (DRR), and clarity index ($C_{50}$) computed from the source RIR using clearly defined, reproducible procedures. We also provide scripts to rebuild the dataset and reproduce all evaluation results. Using Whisper small on 1,500 paired utterances, we measure 5.20% WER (95% CI: 4.69--5.78) on clean speech and 7.70% (7.04--8.35) on reverberant versions, corresponding to a paired increase of 2.50 percentage points (2.06--2.98). This represents a 48% relative degradation. WER increases monotonically with RT60 and decreases with DRR, consistent with prior perceptual studies. While the core finding that reverberation harms recognition is well established, we aim to provide the community with a standardized resource where acoustic conditions are transparent and results can be verified independently. The repository includes one-command rebuild instructions for both Windows and Linux environments.

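Motivation and Objectives

Provide a standardized, reproducible reverberant speech corpus with per-file RT60, DRR, and C50 annotations.
Make the audio, metrics, and evaluation results exactly regenerable through open-source scripts.
Quantify the ASR degradation caused by reverberation and analyze trends across acoustic parameters using a modern model (Whisper small).
Provide speaker-stratified train/dev/test splits to support robust model evaluation and fair comparison.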

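Proposed Method

Convolve roughly 5,000 RIRs from the RIR-Mega collection with LibriSpeech dev-clean and test-clean utterances, producing about 53,230 reverberant files totaling roughly 117.5 hours.
Compute RT60 (Schroeder backward integration), DRR (2.5 ms direct window), and C50 from the source RIRs before convolution, and store them in a unified metadata CSV.
Release the full code needed to rebuild the audio, recompute the metrics, and reproduce all evaluation results, with one-command scripts for both Windows and Linux.
Run Whisper small on 1,500 clean/reverberant utterance pairs to obtain WER, and evaluate the paired WER difference with bootstrap confidence intervals.
Run ablation experiments with loudness normalization and additive noise to assess robustness and perceptual relevance.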
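The per-file metrics named in the method can be sketched as follows. This is a minimal illustration and not the authors' released code: the function names, the -5 to -25 dB line-fit range for RT60, the choice of a window of ±2.5 ms around the strongest peak for the direct path, and the use of the peak as the onset for C50 are all assumptions made here for the example.

```python
import numpy as np

def schroeder_edc_db(rir):
    """Energy decay curve in dB via Schroeder backward integration."""
    energy = np.asarray(rir, dtype=np.float64) ** 2
    edc = np.cumsum(energy[::-1])[::-1]      # energy remaining from each sample onward
    edc = edc / edc[0]
    return 10.0 * np.log10(np.maximum(edc, 1e-12))

def rt60_from_rir(rir, fs, fit_db=(-5.0, -25.0)):
    """RT60 from a line fit to the EDC, extrapolated to -60 dB.

    The fit range is an assumption (a T20-style fit), not taken from the paper.
    """
    edc_db = schroeder_edc_db(rir)
    t = np.arange(len(edc_db)) / fs
    mask = (edc_db <= fit_db[0]) & (edc_db >= fit_db[1])
    slope, _ = np.polyfit(t[mask], edc_db[mask], 1)   # dB per second (negative)
    return -60.0 / slope

def drr_db(rir, fs, window_ms=2.5):
    """Direct-to-reverberant ratio; direct part = +/- window_ms around the peak (assumed)."""
    rir = np.asarray(rir, dtype=np.float64)
    peak = int(np.argmax(np.abs(rir)))
    half = int(round(window_ms * 1e-3 * fs))
    lo, hi = max(0, peak - half), min(len(rir), peak + half + 1)
    direct = np.sum(rir[lo:hi] ** 2)
    reverb = np.sum(rir ** 2) - direct
    return 10.0 * np.log10(max(direct, 1e-12) / max(reverb, 1e-12))

def c50_db(rir, fs):
    """Clarity index: energy up to 50 ms after the peak vs. energy after that point."""
    rir = np.asarray(rir, dtype=np.float64)
    split = int(np.argmax(np.abs(rir))) + int(round(0.050 * fs))
    early = np.sum(rir[:split] ** 2)
    late = np.sum(rir[split:] ** 2)
    return 10.0 * np.log10(max(early, 1e-12) / max(late, 1e-12))

if __name__ == "__main__":
    # Synthetic RIR: exponentially decaying noise with a nominal RT60 of 0.5 s.
    fs = 16000
    t = np.arange(fs) / fs
    rng = np.random.default_rng(0)
    rir = rng.standard_normal(fs) * np.exp(-6.908 * t / 0.5)
    print(rt60_from_rir(rir, fs), drr_db(rir, fs), c50_db(rir, fs))
```

A reverberant file would then be produced by convolving a clean utterance with the same RIR (for example with `numpy.convolve`), and the three values written to that file's row in the metadata CSV.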

Figure 1: RT60 distribution across all reverberant files. Most files fall between 0.2 and 0.8 seconds.

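Experimental Results

Research Questions

RQ1: How does reverberation affect ASR performance (WER) relative to the paired clean utterances?
RQ2: How do RT60 and DRR affect WER, and how do they interact under reverberant conditions?
RQ3: Do loudness normalization or additive noise change the effect of reverberation on WER in this corpus?
RQ4: Can the corpus serve as a reproducible foundation with per-file acoustic metadata that supports independent reconstruction and robust-ASR research?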

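Key Results

On 1,500 paired utterances, Whisper small achieves 5.20% WER on clean speech (95% CI: 4.69–5.78) and 7.70% WER on the reverberant versions (95% CI: 7.04–8.35).
The paired WER increase due to reverberation is 2.50 percentage points (95% CI: 2.06–2.98), a relative increase of about 48%.
WER rises monotonically with RT60 (from about 6% at 0.2–0.4 s to about 10% at 1.0–1.2 s) and falls as DRR increases, converging to the clean-speech level at high DRR.
In a 500-utterance ablation, the effect of loudness normalization on WER is inconclusive, while adding white noise at 10–15 dB SNR raises WER sharply, to about 31%.
Error analysis shows that the hardest cases occur at RT60 > 0.8 s and DRR < -5 dB; most errors are phoneme substitutions or omitted function words.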
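The paired WER difference and its bootstrap confidence interval can be illustrated with a short sketch. It assumes a percentile bootstrap that resamples utterance pairs with replacement and pools word errors over each resample; the paper's exact resampling scheme and text normalization are not described in this review, so the function names and defaults below are hypothetical.

```python
import numpy as np

def word_errors(ref, hyp):
    """Word-level Levenshtein distance and reference length for one utterance."""
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))
    for i, rw in enumerate(r, start=1):
        cur = [i] + [0] * len(h)
        for j, hw in enumerate(h, start=1):
            cur[j] = min(prev[j] + 1,                 # deletion
                         cur[j - 1] + 1,              # insertion
                         prev[j - 1] + (rw != hw))    # substitution or match
        prev = cur
    return prev[-1], len(r)

def paired_bootstrap_wer(refs, clean_hyps, reverb_hyps, n_boot=2000, seed=0):
    """Reverberant-minus-clean WER (percentage points) with a 95% percentile CI.

    The same resampled utterance indices are used for both conditions,
    which is what makes the comparison paired.
    """
    clean = np.array([word_errors(r, h) for r, h in zip(refs, clean_hyps)])
    reverb = np.array([word_errors(r, h) for r, h in zip(refs, reverb_hyps)])
    n_words = clean[:, 1]
    rng = np.random.default_rng(seed)
    deltas = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, len(refs), size=len(refs))
        diff = reverb[idx, 0].sum() - clean[idx, 0].sum()
        deltas[b] = 100.0 * diff / n_words[idx].sum()
    point = 100.0 * (reverb[:, 0].sum() - clean[:, 0].sum()) / n_words.sum()
    lo, hi = np.percentile(deltas, [2.5, 97.5])
    return point, lo, hi

if __name__ == "__main__":
    # Toy transcripts for illustration only; these are not corpus data.
    refs = ["the cat sat", "hello world"]
    clean = ["the cat sat", "hello world"]
    reverb = ["the cat", "hello word"]
    print(paired_bootstrap_wer(refs, clean, reverb))
```

The relative figure follows directly from the reported point estimates: 2.50 / 5.20 ≈ 0.48, in line with the 48% relative degradation stated in the abstract.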

Figure 2: DRR distribution using a 2.5 ms direct-only window. The long tail toward negative values reflects weak direct arrivals in some simulated RIRs.


This review was generated by AI and checked by a human editor.