QUICK REVIEW

[Paper Review] VoiceFixer: Toward General Speech Restoration with Neural Vocoder

Haohe Liu, Qiuqiang Kong | arXiv (Cornell University) | 2021. 09. 28.

Speech Recognition and Synthesis | References: 76 | Citations: 25

One-Line Summary

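VoiceFixer introduces a two-stage general speech restoration (GSR) framework that combines a mel-spectrogram analysis stage with a neural-vocoder synthesis stage, improving MOS over single-task SSR baselines across a range of distortions.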

ABSTRACT

Speech restoration aims to remove distortions in speech signals. Prior methods mainly focus on single-task speech restoration (SSR), such as speech denoising or speech declipping. However, SSR systems only focus on one task and do not address the general speech restoration problem. In addition, previous SSR systems show limited performance in some speech restoration tasks such as speech super-resolution. To overcome those limitations, we propose a general speech restoration (GSR) task that attempts to remove multiple distortions simultaneously. Furthermore, we propose VoiceFixer, a generative framework to address the GSR task. VoiceFixer consists of an analysis stage and a synthesis stage to mimic the speech analysis and comprehension of the human auditory system. We employ a ResUNet to model the analysis stage and a neural vocoder to model the synthesis stage. We evaluate VoiceFixer with additive noise, room reverberation, low-resolution, and clipping distortions. Our baseline GSR model achieves a 0.499 higher mean opinion score (MOS) than the speech enhancement SSR model. VoiceFixer further surpasses the GSR baseline model on the MOS score by 0.256. Moreover, we observe that VoiceFixer generalizes well to severely degraded real speech recordings, indicating its potential in restoring old movies and historical speeches. The source code is available at https://github.com/haoheliu/voicefixer_main.

Motivation and Objectives

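Define the general speech restoration (GSR) task, in which a single model removes multiple distortions simultaneously.
Propose VoiceFixer, a two-stage framework that mimics human auditory processing, to improve restoration quality.
Decouple the analysis and synthesis stages through a mel-spectrogram representation to achieve robust performance across diverse distortions.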

Proposed Method

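Two-stage architecture: the analysis stage maps distorted audio to a mel-spectrogram representation, and the synthesis stage uses a neural vocoder to generate a waveform from the mel input.
The analysis stage is modeled with a ResUNet that restores the mel spectrogram from the mel-filtered input.
The synthesis stage uses a non-autoregressive vocoder (TFGAN) trained with adversarial losses and multi-domain spectral and temporal losses.
The training losses combine an MAE loss for mel restoration with time-domain and frequency-domain losses for the vocoder.
The discriminators, which include multi-resolution time-domain, subband, and frequency-domain discriminators, guide vocoder training.
The vocoder loss combines L_F (mel loss and multi-resolution spectrogram loss) and L_T (segment, energy, and phase losses) with an adversarial component (L_D).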
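
The loss terms listed above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the STFT resolutions, the log-magnitude L1 form of the spectrogram loss, and the unit weights in `vocoder_loss` are all assumptions chosen for illustration, and the time-domain and adversarial terms are passed in as plain numbers rather than computed.

```python
import numpy as np

def stft_magnitude(x, n_fft, hop):
    """Magnitude STFT with a Hann window; trailing samples that do not fill a frame are dropped."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))  # shape: (n_frames, n_fft // 2 + 1)

def mel_mae(mel_pred, mel_target):
    """Analysis-stage loss: mean absolute error between two mel spectrograms."""
    return float(np.mean(np.abs(mel_pred - mel_target)))

def multi_resolution_spec_loss(y_pred, y_true,
                               resolutions=((512, 128), (1024, 256), (2048, 512))):
    """L1 distance between log-magnitude spectrograms, averaged over several
    (n_fft, hop) settings. The resolution values are hypothetical."""
    eps = 1e-7  # avoids log(0)
    losses = []
    for n_fft, hop in resolutions:
        s_pred = stft_magnitude(y_pred, n_fft, hop)
        s_true = stft_magnitude(y_true, n_fft, hop)
        losses.append(np.mean(np.abs(np.log(s_pred + eps) - np.log(s_true + eps))))
    return float(np.mean(losses))

def vocoder_loss(l_f, l_t, l_d, w_f=1.0, w_t=1.0, w_d=1.0):
    """Weighted sum of frequency-domain (L_F), time-domain (L_T), and
    adversarial (L_D) terms. The weights are placeholders, not paper values."""
    return w_f * l_f + w_t * l_t + w_d * l_d
```

Averaging the spectrogram distance over several window sizes is a common way to penalize errors at both fine time resolution (short windows) and fine frequency resolution (long windows); whether VoiceFixer's vocoder uses these exact settings is not stated in this review.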

Experimental Results

Research Questions

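RQ1: Can a general speech restoration (GSR) model restore multiple distortions within a single framework?
RQ2: Does the two-stage VoiceFixer architecture outperform single-stage SSR baselines on MOS and other subjective metrics across diverse distortions?
RQ3: How well does VoiceFixer maintain the quality of the combined analysis-synthesis pipeline at low sampling rates?
RQ4: How do different analysis architectures (ResUNet vs. DNN/BiGRU) affect restoration quality?
RQ5: How much does a neural vocoder trained on large-scale speech data contribute to restoration performance?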

Key Results

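VoiceFixer (VF) with UNet-based analysis achieved the best MOS and LSD among the systems evaluated on ALL-GSR.
VF-UNet improved MOS by 0.256 over GSR-UNet on ALL-GSR.
VF-UNet's MOS is only 0.11 below the Oracle-Mel upper bound, suggesting strong analysis-stage performance.
VF performs strongly on low-sampling-rate super-resolution, where 2–8 kHz inputs are upsampled to 44.1 kHz, and outperforms several SSR models.
Across the ALL-GSR set, GSR-UNet generally outperforms the SSR baselines, and VoiceFixer further improves subjective quality.
Vocoder-based synthesis benefits from the prior knowledge of a neural vocoder trained on a large amount of speech data and from its low-dimensional input.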
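
Since the results above report LSD alongside MOS, a short sketch of one common definition of log-spectral distance may help. The frame size, hop, `eps` floor, and the function name `log_spectral_distance` are assumptions for illustration; the paper's exact evaluation settings are not given in this review.

```python
import numpy as np

def log_spectral_distance(ref, est, n_fft=2048, hop=512, eps=1e-8):
    """Log-spectral distance between a reference and an estimated waveform.

    Per frame: root-mean-square difference of log10 power spectra over
    frequency bins; the result is the mean over frames. Lower is better.
    Frame size, hop, and eps are illustrative choices.
    """
    window = np.hanning(n_fft)

    def power_spectrogram(x):
        n_frames = 1 + (len(x) - n_fft) // hop
        frames = np.stack([x[i * hop : i * hop + n_fft] * window
                           for i in range(n_frames)])
        return np.abs(np.fft.rfft(frames, axis=1)) ** 2

    n = min(len(ref), len(est))  # compare only the overlapping samples
    p_ref = power_spectrogram(ref[:n])
    p_est = power_spectrogram(est[:n])
    diff = np.log10(p_ref + eps) - np.log10(p_est + eps)
    return float(np.mean(np.sqrt(np.mean(diff ** 2, axis=1))))
```

Under this definition, scaling a signal's amplitude by 0.1 scales its power by 0.01, so every bin differs by about 2 in log10 power and the LSD is close to 2. LSD therefore measures spectral mismatch rather than perceived quality, which is why it is reported together with MOS.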


This review was generated by AI and reviewed by a human editor.