QUICK REVIEW

[Paper Review] DCCRN: Deep Complex Convolution Recurrent Network for Phase-Aware Speech Enhancement

Yanxin Hu, Yun Liu | arXiv (Cornell University) | 2020. 08. 01.

Speech and Audio Processing | References: 32 | Citations: 59

One-Line Summary

Introduces the Deep Complex Convolution Recurrent Network (DCCRN) for phase-aware single-channel speech enhancement using complex-valued operations, achieving strong PESQ/MOS with a compact model.

ABSTRACT

Speech enhancement has benefited from the success of deep learning in terms of intelligibility and perceptual quality. Conventional time-frequency (TF) domain methods focus on predicting TF-masks or speech spectrum, via a naive convolution neural network (CNN) or recurrent neural network (RNN). Some recent studies use complex-valued spectrogram as a training target but train in a real-valued network, predicting the magnitude and phase component or real and imaginary part, respectively. Particularly, convolution recurrent network (CRN) integrates a convolutional encoder-decoder (CED) structure and long short-term memory (LSTM), which has been proven to be helpful for complex targets. In order to train the complex target more effectively, in this paper, we design a new network structure simulating the complex-valued operation, called Deep Complex Convolution Recurrent Network (DCCRN), where both CNN and RNN structures can handle complex-valued operation. The proposed DCCRN models are very competitive over other previous networks, either on objective or subjective metric. With only 3.7M parameters, our DCCRN models submitted to the Interspeech 2020 Deep Noise Suppression (DNS) challenge ranked first for the real-time-track and second for the non-real-time track in terms of Mean Opinion Score (MOS).

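Motivation and Objectives

Improve speech enhancement by using a complex-valued CRN to model both magnitude and phase information.
Reduce model size and computational complexity while maintaining or improving perceptual quality.
Demonstrate strong performance on the real-time and non-real-time tracks of the DNS Challenge with phase-aware targets.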

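Proposed Method

Design a Deep Complex Convolution Recurrent Network with a complex-valued encoder/decoder and a complex-valued LSTM.
Simulate complex-valued operations using complex convolution, complex batch normalization, and complex LSTM.
Train with a signal approximation loss that targets a complex ratio mask (CRM) or a magnitude mask, and optimize SI-SNR in the time domain.
Compare four DCCRN variants (R, C, E, CL) against baseline CRN/DCUNET models on WSJ0-simulated data and DNS Challenge data.
Use STFT/iSTFT for waveform synthesis during training and SI-SNR as the loss function.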
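
Two of the ingredients listed above can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation: the function names `complex_conv1d` and `si_snr` are hypothetical, the convolution is reduced to a single-channel 1-D "valid" case for readability, and the SI-SNR version omits the zero-mean normalization that some implementations apply. The first function shows how a complex-valued convolution can be simulated with four real-valued convolutions, following the complex multiplication rule (a + jb)(c + jd) = (ac - bd) + j(ad + bc). The second shows the usual scale-invariant SNR definition computed on time-domain waveforms.

```python
import math


def complex_conv1d(x_r, x_i, w_r, w_i):
    """Simulate a complex 'valid' 1-D convolution with four real-valued ones.

    Inputs are the real and imaginary parts of the signal (x) and kernel (w).
    Uses the cross-correlation form common in deep learning frameworks.
    """
    def real_conv(x, w):
        n_out = len(x) - len(w) + 1
        return [sum(x[t + k] * w[k] for k in range(len(w))) for t in range(n_out)]

    rr = real_conv(x_r, w_r)  # real input with real kernel
    ii = real_conv(x_i, w_i)  # imaginary input with imaginary kernel
    ri = real_conv(x_r, w_i)  # real input with imaginary kernel
    ir = real_conv(x_i, w_r)  # imaginary input with real kernel

    out_r = [a - b for a, b in zip(rr, ii)]  # real part: rr - ii
    out_i = [a + b for a, b in zip(ri, ir)]  # imaginary part: ri + ir
    return out_r, out_i


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length waveforms.

    The estimate is projected onto the target; the residual counts as noise.
    A training loss would typically be the negative of this value.
    """
    dot = sum(e * t for e, t in zip(estimate, target))
    target_energy = sum(t * t for t in target)
    scale = dot / (target_energy + eps)

    s_target = [scale * t for t in target]
    e_noise = [e - s for e, s in zip(estimate, s_target)]

    signal_power = sum(v * v for v in s_target)
    noise_power = sum(v * v for v in e_noise)
    return 10.0 * math.log10((signal_power + eps) / (noise_power + eps))
```

In a full model the same four-way split would be applied per channel with 2-D kernels over the time-frequency plane, and `si_snr` would be evaluated on the waveform produced by the iSTFT of the masked spectrogram. Because the estimate is projected onto the target, rescaling the estimate leaves the metric essentially unchanged.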

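Experimental Results

Research Questions

RQ1: Does a fully complex-valued CRN improve phase-aware speech enhancement compared with real-valued or magnitude-only targets?
RQ2: How do the different DCCRN target representations (R, C, E, CL) affect objective (PESQ) and subjective (MOS) performance?
RQ3: What are the trade-offs among model size, real-time feasibility, and enhancement quality on the WSJ0 and DNS Challenge datasets?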

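Key Results

The DCCRN variants outperform the LSTM and CRN baselines in PESQ on the simulated WSJ0 dataset.
DCCRN-E achieves a strong DNS Challenge MOS on the real-time track and also performs well on the non-real-time track; DCCRN-CL offers an additional PESQ gain but can cause over-suppression on some clips.
Across the WSJ0 and DNS data, DCCRN models reach PESQ close to that of DCUNET with far fewer parameters and much less computation (DCUNET is about 6x heavier than DCCRN-CL).
DCCRN-E-Aug (trained with more reverberant data) increases the MOS gain in reverberant cases.
In the final subjective evaluation, DCCRN-E achieves an average MOS of about 3.42 (no-reverb/reverb, unmixed) at 3.12 ms per frame on a desktop CPU/GPU setup.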


This review was generated by AI and reviewed by a human editor.