QUICK REVIEW

[Paper Review] Improved Speech Enhancement with the Wave-U-Net

Craig Macartney, Tillman Weyde | arXiv (Cornell University) | Nov 27, 2018

Speech and Audio Processing | References: 10 | Cited by: 56

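ONE-SENTENCE SUMMARY

This paper applies the time-domain Wave-U-Net architecture to speech enhancement and improves objective metrics over previous methods in the Voice Bank/VCTK setting. A smaller network is enough to perform well on speech, whereas singing voice separation requires a larger one.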

ABSTRACT

We study the use of the Wave-U-Net architecture for speech enhancement, a model introduced by Stoller et al for the separation of music vocals and accompaniment. This end-to-end learning method for audio source separation operates directly in the time domain, permitting the integrated modelling of phase information and being able to take large temporal contexts into account. Our experiments show that the proposed method improves several metrics, namely PESQ, CSIG, CBAK, COVL and SSNR, over the state-of-the-art with respect to the speech enhancement task on the Voice Bank corpus (VCTK) dataset. We find that a reduced number of hidden layers is sufficient for speech enhancement in comparison to the original system designed for singing voice separation in music. We see this initial result as an encouraging signal to further explore speech enhancement in the time-domain, both as an end in itself and as a pre-processing step to speech recognition systems.

RESEARCH MOTIVATION AND OBJECTIVES

Motivate and evaluate end-to-end time-domain speech enhancement using Wave-U-Net to jointly model waveform and phase information.
Investigate whether Wave-U-Net can outperform state-of-the-art speech enhancement methods on standard benchmarks.
Assess the impact of network size on performance for speech enhancement.
Compare Wave-U-Net with Wiener filtering and SEGAN baselines to establish effectiveness for speech enhancement tasks.

PROPOSED METHOD

Adopt the Wave-U-Net architecture, a 1D U-Net with downsampling and upsampling blocks, to predict two sources from a monaural mixture.
Predict the sources sample by sample with a 1D convolution of K·C filters, followed by a tanh nonlinearity that bounds outputs to (-1, 1).
Use LeakyReLU activations in all layers except final outputs.
Train on randomly sampled audio excerpts using Adam with a learning rate of 1e-4 and a batch size of 16, with early stopping based on a validation set.
Fine-tune the best model with a doubled batch size and a reduced learning rate (1e-5), stopping once 20 epochs pass without validation improvement.
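
The two-stage schedule above (training with early stopping, then fine-tuning the best model with a doubled batch size and a lower learning rate) can be sketched roughly as follows. This is a framework-agnostic illustration, not the authors' code: `run_stage`, `two_stage_training`, `train_epoch`, and `validate` are hypothetical names, and using the same 20-epoch patience in the first stage is an assumption, since the summary only gives that figure for fine-tuning.

```python
from typing import Callable, Tuple


def run_stage(
    train_epoch: Callable[[float, int], None],
    validate: Callable[[], float],
    lr: float,
    batch_size: int,
    patience: int,
    best_loss: float = float("inf"),
    max_epochs: int = 1000,
) -> Tuple[float, int]:
    """Train until `patience` epochs pass without a lower validation loss.

    Returns the best validation loss seen and the number of epochs run.
    """
    epochs_without_improvement = 0
    epochs_run = 0
    while epochs_without_improvement < patience and epochs_run < max_epochs:
        train_epoch(lr, batch_size)  # hypothetical: one pass over random excerpts
        loss = validate()            # hypothetical: loss on the validation set
        epochs_run += 1
        if loss < best_loss:
            best_loss = loss         # a real loop would also checkpoint the model here
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
    return best_loss, epochs_run


def two_stage_training(train_epoch, validate, patience: int = 20):
    # Stage 1: settings from the summary (learning rate 1e-4, batch size 16).
    # Assumption: the same patience is used here as in the fine-tuning stage.
    best, n_train = run_stage(train_epoch, validate, lr=1e-4, batch_size=16,
                              patience=patience)
    # Stage 2: fine-tune with doubled batch size and learning rate 1e-5,
    # continuing from the best validation loss reached in stage 1.
    best, n_finetune = run_stage(train_epoch, validate, lr=1e-5, batch_size=32,
                                 patience=patience, best_loss=best)
    return best, n_train, n_finetune
```

Passing the stage-1 best loss into stage 2 means fine-tuning only counts as an improvement when it beats the model it started from, which is one reasonable reading of "fine-tune the best model".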

EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

RQ1: Does time-domain Wave-U-Net improve speech enhancement metrics over state-of-the-art methods on the Voice Bank/VCTK dataset?
RQ2: What is the impact of network depth on Wave-U-Net performance for speech enhancement, and is a smaller model sufficient?
RQ3: How does Wave-U-Net compare to Wiener filtering and SEGAN in objective speech quality and intelligibility metrics?
RQ4: Can the Wave-U-Net architecture be effectively tuned for speech enhancement and potentially extended to multi-channel/multi-source settings?

KEY FINDINGS

Wave-U-Net outperforms Wiener filtering and SEGAN on PESQ, CSIG, CBAK, COVL, and SSNR metrics for speech enhancement.
The best configuration among the tested variants is a 10-layer Wave-U-Net with fine-tuning.
Without fine-tuning, the 9- and 10-layer Wave-U-Nets perform best, suggesting that the optimal receptive field for speech is smaller than the one used for music source separation.
Wave-U-Net reaches a higher SSNR (9.97) than the baselines (Noisy 1.68, Wiener 5.07, SEGAN 7.73).
Fewer hidden layers are sufficient for speech enhancement than for singing voice separation tasks.
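
For readers unfamiliar with SSNR (segmental signal-to-noise ratio), the metric averages a per-frame SNR in dB between the clean reference and the enhanced output. The sketch below is a simplified illustration, not the evaluation code behind the figures above: it assumes non-overlapping, unwindowed frames of 256 samples and clamps each frame to the commonly used range of -10 to 35 dB, whereas standard implementations typically use overlapping windowed frames. The function name `segmental_snr` and its parameters are hypothetical.

```python
import math
from typing import Sequence


def segmental_snr(
    clean: Sequence[float],
    enhanced: Sequence[float],
    frame_len: int = 256,    # assumption: frame size in samples
    min_db: float = -10.0,   # assumption: common lower clamp
    max_db: float = 35.0,    # assumption: common upper clamp
    eps: float = 1e-10,      # avoids log(0) and division by zero
) -> float:
    """Mean per-frame SNR in dB between a clean reference and an estimate."""
    if len(clean) != len(enhanced):
        raise ValueError("signals must have the same length")
    frame_snrs = []
    # Non-overlapping frames; a trailing partial frame is ignored.
    for start in range(0, len(clean) - frame_len + 1, frame_len):
        signal_energy = 0.0
        noise_energy = 0.0
        for s, e in zip(clean[start:start + frame_len],
                        enhanced[start:start + frame_len]):
            signal_energy += s * s
            noise_energy += (s - e) ** 2   # residual error counts as noise
        snr_db = 10.0 * math.log10((signal_energy + eps) / (noise_energy + eps))
        frame_snrs.append(min(max(snr_db, min_db), max_db))  # clamp outlier frames
    if not frame_snrs:
        raise ValueError("signal shorter than one frame")
    return sum(frame_snrs) / len(frame_snrs)
```

As a sanity check, scaling a clean signal by 0.9 leaves a residual with 1% of the signal energy in every frame, so the function returns about 20 dB. A perfect reconstruction saturates at the 35 dB upper clamp.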

This review was generated by AI and reviewed by human editors.