QUICK REVIEW

[Paper Digest] Mixture of Inference Networks for VAE-based Audio-visual Speech Enhancement

Mostafa Sadeghi, Xavier Alameda-Pineda | arXiv (Cornell University) | Dec 23, 2019

Speech and Audio Processing | References: 46 | Cited by: 24

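One-sentence summary

This paper proposes a mixture of inference networks for variational autoencoders (MIN-VAE) that improves audio-visual speech enhancement by decoupling inference over the audio and visual modalities. The method pairs two modality-specific encoders with a shared decoder and a learnable latent mixture variable, so that audio and visual information are fused adaptively. The result is robust unsupervised speech enhancement that needs no noise data during training and outperforms both the audio-only and the standard audio-visual VAE baselines.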

ABSTRACT

In this paper, we are interested in unsupervised (unknown noise) audio-visual speech enhancement based on variational autoencoders (VAEs), where the probability distribution of clean speech spectra is simulated using an encoder-decoder architecture. The trained generative model (decoder) is then combined with a noise model at test time to estimate the clean speech. In the speech enhancement phase (test time), the initialization of the latent variables, which describe the generative process of clean speech via decoder, is crucial, as the overall inference problem is non-convex. This is usually done by using the output of the trained encoder where the noisy audio and clean visual data are given as input. Current audio-visual VAE models do not provide an effective initialization because the two modalities are tightly coupled (concatenated) in the associated architectures. To overcome this issue, inspired by mixture models, we introduce the mixture of inference networks variational autoencoder (MIN-VAE). Two encoder networks input, respectively, audio and visual data, and the posterior of the latent variables is modeled as a mixture of two Gaussian distributions output from each encoder network. The mixture variable is also latent, and therefore the inference of learning the optimal balance between the audio and visual inference networks is unsupervised as well. By training a shared decoder, the overall network learns to adaptively fuse the two modalities. Moreover, at test time, the visual encoder, which takes (clean) visual data, is used for initialization. A variational inference approach is derived to train the proposed generative model. Thanks to the novel inference procedure and the robust initialization, the proposed MIN-VAE exhibits superior performance on speech enhancement than using the standard audio-only as well as audio-visual counterparts.
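The mixture posterior described in the abstract can be written compactly. The notation below is assumed for illustration and is not taken from the paper: $\mathbf{s}$ is the audio input, $\mathbf{v}$ the visual input, $\mathbf{z}$ the latent vector, $\alpha \in \{0, 1\}$ the latent mixture variable, and $\pi$ its weight. Diagonal covariances are also an assumption. Marginalizing over $\alpha$ gives, roughly:

```latex
q(\mathbf{z} \mid \mathbf{s}, \mathbf{v})
  = \pi \, \mathcal{N}\!\big(\mathbf{z};\, \boldsymbol{\mu}_a(\mathbf{s}),\, \operatorname{diag} \boldsymbol{\sigma}_a^2(\mathbf{s})\big)
  + (1 - \pi) \, \mathcal{N}\!\big(\mathbf{z};\, \boldsymbol{\mu}_v(\mathbf{v}),\, \operatorname{diag} \boldsymbol{\sigma}_v^2(\mathbf{v})\big),
\qquad \pi = q(\alpha = 1).
```

Here the subscripts $a$ and $v$ mark the outputs of the audio and visual encoders. At test time only the second component depends on clean data, which is consistent with using the visual encoder for initialization.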

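Motivation and objectives

To address the poor latent-variable initialization at inference time in standard audio-visual VAEs, which stems from tightly coupling (concatenating) the audio and visual modalities in the encoder.
To improve unsupervised audio-visual speech enhancement by fusing the audio and visual modalities adaptively, without requiring noise data during training.
To build a variational inference framework that jointly learns the modality-specific inference networks and the mixing mechanism used to estimate the latent posterior.
To initialize the latent space from clean visual data at test time, improving the stability and quality of inference in a non-convex problem.
To reach state-of-the-art speech enhancement under unknown noise conditions by combining generative modeling with robust multimodal inference.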

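Proposed method

Proposes the mixture of inference networks VAE (MIN-VAE), in which two separate encoders process the audio and visual inputs and each outputs a Gaussian posterior over the latent variables.
Models the posterior of the latent variables as a mixture of these two Gaussians; the mixing weight is learned through a latent Bernoulli variable, so the balance between modalities adapts without supervision.
Uses a shared decoder to reconstruct the clean speech spectrogram from the latent variables, keeping the generative model consistent across modalities.
Uses an EM-like variational inference procedure: the E-step approximates the posterior of the latent variables with Metropolis-Hastings sampling (a Markov chain Monte Carlo method), initialized from the visual encoder output.
The M-step optimizes the model parameters (decoder weights, noise parameters, and the mixing prior) with multiplicative updates to maximize the variational lower bound.
At test time, the latent variables are initialized from the visual encoder output, which keeps inference robust even when the audio is heavily corrupted by noise.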
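The sampling steps listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the random-walk proposal, and `toy_log_target` (a fixed Gaussian standing in for the true log-posterior of the latent variables given noisy speech) are all hypothetical, and the encoder outputs are hard-coded numbers rather than network predictions.

```python
import math
import random


def sample_mixture_posterior(mu_a, var_a, mu_v, var_v, pi_audio, rng):
    """Draw z from pi * N(mu_a, var_a) + (1 - pi) * N(mu_v, var_v)."""
    # Bernoulli mixture variable: True selects the audio component.
    use_audio = rng.random() < pi_audio
    mu, var = (mu_a, var_a) if use_audio else (mu_v, var_v)
    z = [rng.gauss(m, math.sqrt(s)) for m, s in zip(mu, var)]
    return z, use_audio


def metropolis_hastings(log_target, z_init, n_steps, step_std, rng):
    """Random-walk Metropolis-Hastings; returns the list of visited states."""
    z = list(z_init)
    log_p = log_target(z)
    chain = []
    for _ in range(n_steps):
        proposal = [zi + rng.gauss(0.0, step_std) for zi in z]
        log_p_new = log_target(proposal)
        # Symmetric proposal: accept with probability min(1, p_new / p_old).
        accept_prob = math.exp(min(0.0, log_p_new - log_p))
        if rng.random() < accept_prob:
            z, log_p = proposal, log_p_new
        chain.append(list(z))
    return chain


def toy_log_target(z):
    # Hypothetical stand-in for log p(z | noisy speech): unit Gaussian.
    center = [1.0, -0.5]
    return -0.5 * sum((zi - ci) ** 2 for zi, ci in zip(z, center))


rng = random.Random(0)

# Pretend encoder outputs (assumed values, two latent dimensions).
mu_a, var_a = [2.0, 1.0], [0.5, 0.5]    # audio encoder, fed noisy audio
mu_v, var_v = [0.8, -0.3], [0.1, 0.1]   # visual encoder, fed clean video

# One draw from the mixture posterior, as used during training.
z_sample, used_audio = sample_mixture_posterior(
    mu_a, var_a, mu_v, var_v, pi_audio=0.5, rng=rng
)

# Test-time E-step: start the chain at the visual encoder mean.
chain = metropolis_hastings(toy_log_target, mu_v, 5000, 0.5, rng)
kept = chain[1000:]  # discard burn-in
posterior_mean = [sum(s[d] for s in kept) / len(kept) for d in range(2)]
```

Starting the chain at `mu_v` rather than `mu_a` mirrors the point made above: the visual input is clean at test time, so its encoder output is a safer starting point when the audio is noisy.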

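Experimental results

Research questions

RQ1: Does decoupling audio and visual inference within the VAE framework improve latent-variable initialization and speech enhancement performance?
RQ2: Does a learnable mixture of inference networks fuse the audio and visual modalities more adaptively than a concatenation-based encoder?
RQ3: Does the proposed MIN-VAE outperform the baselines at unsupervised audio-visual speech enhancement under unknown noise conditions?
RQ4: How does initializing the latent variables from visual data affect inference stability and reconstruction quality?
RQ5: How well does the method generalize to unseen noise types compared with supervised and standard unsupervised baselines?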

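Main findings

On unseen noise types, the proposed MIN-VAE enhances speech better than both the audio-only VAE and the standard audio-visual VAE baselines.
Initializing from the visual encoder output keeps the model robust even when the visual input is noisy, degraded, or misaligned.
Modality-specific encoders combined with a learnable mixing mechanism allow audio and visual information to be fused more effectively and adaptively during inference.
Variational inference combined with Metropolis-Hastings sampling yields a stable posterior approximation despite the non-convexity of the problem.
Quantitatively, PESQ and STOI scores improve markedly over the baselines, especially at low signal-to-noise ratios (SNRs), which supports the effectiveness of the proposed architecture.
The model generalizes well to unseen noise types, owing to the unsupervised training paradigm and the decoupled learning of modality representations.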

This digest was generated by AI and reviewed by a human editor.