QUICK REVIEW

[Paper Review] TasNet: Surpassing Ideal Time-Frequency Masking for Speech Separation

Yi Luo, Nima Mesgarani|arXiv (Cornell University)|Sep 20, 2018
Speech and Audio Processing|References: 47|Cited by: 74
ONE-SENTENCE SUMMARY

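TasNet is an end-to-end, time-domain deep learning framework for speech separation that bypasses the time-frequency representation. It combines a convolutional encoder, masks estimated by a temporal convolutional network of dilated convolutions, and a linear decoder. It outperforms ideal time-frequency masking while offering a shorter minimum latency and a smaller model, which makes accurate real-time speaker separation practical.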

ABSTRACT

Robust speech processing in multitalker acoustic environments requires automatic speech separation. While single-channel, speaker-independent speech separation methods have recently seen great progress, the accuracy, latency, and computational cost of speech separation remain insufficient. The majority of the previous methods have formulated the separation problem through the time-frequency representation of the mixed signal, which has several drawbacks, including the decoupling of the phase and magnitude of the signal, the suboptimality of spectrogram representations for speech separation, and the long latency in calculating the spectrogram. To address these shortcomings, we propose the time-domain audio separation network (TasNet), which is a deep learning autoencoder framework for time-domain speech separation. TasNet uses a convolutional encoder to create a representation of the signal that is optimized for extracting individual speakers. Speaker extraction is achieved by applying a weighting function (mask) to the encoder output. The modified encoder representation is then inverted to the sound waveform using a linear decoder. The masks are found using a temporal convolutional network consisting of dilated convolutions, which allow the network to model the long-term dependencies of the speech signal. This end-to-end speech separation algorithm significantly outperforms previous time-frequency methods in terms of separating speakers in mixed audio, even when compared to the separation accuracy achieved with the ideal time-frequency mask of the speakers. In addition, TasNet has a smaller model size and a shorter minimum latency, making it a suitable solution for both offline and real-time speech separation applications. This study therefore represents a major step toward actualizing speech separation for real-world speech processing technologies.
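The encoder, mask, and decoder steps described in the abstract can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`frame`, `encode`, `uniform_masks`, `decode`, `separate`), the frame length, the hop, and the random weights are all hypothetical, and `uniform_masks` is a trivial stand-in for the temporal convolutional network that would estimate the masks in the real system.

```python
import random

def frame(signal, length, hop):
    """Split a waveform into overlapping frames of `length` samples."""
    return [signal[i:i + length] for i in range(0, len(signal) - length + 1, hop)]

def encode(frames, enc):
    """Apply the same bank of filters to every frame.

    Framing with a hop and reusing one filter bank is equivalent to a
    strided 1-D convolution. `enc` holds one row of weights per filter.
    """
    return [[sum(w * x for w, x in zip(row, f)) for row in enc] for f in frames]

def uniform_masks(feats, num_sources):
    """Hypothetical placeholder for the mask estimator (a TCN in the paper).

    Gives every source an equal share, so masks sum to one across sources.
    """
    share = 1.0 / num_sources
    return [[[share] * len(f) for f in feats] for _ in range(num_sources)]

def decode(masked_feats, dec, hop, out_len):
    """Linear decoder: weighted sum of basis waveforms, then overlap-add."""
    length = len(dec[0])
    out = [0.0] * out_len
    for t, feat in enumerate(masked_feats):
        for j in range(length):
            out[t * hop + j] += sum(c * basis[j] for c, basis in zip(feat, dec))
    return out

def separate(mixture, enc, dec, mask_fn, hop):
    """Encode the mixture, weight it with one mask per source, decode each."""
    feats = encode(frame(mixture, len(enc[0]), hop), enc)
    estimates = []
    for mask in mask_fn(feats):
        masked = [[a * m for a, m in zip(f, row)] for f, row in zip(feats, mask)]
        estimates.append(decode(masked, dec, hop, len(mixture)))
    return estimates

# Illustrative sizes only; none of these values come from the paper.
random.seed(0)
LENGTH, HOP, NUM_FILTERS = 4, 2, 6
enc = [[random.uniform(-1, 1) for _ in range(LENGTH)] for _ in range(NUM_FILTERS)]
dec = [[random.uniform(-1, 1) for _ in range(LENGTH)] for _ in range(NUM_FILTERS)]
mixture = [random.uniform(-1, 1) for _ in range(16)]
estimates = separate(mixture, enc, dec, lambda f: uniform_masks(f, 2), HOP)
print(len(estimates), len(estimates[0]))
```

Because the decoder is linear, masks that sum to one across speakers make the estimated sources add up to the decoded, unmasked mixture, which is a convenient sanity check for a sketch like this one.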

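MOTIVATION AND GOALS

  • Overcome the limitations of time-frequency representations in speech separation, such as the decoupling of phase and magnitude and the long latency of computing a spectrogram.
  • Develop a real-time, low-latency speech separation system suitable for practical deployment.
  • Achieve separation accuracy beyond that of ideal time-frequency masking through end-to-end deep learning.
  • Reduce computational cost and model size relative to existing time-frequency methods.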

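PROPOSED METHOD

  • TasNet uses a convolutional encoder to transform the raw waveform into a representation optimized for separating speakers.
  • A temporal convolutional network built from dilated convolutions estimates a mask for each speaker; applying that mask to the encoder output extracts the speaker's component.
  • A linear decoder inverts each masked representation back to a waveform, so the whole pipeline can be trained end to end.
  • Dilated convolutions enlarge the receptive field without a matching growth in depth or parameter count, which lets the network model long-term dependencies in speech.
  • The entire system is trained end to end by minimizing a loss that measures the difference between the estimated and target waveforms.
  • The method avoids computing a spectrogram, which removes the phase-magnitude decoupling problem and reduces latency.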
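The training bullet above does not name the loss. One waveform-level objective commonly used for time-domain separation is the scale-invariant signal-to-noise ratio (SI-SNR); the sketch below assumes that choice for illustration only, and the function names `si_snr` and `neg_si_snr_loss` are hypothetical.

```python
import math

def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length waveforms.

    Assumed objective for illustration; the summary only says the loss
    compares estimated and target waveforms.
    """
    n = len(target)
    mean_e = sum(estimate) / n
    mean_t = sum(target) / n
    e = [x - mean_e for x in estimate]  # remove DC offset
    t = [x - mean_t for x in target]
    # Project the estimate onto the target to isolate the "signal" part.
    dot = sum(a * b for a, b in zip(e, t))
    t_energy = sum(b * b for b in t) + eps
    proj = [dot / t_energy * b for b in t]
    # Whatever is left after the projection counts as noise.
    noise = [a - p for a, p in zip(e, proj)]
    p_energy = sum(p * p for p in proj)
    n_energy = sum(x * x for x in noise)
    return 10 * math.log10((p_energy + eps) / (n_energy + eps))

def neg_si_snr_loss(estimate, target):
    """Loss to minimize: higher SI-SNR means a better estimate."""
    return -si_snr(estimate, target)
```

Rescaling the estimate leaves the score essentially unchanged, so a loss of this kind rewards matching the shape of the target waveform rather than its volume.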

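EXPERIMENTAL RESULTS

Research Questions

  • RQ1: Can an end-to-end time-domain method outperform time-frequency methods at speech separation?
  • RQ2: Can a model trained in the time domain exceed the performance of ideal time-frequency masking?
  • RQ3: Can a time-domain system achieve lower latency and a smaller model size than existing methods?
  • RQ4: How effectively do dilated convolutions model long-term speech dependencies for speaker separation?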

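Main Findings

  • TasNet separates speech better than state-of-the-art time-frequency methods, in some cases even surpassing the ideal time-frequency mask.
  • Because it operates directly on the waveform, the model has a markedly shorter latency than spectrogram-based methods.
  • TasNet has a smaller model size, which suits real-time and resource-constrained applications.
  • Dilated convolutions make it possible to model long-term speech dependencies effectively in the time domain.
  • End-to-end training in the time domain removes the need for phase reconstruction and avoids a suboptimal spectrogram representation.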
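The finding about dilated convolutions comes down to simple arithmetic: for stacked stride-1 convolutions, each layer extends the receptive field by (kernel size - 1) times its dilation. The sketch below works this out for an illustrative stack; the kernel size of 3 and the 8 layers are assumptions chosen for the example, and `receptive_field` is a hypothetical helper.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field, in frames, of stacked stride-1 dilated 1-D convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# Same depth and same number of weights per layer; only the dilation differs.
exponential = [2 ** i for i in range(8)]  # dilations 1, 2, 4, ..., 128
plain = [1] * 8                           # ordinary convolutions

print(receptive_field(3, exponential))  # 511
print(receptive_field(3, plain))        # 17
```

With exponentially growing dilations the context grows roughly exponentially with depth, while an undilated stack of the same size grows only linearly.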


This review was generated by AI and checked by a human editor.