QUICK REVIEW

[Paper Review] TasNet: time-domain audio separation network for real-time, single-channel speech separation

Yi Luo, Nima Mesgarani|arXiv (Cornell University)|Nov 1, 2017

Speech and Audio Processing | References: 19 | Cited by: 37

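One-Sentence Summary

TasNet is a real-time, single-channel speech separation system that operates directly on the raw waveform through a time-domain encoder-decoder framework, bypassing the STFT-based time-frequency representation. By modeling the signal as a nonnegative combination of learned basis signals and estimating source masks on the encoder outputs, TasNet achieves state-of-the-art performance with a total latency of only 5.23 ms, significantly outperforming STFT-based methods in both causal and noncausal settings.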

ABSTRACT

Robust speech processing in multi-talker environments requires effective speech separation. Recent deep learning systems have made significant progress toward solving this problem, yet it remains challenging particularly in real-time, short latency applications. Most methods attempt to construct a mask for each source in time-frequency representation of the mixture signal which is not necessarily an optimal representation for speech separation. In addition, time-frequency decomposition results in inherent problems such as phase/magnitude decoupling and long time window which is required to achieve sufficient frequency resolution. We propose Time-domain Audio Separation Network (TasNet) to overcome these limitations. We directly model the signal in the time-domain using an encoder-decoder framework and perform the source separation on nonnegative encoder outputs. This method removes the frequency decomposition step and reduces the separation problem to estimation of source masks on encoder outputs which is then synthesized by the decoder. Our system outperforms the current state-of-the-art causal and noncausal speech separation algorithms, reduces the computational cost of speech separation, and significantly reduces the minimum required latency of the output. This makes TasNet suitable for applications where low-power, real-time implementation is desirable such as in hearable and telecommunication devices.

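Motivation and Goals

To address the limitations of STFT-based speech separation, such as phase/magnitude decoupling and the high latency caused by long time windows.
To enable real-time, low-latency speech separation suitable for hearable and telecommunication devices.
To investigate whether modeling the waveform directly with a time-domain neural network can outperform frequency-domain methods.
To reduce computational cost and improve separation performance by eliminating STFT and inverse STFT processing.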

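Proposed Method

The system uses a 1-D convolutional encoder that converts the raw waveform into a nonnegative weighted representation over learned basis signals.
Speech separation is performed by estimating source masks on the encoder outputs; each mask represents one speaker's contribution to the mixture weights.
A 1-D transposed convolutional decoder reconstructs the separated waveforms from the masked encoder outputs.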
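The network follows a nonnegative autoencoder framework and is trained end to end to maximize the scale-invariant source-to-noise ratio (SI-SNR) of the reconstructed waveforms.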
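The causal configuration uses a unidirectional LSTM; the noncausal configuration uses a bidirectional LSTM to improve performance.
The basis signals are learned end to end; their frequency responses resemble a mel filterbank, with higher resolution at low frequencies.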
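
The encode, mask, decode pipeline described above can be illustrated for a single waveform segment. This is only a minimal sketch under several assumptions that are not taken from the paper: the basis matrices are random rather than learned, a plain rectifier stands in for whatever nonlinearity produces the nonnegative weights, the masks come from a softmax over sources, and the `scores` argument is a hypothetical stand-in for the LSTM separator's output. All function names are illustrative.

```python
import math
import random

def encode(segment, basis):
    # Nonnegative mixture weights: rectified dot product of the segment
    # with each basis signal (the rectifier is an assumption of this sketch).
    return [max(0.0, sum(b * x for b, x in zip(row, segment))) for row in basis]

def estimate_masks(scores):
    # scores[c][n] is a hypothetical separator score for source c and basis n.
    # A softmax over sources gives nonnegative masks that sum to 1 per basis weight.
    num_sources, num_basis = len(scores), len(scores[0])
    masks = [[0.0] * num_basis for _ in range(num_sources)]
    for n in range(num_basis):
        exps = [math.exp(scores[c][n]) for c in range(num_sources)]
        total = sum(exps)
        for c in range(num_sources):
            masks[c][n] = exps[c] / total
    return masks

def decode(weights, basis):
    # Linear synthesis: weighted sum of the decoder basis signals.
    length = len(basis[0])
    return [sum(w * row[t] for w, row in zip(weights, basis)) for t in range(length)]

def separate(segment, enc_basis, dec_basis, scores):
    # Mask the mixture weights once per source, then decode each masked copy.
    weights = encode(segment, enc_basis)
    masks = estimate_masks(scores)
    return [decode([m * w for m, w in zip(mask, weights)], dec_basis) for mask in masks]

rng = random.Random(0)
L, N, C = 8, 16, 2  # segment length, number of basis signals, number of sources
enc_basis = [[rng.gauss(0, 1) for _ in range(L)] for _ in range(N)]
dec_basis = [[rng.gauss(0, 1) for _ in range(L)] for _ in range(N)]
scores = [[rng.gauss(0, 1) for _ in range(N)] for _ in range(C)]
mixture = [rng.gauss(0, 1) for _ in range(L)]
sources = separate(mixture, enc_basis, dec_basis, scores)
```

Because the masks in this sketch sum to 1 for every basis weight and the decoder is linear, the separated segments add up to the decoded mixture, which is a convenient sanity check on the masking step.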

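Experimental Results

Research Questions

RQ1: Can time-domain modeling directly on the raw waveform outperform conventional STFT-based speech separation in both performance and latency?
RQ2: Does eliminating the STFT step reduce phase-related artifacts and improve separation quality?
RQ3: Can a time-domain system achieve real-time processing with latency low enough for hearing aids and telecommunication devices?
RQ4: How do the learned basis representations compare with conventional filterbanks in spectral resolution and speaker separation ability?
RQ5: How does using nonnegative encoder outputs affect the stability and performance of source mask estimation?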

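Main Findings

TasNet-LSTM achieves an SI-SNRi of 7.7 dB and an SDRi of 8.0 dB on the WSJ0-2mix dataset, a 0.7 dB SI-SNRi improvement over the previous state-of-the-art causal system (uPIT-LSTM).
TasNet-BLSTM achieves an SI-SNRi of 10.8 dB and an SDRi of 11.1 dB, surpassing all previous systems, including two-stage approaches such as uPIT-BLSTM-ST.
TasNet-LSTM has a total system latency of only 5.23 ms, made up of a 5 ms initial delay and 0.23 ms of processing time per segment, well below the 32 ms minimum latency required by STFT-based systems.
The basis signals learned by TasNet show continuous frequency responses with higher resolution at low frequencies; 60% of their center frequencies lie below 1 kHz.
Compared with state-of-the-art STFT-based systems, the system achieves a 6x speedup, with a per-segment processing time below 0.23 ms on a Titan X GPU.
TasNet achieves strong performance without regularization techniques such as recurrent dropout or a post-clustering step, indicating that the architecture is inherently robust.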
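
The SI-SNRi figures above are improvements in scale-invariant source-to-noise ratio relative to the unprocessed mixture. As a rough illustration, the sketch below follows the commonly used definition: remove the mean from both signals, project the estimate onto the reference, and compare the energy of that projection with the energy of the residual. The function names and the toy signals are illustrative assumptions, not code from the paper.

```python
import math

def _zero_mean(x):
    mean = sum(x) / len(x)
    return [v - mean for v in x]

def si_snr(estimate, target):
    # Scale-invariant SNR in dB of `estimate` with respect to `target`.
    est, ref = _zero_mean(estimate), _zero_mean(target)
    ref_energy = sum(r * r for r in ref)
    if ref_energy == 0.0:
        raise ValueError("target must not be constant")
    # Project the estimate onto the reference to get the target component.
    scale = sum(e * r for e, r in zip(est, ref)) / ref_energy
    s_target = [scale * r for r in ref]
    e_noise = [e - s for e, s in zip(est, s_target)]
    target_energy = sum(s * s for s in s_target)
    noise_energy = sum(n * n for n in e_noise)
    if noise_energy == 0.0:
        return math.inf  # estimate is a scaled copy of the target
    if target_energy == 0.0:
        return -math.inf  # estimate is orthogonal to the target
    return 10.0 * math.log10(target_energy / noise_energy)

def si_snr_improvement(estimate, mixture, target):
    # SI-SNRi: gain of the separated estimate over the raw mixture.
    return si_snr(estimate, target) - si_snr(mixture, target)

# Toy signals: the interferer is zero-mean and orthogonal to the target.
target = [1.0, -1.0, 1.0, -1.0]
interferer = [1.0, 1.0, -1.0, -1.0]
mixture = [t + i for t, i in zip(target, interferer)]
estimate = [t + 0.1 * i for t, i in zip(target, interferer)]
improvement_db = si_snr_improvement(estimate, mixture, target)
```

Rescaling the estimate leaves `si_snr` unchanged, which is why the metric suits waveform-output systems whose output gain is arbitrary.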

This review was generated by AI and reviewed by human editors.