QUICK REVIEW

[Paper Review] SPMamba: State-space model is all you need in speech separation

Kai Li, Chen Guo | arXiv (Cornell University) | Apr 2, 2024
Speech Recognition and Synthesis | 8 citations
TL;DR

SPMamba replaces TF-GridNet's BLSTM modules with bidirectional Mamba modules, achieving state-of-the-art speech separation on Librispeech-based data with noise and reverberation while using fewer parameters and less computation.

ABSTRACT

Existing CNN-based speech separation models face local receptive field limitations and cannot effectively capture long time dependencies. Although LSTM and Transformer-based speech separation models can avoid this problem, their high complexity makes them face the challenge of computational resources and inference efficiency when dealing with long audio. To address this challenge, we introduce an innovative speech separation method called SPMamba. This model builds upon the robust TF-GridNet architecture, replacing its traditional BLSTM modules with bidirectional Mamba modules. These modules effectively model the spatiotemporal relationships between the time and frequency dimensions, allowing SPMamba to capture long-range dependencies with linear computational complexity. Specifically, the bidirectional processing within the Mamba modules enables the model to utilize both past and future contextual information, thereby enhancing separation performance. Extensive experiments conducted on public datasets, including WSJ0-2Mix, WHAM!, and Libri2Mix, as well as the newly constructed Echo2Mix dataset, demonstrated that SPMamba significantly outperformed existing state-of-the-art models, achieving superior results while also reducing computational complexity. These findings highlighted the effectiveness of SPMamba in tackling the intricate challenges of speech separation in complex environments.
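
The bidirectional idea in the abstract can be illustrated with a toy example. The sketch below is not Mamba's selective scan (which uses input-dependent parameters) and is not taken from the paper. It assumes a scalar, fixed-parameter linear state-space recurrence; the names `ssm_scan` and `bidirectional_scan`, the parameters `a`, `b`, `c`, and the choice to sum the two directions are all illustrative assumptions. It only shows how one forward pass plus one pass over the reversed sequence gives every time step access to past and future context at a cost that grows linearly with sequence length.

```python
from typing import List


def ssm_scan(x: List[float], a: float = 0.9, b: float = 1.0, c: float = 1.0) -> List[float]:
    """Causal scalar recurrence: h_t = a * h_{t-1} + b * x_t, y_t = c * h_t.

    A single pass over the input, so the cost is linear in len(x).
    """
    h = 0.0
    y = []
    for x_t in x:
        h = a * h + b * x_t
        y.append(c * h)
    return y


def bidirectional_scan(x: List[float], a: float = 0.9, b: float = 1.0, c: float = 1.0) -> List[float]:
    """Run the scan forward and on the reversed input, then sum (assumed merge rule)."""
    forward = ssm_scan(x, a, b, c)
    # Reverse the input, scan, and reverse the result so it lines up in time.
    backward = ssm_scan(x[::-1], a, b, c)[::-1]
    return [f + g for f, g in zip(forward, backward)]


impulse = [0.0, 0.0, 1.0, 0.0, 0.0]
causal = ssm_scan(impulse, a=0.5)
both = bidirectional_scan(impulse, a=0.5)
print("causal:       ", causal)
print("bidirectional:", both)
```

With an impulse in the middle of the sequence, the causal scan responds only at and after the impulse, while the bidirectional output is symmetric around it, which is the non-causal behaviour the abstract attributes to the bidirectional modules.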

Motivation & Objective

  • Motivate state-space models (SSMs) as a remedy for two limitations in long-sequence speech separation: the local receptive fields of CNN-based models and the high computational cost of LSTM- and Transformer-based models.
  • Propose SPMamba by substituting the BLSTM modules in TF-GridNet with bidirectional Mamba modules.
  • Demonstrate improved separation performance and efficiency on a Librispeech-based dataset with noise and reverberation.

Proposed method

  • Adopt TF-GridNet as the base framework and replace its BLSTM modules with bidirectional Mamba (BMamba) modules to retain bidirectional context.
  • Introduce BMamba to process forward and backward sequences, enabling non-causal, BLSTM-like information aggregation.
  • Structure SPMamba with a time-domain module, a frequency-domain module, and a time-frequency attention module, following TF-GridNet design but with BMamba layers.
  • Train with Permutation Invariant Training (PIT) using an SNR loss to optimize source separation quality.
  • Evaluate using SI-SNRi and SDRi, and compare parameter count and MACs with state-of-the-art models.
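
The training objective named above, Permutation Invariant Training with an SNR loss, can be sketched as follows. This is a minimal illustration under assumptions, not the authors' training code: it uses the plain definition SNR = 10 log10(||s||^2 / ||s - s_hat||^2), averages the negative SNR over speakers, and searches all speaker permutations exhaustively. The function names `snr_db` and `pit_snr_loss` and the `eps` constant are hypothetical.

```python
import math
from itertools import permutations
from typing import List, Sequence, Tuple


def snr_db(target: Sequence[float], estimate: Sequence[float], eps: float = 1e-8) -> float:
    """Signal-to-noise ratio in dB between a reference and its estimate."""
    signal = sum(t * t for t in target)
    noise = sum((t - e) ** 2 for t, e in zip(target, estimate))
    return 10.0 * math.log10((signal + eps) / (noise + eps))


def pit_snr_loss(
    targets: List[Sequence[float]], estimates: List[Sequence[float]]
) -> Tuple[float, Tuple[int, ...]]:
    """Return the lowest mean negative-SNR over all output-to-speaker assignments.

    perm[i] is the index of the estimate assigned to targets[i].
    """
    best_loss, best_perm = float("inf"), ()
    for perm in permutations(range(len(estimates))):
        loss = -sum(snr_db(t, estimates[p]) for t, p in zip(targets, perm)) / len(targets)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Two toy "speakers"; the model outputs arrive in swapped order.
targets = [[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]
estimates = [[0.0, 0.9, 0.0, -0.9], [1.1, 0.0, -1.1, 0.0]]
loss, perm = pit_snr_loss(targets, estimates)
print(f"best permutation: {perm}, loss: {loss:.2f} dB")
```

Because the loss is evaluated for every assignment and the minimum is kept, the model is not penalised for emitting the speakers in a different order than the references. The exhaustive search is practical for two speakers; the number of permutations grows factorially with the speaker count.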

Experimental results

Research questions

  • RQ1: Does SPMamba surpass TF-GridNet and other baselines in SDRi and SI-SNRi on a challenging noisy/reverberant dataset?
  • RQ2: Can bidirectional Mamba effectively replace TF-GridNet's BLSTM modules to maintain or improve performance with fewer parameters and lower compute?
  • RQ3: What is the relative efficiency (Parameters and MACs) of SPMamba compared to TF-GridNet and other leading models?
  • RQ4: How does BMamba contribute to modeling long-range dependencies in both time and frequency domains within the TF-GridNet framework?
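
The research questions are framed in terms of SI-SNRi, so a short sketch of that metric may help. The code below follows the commonly used definition of scale-invariant SNR (project the estimate onto the reference, treat the remainder as noise) and defines the improvement as the gain over the unprocessed mixture. Removing the mean before scoring is a common convention assumed here, and the function names and toy signals are illustrative rather than taken from the paper.

```python
import math
from typing import List, Sequence


def _zero_mean(x: Sequence[float]) -> List[float]:
    mean = sum(x) / len(x)
    return [v - mean for v in x]


def si_snr_db(target: Sequence[float], estimate: Sequence[float], eps: float = 1e-8) -> float:
    """Scale-invariant SNR in dB (mean removal is an assumed convention)."""
    s = _zero_mean(target)
    s_hat = _zero_mean(estimate)
    # Project the estimate onto the reference to get the "target" component.
    scale = sum(a * b for a, b in zip(s_hat, s)) / (sum(a * a for a in s) + eps)
    s_target = [scale * a for a in s]
    e_noise = [a - b for a, b in zip(s_hat, s_target)]
    num = sum(a * a for a in s_target)
    den = sum(a * a for a in e_noise)
    return 10.0 * math.log10((num + eps) / (den + eps))


def si_snr_improvement(target, estimate, mixture) -> float:
    """SI-SNRi: SI-SNR of the separated output minus SI-SNR of the raw mixture."""
    return si_snr_db(target, estimate) - si_snr_db(target, mixture)


target = [1.0, -1.0, 1.0, -1.0]
interferer = [1.0, 1.0, -1.0, -1.0]
mixture = [t + i for t, i in zip(target, interferer)]
estimate = [t + 0.1 * i for t, i in zip(target, interferer)]  # mostly clean output
improvement = si_snr_improvement(target, estimate, mixture)
print(f"SI-SNRi: {improvement:.2f} dB")
```

The metric is unchanged if the estimate is rescaled, which is why it is preferred over plain SNR when the output gain of a separation model is arbitrary.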

Key findings

  • SPMamba achieves SDRi 16.01 dB and SI-SNRi 15.20 dB, outperforming TF-GridNet by 2.42 dB and 2.58 dB respectively.
  • SPMamba uses 6.14M parameters and 78.69 GMACs/s, substantially fewer parameters and lower compute than TF-GridNet (14.43M params, 445.56 GMACs/s).
  • On a Librispeech-based dataset with noise and reverberation, SPMamba delivers state-of-the-art performance among tested models.
  • Replacing TF-GridNet's BLSTM modules with bidirectional Mamba maintains high performance while reducing computational demands.
  • The results suggest that Mamba-based architectures are well suited to long-sequence audio processing in speech separation.
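
To put the efficiency figures above in relative terms, the following snippet computes the reduction factors from the parameter counts and GMACs/s quoted in the key findings. Only those four numbers come from the review; the dictionary keys and the `reduction` helper are illustrative.

```python
from typing import Tuple

# Figures quoted in the key findings (parameters in millions, compute in GMACs/s).
spmamba = {"params_m": 6.14, "gmacs_per_s": 78.69}
tf_gridnet = {"params_m": 14.43, "gmacs_per_s": 445.56}


def reduction(baseline: float, candidate: float) -> Tuple[float, float]:
    """Return (baseline / candidate, percent saved relative to the baseline)."""
    ratio = baseline / candidate
    percent_saved = 100.0 * (1.0 - candidate / baseline)
    return ratio, percent_saved


param_ratio, param_saved = reduction(tf_gridnet["params_m"], spmamba["params_m"])
mac_ratio, mac_saved = reduction(tf_gridnet["gmacs_per_s"], spmamba["gmacs_per_s"])

print(f"parameters: {param_ratio:.2f}x fewer ({param_saved:.1f}% saved)")
print(f"MACs:       {mac_ratio:.2f}x fewer ({mac_saved:.1f}% saved)")
```

By these figures SPMamba has roughly 2.35x fewer parameters than TF-GridNet (about 57% saved) and needs roughly 5.66x fewer multiply-accumulate operations (about 82% saved).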

This review was created by AI and reviewed by human editors.