QUICK REVIEW

[Paper Review] KUIELab-MDX-Net: A Two-Stream Neural Network for Music Demixing

Minseok Kim, Woosung Choi | arXiv (Cornell University) | Nov 24, 2021
Speech and Audio Processing | References: 15 | Cited by: 31
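ONE-SENTENCE SUMMARY

KUIELab-MDX-Net is a two-stream music demixing model with a time-frequency branch and a time-domain branch; it blends the two outputs to achieve strong SDR on the MDX 2021 benchmark.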

ABSTRACT

Recently, many methods based on deep learning have been proposed for music source separation. Some state-of-the-art methods have shown that stacking many layers with many skip connections improves the SDR performance. Although such a deep and complex architecture shows outstanding performance, it usually requires numerous computing resources and time for training and evaluation. This paper proposes a two-stream neural network for music demixing, called KUIELab-MDX-Net, which shows a good balance of performance and required resources. The proposed model has a time-frequency branch and a time-domain branch, where each branch separates stems, respectively. It blends results from two streams to generate the final estimation. KUIELab-MDX-Net took second place on leaderboard A and third place on leaderboard B in the Music Demixing Challenge at ISMIR 2021. This paper also summarizes experimental results on another benchmark, MUSDB18. Our source code is available online.

MOTIVATION AND OBJECTIVES

  • Develop a resource-efficient yet high-performing music source separation model.
  • Design a two-stream architecture that combines time-frequency and time-domain approaches to separate stems.
  • Reduce computational load compared to state-of-the-art deep architectures while maintaining SDR performance.
  • Demonstrate effectiveness on the MDX challenge and validate on MUSDB18.

PROPOSED METHOD

  • Implement a time-frequency branch using a TFC-TDF-U-Net v2 with architectural simplifications (multiplicative skip connections, removal of most skip paths).
  • Incorporate a time-domain branch based on pretrained Demucs without fine-tuning to provide an additional source estimate.
  • Add a Mixer network to fuse independently estimated sources and the mixture to refine final outputs.
  • Apply source-specific preprocessing, including cutting off high frequencies, so that a larger effective n_fft fits within the time limits.
  • Train four single-target separation models, one per source, and then train the Mixer with the separation models frozen.
  • Blend outputs from the two streams via a weighted average to generate final estimates.
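
Two of the steps above, frequency cutting and two-stream blending, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the list-of-lists spectrogram layout, and the single scalar `weight` are hypothetical choices made for this example, and the review does not state the actual blending weights or the number of bins kept per source.

```python
from typing import List

# Assumed layout for this sketch: spec[frequency_bin][frame], real-valued.
Spectrogram = List[List[float]]


def cut_high_frequencies(spec: Spectrogram, keep_bins: int) -> Spectrogram:
    """Keep only the lowest `keep_bins` frequency rows of a spectrogram."""
    if not 0 < keep_bins <= len(spec):
        raise ValueError("keep_bins must be between 1 and the number of bins")
    return [row[:] for row in spec[:keep_bins]]


def pad_high_frequencies(spec: Spectrogram, total_bins: int) -> Spectrogram:
    """Zero-fill the discarded rows so the result has `total_bins` rows again."""
    if not spec or total_bins < len(spec):
        raise ValueError("total_bins must be at least the current number of bins")
    n_frames = len(spec[0])
    padding = [[0.0] * n_frames for _ in range(total_bins - len(spec))]
    return [row[:] for row in spec] + padding


def blend(tf_estimate: List[float], td_estimate: List[float],
          weight: float) -> List[float]:
    """Weighted average of two waveform estimates of the same source.

    `weight` applies to the time-frequency estimate and (1 - weight)
    to the time-domain estimate.
    """
    if len(tf_estimate) != len(td_estimate):
        raise ValueError("both estimates must have the same length")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return [weight * a + (1.0 - weight) * b
            for a, b in zip(tf_estimate, td_estimate)]


if __name__ == "__main__":
    spec = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    low = cut_high_frequencies(spec, keep_bins=2)      # model input
    restored = pad_high_frequencies(low, total_bins=4)  # before inversion
    print(restored)
    print(blend([1.0, 0.0], [0.0, 1.0], weight=0.75))
```

Dropping the high-frequency rows keeps the model input small even when n_fft grows, which is the trade-off the preprocessing bullet describes. The zero-padding step is one simple way to restore the full bin count before inverting the spectrogram.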

EXPERIMENTAL RESULTS

Research Questions

  • RQ1: Can a two-stream architecture (time-frequency and time-domain) achieve competitive SDR with reduced resources for music demixing?
  • RQ2: What architectural and preprocessing adjustments yield a favorable balance between performance and computation time for MDX-compliant models?
  • RQ3: Does a Mixer component improve separation by exploiting cross-source information within the mixture?
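
All three questions are measured in SDR (signal-to-distortion ratio, in dB). As a rough guide to what the number means, the sketch below computes the common energy-ratio form of SDR for one reference/estimate pair. The function name and the `eps` stabilizer are assumptions for this example, and benchmark toolkits may aggregate differently (for instance per frame or per track), so this is not presented as the exact protocol behind the reported figures.

```python
import math
from typing import Sequence


def sdr_db(reference: Sequence[float], estimate: Sequence[float],
           eps: float = 1e-7) -> float:
    """Energy-ratio SDR in dB: 10 * log10(|s|^2 / |s - s_hat|^2).

    `eps` (an assumed small constant) avoids division by zero and
    log of zero for silent references or perfect estimates.
    """
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10.0 * math.log10((signal_energy + eps) / (error_energy + eps))


if __name__ == "__main__":
    reference = [1.0, 0.0, -1.0, 0.0]
    close_estimate = [0.9, 0.0, -0.9, 0.0]   # small error
    rough_estimate = [0.5, 0.0, -0.5, 0.0]   # larger error
    print(sdr_db(reference, close_estimate))
    print(sdr_db(reference, rough_estimate))
```

A higher value means less residual error relative to the true source: an estimate whose error amplitude is one tenth of the signal scores about 20 dB, while one with half the signal amplitude in error scores about 6 dB.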

Key Findings

Model                                | Vocals SDR (dB) | Drums SDR (dB) | Bass SDR (dB) | Other SDR (dB)
TFC-TDF-U-Net v1 (Choi et al., 2020) | 7.98            | 6.11           | 5.94          | 5.02
X-UMX (Sawata et al., 2021)          | 6.61            | 6.47           | 5.43          | 4.64
Demucs (Défossez et al., 2021)       | 6.84            | 6.86           | 7.01          | 4.42
D3Net (Takahashi & Mitsufuji, 2021)  | 7.24            | 7.01           | 5.25          | 4.53
ResUNetDecouple+ (Kong et al., 2021) | 8.98            | 6.62           | 6.04          | 5.29
TFC-TDF-U-Net v2                     | 8.81            | 6.52           | 7.65          | 5.70
v2 + Mixer                           | 8.91            | 7.07           | 7.33          | 5.81
v2 + Demucs                          | 8.80            | 7.14           | 8.11          | 5.90
KUIELab-MDX-Net                      | 9.00            | 7.33           | 7.86          | 5.95

  • KUIELab-MDX-Net achieves higher SDR on MUSDB18 than every listed prior method on all four sources (vocals 9.00, drums 7.33, bass 7.86, other 5.95 dB).
  • Adding the Mixer to v2 raises SDR on vocals (8.81 to 8.91), drums (6.52 to 7.07), and other (5.70 to 5.81) but lowers bass (7.65 to 7.33); the full KUIELab-MDX-Net scores highest on vocals, drums, and other, while v2 + Demucs scores highest on bass (8.11).
  • Incorporating a time-domain branch and a Mixer provides additional gains over the single-stream v2 model on every source.
  • The model ranks second on Leaderboard A and third on Leaderboard B in the MDX 2021 challenge.
  • The approach demonstrates strong performance while using a downsized architecture relative to some deep baselines.
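
The comparisons above can be checked with a few lines of arithmetic over the table. The sketch below copies the SDR values from this review's table and derives the best model per source and the mean SDR over the four sources. The mean is a figure computed here for illustration, not one quoted in this review, and the helper names are hypothetical.

```python
from statistics import mean
from typing import Dict, Tuple

SOURCES = ("vocals", "drums", "bass", "other")

# SDR values in dB, copied from the results table in this review.
SDR_TABLE: Dict[str, Tuple[float, float, float, float]] = {
    "TFC-TDF-U-Net v1": (7.98, 6.11, 5.94, 5.02),
    "X-UMX": (6.61, 6.47, 5.43, 4.64),
    "Demucs": (6.84, 6.86, 7.01, 4.42),
    "D3Net": (7.24, 7.01, 5.25, 4.53),
    "ResUNetDecouple+": (8.98, 6.62, 6.04, 5.29),
    "TFC-TDF-U-Net v2": (8.81, 6.52, 7.65, 5.70),
    "v2 + Mixer": (8.91, 7.07, 7.33, 5.81),
    "v2 + Demucs": (8.80, 7.14, 8.11, 5.90),
    "KUIELab-MDX-Net": (9.00, 7.33, 7.86, 5.95),
}


def best_model_per_source(table: Dict[str, Tuple[float, ...]]) -> Dict[str, str]:
    """Return, for each source, the model with the highest SDR."""
    return {
        source: max(table, key=lambda model: table[model][index])
        for index, source in enumerate(SOURCES)
    }


def mean_sdr(table: Dict[str, Tuple[float, ...]]) -> Dict[str, float]:
    """Return the unweighted mean SDR over the four sources per model."""
    return {model: mean(scores) for model, scores in table.items()}


if __name__ == "__main__":
    for source, model in best_model_per_source(SDR_TABLE).items():
        print(f"{source}: {model}")
    for model, value in sorted(mean_sdr(SDR_TABLE).items(),
                               key=lambda item: item[1], reverse=True):
        print(f"{model}: {value:.2f} dB")
```

By this derived average the full model comes out on top at roughly 7.5 dB, against roughly 7.2 dB for v2 alone, and bass is the only source where another configuration (v2 + Demucs) leads.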


This review was generated by AI and reviewed by human editors.