QUICK REVIEW

[Paper Review] KUIELab-MDX-Net: A Two-Stream Neural Network for Music Demixing

Minseok Kim, Woosung Choi|arXiv (Cornell University)|Nov 24, 2021

Speech and Audio Processing|References: 15|Cited by: 31

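ONE-SENTENCE SUMMARY

KUIELab-MDX-Net is a two-stream music demixing model with a time-frequency branch and a time-domain branch; it blends the outputs of the two branches to achieve strong SDR in the MDX 2021 challenge.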

ABSTRACT

Recently, many methods based on deep learning have been proposed for music source separation. Some state-of-the-art methods have shown that stacking many layers with many skip connections improves the SDR performance. Although such a deep and complex architecture shows outstanding performance, it usually requires numerous computing resources and time for training and evaluation. This paper proposes a two-stream neural network for music demixing, called KUIELab-MDX-Net, which shows a good balance of performance and required resources. The proposed model has a time-frequency branch and a time-domain branch, where each branch separates stems, respectively. It blends results from two streams to generate the final estimation. KUIELab-MDX-Net took second place on leaderboard A and third place on leaderboard B in the Music Demixing Challenge at ISMIR 2021. This paper also summarizes experimental results on another benchmark, MUSDB18. Our source code is available online.

RESEARCH MOTIVATION AND OBJECTIVES

Develop a music source separation model that is resource-efficient yet high-performing.
Design a two-stream architecture that combines time-frequency and time-domain approaches to separate stems.
Reduce the computational load relative to state-of-the-art deep architectures while maintaining SDR performance.
Demonstrate effectiveness in the Music Demixing (MDX) Challenge and validate on MUSDB18.

PROPOSED METHOD

Implement the time-frequency branch with TFC-TDF-U-Net v2, a simplified architecture that uses multiplicative skip connections and removes most other skip paths.
Add a time-domain branch based on pretrained Demucs, used without fine-tuning, to provide an additional source estimate.
Add a Mixer network that takes the independently estimated sources together with the mixture and refines the final outputs.
Apply source-specific preprocessing, including frequency cutting, so that a larger effective n_fft fits within the time limits.
Train one single-target separation model per source (four in total), then train the Mixer with the separation models frozen.
Blend the outputs of the two streams with a weighted average to produce the final estimates.
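
Two of the steps above are easy to make concrete: frequency cutting and the final weighted blend. The Python sketch below is illustrative only. The function names, the 44.1 kHz sample rate, the n_fft values, the number of kept bins, and the 0.5 blend weight are all assumptions chosen for the example and are not taken from the paper.

```python
def kept_bandwidth_hz(sample_rate, n_fft, kept_bins):
    """Upper frequency (Hz) covered when only the lowest `kept_bins` STFT bins are kept."""
    bin_width_hz = sample_rate / n_fft  # spacing between adjacent STFT bins
    return kept_bins * bin_width_hz


def blend(tf_estimate, td_estimate, weight):
    """Sample-wise weighted average of two equal-length waveform estimates.

    `weight` applies to the time-frequency estimate and (1 - weight) to the
    time-domain estimate.
    """
    if len(tf_estimate) != len(td_estimate):
        raise ValueError("estimates must have the same length")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return [weight * a + (1.0 - weight) * b
            for a, b in zip(tf_estimate, td_estimate)]


# Hypothetical settings: doubling n_fft halves the bin width, so keeping the
# same number of bins gives finer frequency resolution over a narrower band.
print(kept_bandwidth_hz(44100, 4096, 2048))  # 22050.0
print(kept_bandwidth_hz(44100, 8192, 2048))  # 11025.0

# Hypothetical three-sample waveforms from the two branches.
tf_out = [0.5, 1.0, -0.25]
td_out = [0.0, 0.5, 0.25]
print(blend(tf_out, td_out, 0.5))  # [0.25, 0.75, 0.0]
```

The first function shows the trade-off behind frequency cutting: for a fixed number of input bins, a larger n_fft gives finer resolution but discards more of the high band, which is why the cut would plausibly be chosen per source. In practice the blend weight could also differ per source.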

EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

RQ1: Can a two-stream architecture (time-frequency and time-domain) achieve competitive SDR with reduced resources for music demixing?
RQ2: What architectural and preprocessing adjustments yield a favorable balance between performance and computation time for MDX-compliant models?
RQ3: Does a Mixer component improve separation by exploiting cross-source information within the mixture?

KEY FINDINGS

Model	Vocals SDR (dB)	Drums SDR (dB)	Bass SDR (dB)	Other SDR (dB)
TFC-TDF-U-Net v1 (Choi et al., 2020)	7.98	6.11	5.94	5.02
X-UMX (Sawata et al., 2021)	6.61	6.47	5.43	4.64
Demucs (Défossez et al., 2021)	6.84	6.86	7.01	4.42
D3Net (Takahashi & Mitsufuji, 2021)	7.24	7.01	5.25	4.53
ResUNetDecouple+ (Kong et al., 2021)	8.98	6.62	6.04	5.29
TFC-TDF-U-Net v2	8.81	6.52	7.65	5.70
v2 + Mixer	8.91	7.07	7.33	5.81
v2 + Demucs	8.80	7.14	8.11	5.90
KUIELab-MDX-Net	9.00	7.33	7.86	5.95
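
For readers unfamiliar with the metric, the sketch below shows the basic energy-ratio form of the signal-to-distortion ratio in dB. It is a minimal illustration, assuming plain Python lists as waveforms; the function name and the `eps` stabilizer are choices made for this example. Evaluation toolkits often use variants of this definition (for example, frame-wise statistics), so it is not necessarily how the numbers in the table were computed.

```python
import math


def sdr_db(reference, estimate, eps=1e-12):
    """Signal-to-distortion ratio in dB: 10 * log10(signal energy / error energy).

    `eps` is a small constant (an assumption of this sketch) that avoids
    division by zero and log of zero for silent or perfect inputs.
    """
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10.0 * math.log10((signal_energy + eps) / (error_energy + eps))


# Hypothetical example: the estimate is the reference scaled by 0.9, so the
# error has 1/100 of the signal energy, which corresponds to about 20 dB.
reference = [1.0, -1.0, 1.0, -1.0]
estimate = [0.9, -0.9, 0.9, -0.9]
print(round(sdr_db(reference, estimate), 3))
```

Because the scale is logarithmic, a gain of 1 dB corresponds to roughly a 21% reduction in error energy for the same reference signal.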

On MUSDB18, KUIELab-MDX-Net outperforms every listed prior method (TFC-TDF-U-Net v1, X-UMX, Demucs, D3Net, ResUNetDecouple+) on all four sources.
TFC-TDF-U-Net v2 with the Mixer surpasses all listed prior methods on drums (7.07), bass (7.33), and other (5.81), and comes close to the best prior vocals SDR (8.91 vs. 8.98 for ResUNetDecouple+).
Adding the time-domain branch and the Mixer yields further gains over the single-stream v2 model on every source; on bass, v2 + Demucs without the Mixer scores highest (8.11 vs. 7.86 for the full model).
The model ranked second on Leaderboard A and third on Leaderboard B in the MDX 2021 challenge.
The approach achieves these results with a downsized architecture relative to some deep baselines.
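
The per-source comparisons can be checked directly against the table. The short Python sketch below copies the SDR values from the table and finds the top-scoring model for each source; the variable and function names are illustrative choices for this example.

```python
# SDR values (dB) copied from the table, in the order: vocals, drums, bass, other.
SOURCES = ("vocals", "drums", "bass", "other")
RESULTS = {
    "TFC-TDF-U-Net v1": (7.98, 6.11, 5.94, 5.02),
    "X-UMX": (6.61, 6.47, 5.43, 4.64),
    "Demucs": (6.84, 6.86, 7.01, 4.42),
    "D3Net": (7.24, 7.01, 5.25, 4.53),
    "ResUNetDecouple+": (8.98, 6.62, 6.04, 5.29),
    "TFC-TDF-U-Net v2": (8.81, 6.52, 7.65, 5.70),
    "v2 + Mixer": (8.91, 7.07, 7.33, 5.81),
    "v2 + Demucs": (8.80, 7.14, 8.11, 5.90),
    "KUIELab-MDX-Net": (9.00, 7.33, 7.86, 5.95),
}


def best_per_source(results):
    """Return {source: name of the model with the highest SDR on that source}."""
    best = {}
    for index, source in enumerate(SOURCES):
        best[source] = max(results, key=lambda name: results[name][index])
    return best


for source, model in best_per_source(RESULTS).items():
    print(f"{source}: {model}")
```

With the table's values, the full KUIELab-MDX-Net is the top entry for vocals, drums, and other, while v2 + Demucs is the top entry for bass.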


This review was generated by AI and checked by human editors.