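[Paper Explained] KUIELab-MDX-Net: A Two-Stream Neural Network for Music Demixing
KUIELab-MDX-Net is a two-stream music demixing model with a time-frequency branch and a time-domain branch; it blends their outputs to achieve strong SDR on the MDX 2021 benchmark.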
Recently, many methods based on deep learning have been proposed for music source separation. Some state-of-the-art methods have shown that stacking many layers with many skip connections improves SDR performance. Although such a deep and complex architecture shows outstanding performance, it usually requires substantial computing resources and time for training and evaluation. This paper proposes a two-stream neural network for music demixing, called KUIELab-MDX-Net, which shows a good balance of performance and required resources. The proposed model has a time-frequency branch and a time-domain branch, each of which separates stems. It blends the results from the two streams to generate the final estimation. KUIELab-MDX-Net took second place on leaderboard A and third place on leaderboard B in the Music Demixing Challenge at ISMIR 2021. This paper also summarizes experimental results on another benchmark, MUSDB18. Our source code is available online.
Research Motivation and Objectives
- Motivate a resource-efficient yet high-performing music source separation model.
- Design a two-stream architecture that combines time-frequency and time-domain approaches to separate stems.
- Reduce computational load compared to state-of-the-art deep architectures while maintaining SDR performance.
- Demonstrate effectiveness on the MDX challenge and validate on MUSDB18.
Proposed Method
- Implement a time-frequency branch using a TFC-TDF-U-Net v2 with architectural simplifications (multiplicative skip connections, removal of most skip paths).
- Incorporate a time-domain branch based on pretrained Demucs without fine-tuning to provide an additional source estimate.
- Add a Mixer network to fuse independently estimated sources and the mixture to refine final outputs.
- Apply source-specific preprocessing including frequency cutting to extend effective n_fft within time limits.
- Train four single-target separation models, one per source, and then train the Mixer with the separation models frozen.
- Blend outputs from the two streams via a weighted average to generate final estimates.
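Two of the steps above, frequency cutting and the final weighted blend, can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`cut_high_bins`, `pad_high_bins`, `blend`), the parameter names (`dim_f`, `w_tf`), the array shapes, and the example values are all hypothetical, and random arrays stand in for real spectrograms and waveforms.

```python
import numpy as np


def cut_high_bins(spec: np.ndarray, dim_f: int) -> np.ndarray:
    """Keep only the lowest `dim_f` frequency bins of a (freq, frames) spectrogram."""
    return spec[:dim_f, :]


def pad_high_bins(spec: np.ndarray, n_bins: int) -> np.ndarray:
    """Zero-pad a cut spectrogram back to `n_bins` frequency bins."""
    pad = n_bins - spec.shape[0]
    return np.pad(spec, ((0, pad), (0, 0)))


def blend(tf_est: np.ndarray, td_est: np.ndarray, w_tf: float) -> np.ndarray:
    """Weighted average of two waveform estimates of the same source."""
    if tf_est.shape != td_est.shape:
        raise ValueError("estimates must have the same shape")
    return w_tf * tf_est + (1.0 - w_tf) * td_est


# Toy data only: random arrays stand in for real spectrograms and waveforms.
rng = np.random.default_rng(0)

# Frequency cutting: the model only sees the lowest `dim_f` bins, so a larger
# n_fft can be used without growing the model input along the frequency axis.
n_fft = 4096                                   # assumed example value
full_spec = rng.standard_normal((n_fft // 2 + 1, 8))
cut = cut_high_bins(full_spec, dim_f=1024)     # assumed example cutoff
restored = pad_high_bins(cut, n_bins=full_spec.shape[0])

# Blending: combine the two branches' estimates of one source.
tf_est = rng.standard_normal((2, 1000))        # stereo, time-frequency branch
td_est = rng.standard_normal((2, 1000))        # stereo, time-domain branch
final = blend(tf_est, td_est, w_tf=0.5)        # 0.5 is an arbitrary example weight
```

In practice the cutoff and the blend weight would presumably be chosen per source, since bass and vocals occupy different frequency ranges and the two branches do not perform equally well on every stem.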
Experimental Results
Research Questions
- RQ1: Can a two-stream architecture (time-frequency and time-domain) achieve competitive SDR with reduced resources for music demixing?
- RQ2: What architectural and preprocessing adjustments yield a favorable balance between performance and computation time for MDX-compliant models?
- RQ3: Does a Mixer component improve separation by exploiting cross-source information within the mixture?
Key Findings
| Model | Vocals SDR (dB) | Drums SDR (dB) | Bass SDR (dB) | Other SDR (dB) |
|---|---|---|---|---|
| TFC-TDF-U-Net v1 (Choi et al., 2020) | 7.98 | 6.11 | 5.94 | 5.02 |
| X-UMX (Sawata et al., 2021) | 6.61 | 6.47 | 5.43 | 4.64 |
| Demucs (Défossez et al., 2021) | 6.84 | 6.86 | 7.01 | 4.42 |
| D3Net (Takahashi & Mitsufuji, 2021) | 7.24 | 7.01 | 5.25 | 4.53 |
| ResUNetDecouple+ (Kong et al., 2021) | 8.98 | 6.62 | 6.04 | 5.29 |
| TFC-TDF-U-Net v2 | 8.81 | 6.52 | 7.65 | 5.70 |
| v2 + Mixer | 8.91 | 7.07 | 7.33 | 5.81 |
| v2 + Demucs | 8.80 | 7.14 | 8.11 | 5.90 |
| KUIELab-MDX-Net | 9.00 | 7.33 | 7.86 | 5.95 |
- On MUSDB18, KUIELab-MDX-Net has the highest SDR in the table for vocals (9.00), drums (7.33), and other (5.95); its bass SDR (7.86) is second only to v2 + Demucs (8.11).
- Even without the time-domain branch, v2 + Mixer beats every prior method listed on drums (7.07), bass (7.33), and other (5.81), and trails ResUNetDecouple+ on vocals by 0.07 (8.91 vs. 8.98).
- Adding the time-domain branch and the Mixer together improves on the single-stream v2 model for every source.
- The model ranks second on Leaderboard A and third on Leaderboard B in the MDX 2021 challenge.
- The approach demonstrates strong performance while using a downsized architecture relative to some deep baselines.
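For readers unfamiliar with the metric, the basic signal-to-distortion ratio behind the numbers above can be sketched as follows. This is a minimal illustration of the plain energy-ratio definition, assuming a reference and an estimate of the same shape; the function name `sdr` and the `eps` stabilizer are illustrative choices, and benchmark toolkits may compute framewise or otherwise aggregated variants, so this sketch should not be expected to reproduce the table values.

```python
import numpy as np


def sdr(reference: np.ndarray, estimate: np.ndarray, eps: float = 1e-7) -> float:
    """Signal-to-distortion ratio in dB: reference energy over error energy."""
    if reference.shape != estimate.shape:
        raise ValueError("reference and estimate must have the same shape")
    signal_energy = np.sum(reference ** 2) + eps
    error_energy = np.sum((reference - estimate) ** 2) + eps
    return float(10.0 * np.log10(signal_energy / error_energy))


# Toy check: an estimate that is 10% too quiet everywhere.
# Error amplitude is 0.1 of the signal, so the energy ratio is 100 (about 20 dB).
reference = np.ones(100)
estimate = 0.9 * reference
score = sdr(reference, estimate)
```

Higher is better: each additional 10 dB corresponds to a tenfold reduction in error energy relative to the reference.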
This explanation was generated by AI and reviewed by a human editor.