QUICK REVIEW

[Paper Review] The Costs of Reproducibility in Music Separation Research: a Replication of Band-Split RNN

Paul Magron, Romain Serizel | arXiv (Cornell University) | Mar 10, 2026

Music and Audio Processing | Cited by 0

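ONE-SENTENCE SUMMARY

This paper replicates Band-Split RNN (BSRNN) for music source separation, analyzes the costs of reproducibility, and releases code and pre-trained models to promote transparent, energy-conscious research.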

ABSTRACT

Music source separation is the task of isolating the instrumental tracks from a music song. Despite its spectacular recent progress, the trend towards more complex architectures and training protocols exacerbates reproducibility issues. The band-split recurrent neural networks (BSRNN) model is promising in this regard, since it yields close to state-of-the-art results on public datasets, and requires reasonable resources for training. Unfortunately, it is not straightforward to reproduce since its full code is not available. In this paper, we attempt to replicate BSRNN as closely as possible to the original paper through extensive experiments, which allows us to conduct a critical reflection on this reproducibility issue. Our contributions are three-fold. First, this study yields several insights on the model design and training pipeline, which sheds light on potential future improvements. In particular, since we were unsuccessful in reproducing the original results, we explore additional variants that ultimately yield an optimized BSRNN model, whose performance largely improves that of the original. Second, we discuss reproducibility issues from both methodological and practical perspectives. We notably underline how substantial time and energy costs could have been saved upon availability of the full pipeline. Third, our code and pre-trained models are released publicly to foster reproducible research. We hope that this study will contribute to spread awareness on the importance of reproducible research in the music separation community, and help promoting more transparent and sustainable practices.

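RESEARCH MOTIVATION AND OBJECTIVES

Assess the reproducibility challenges of replicating BSRNN and related music source separation (MSS) models on MUSDB18-HQ.
Identify the design, training, and data-generation factors that affect the performance of the replicated model.
Propose and evaluate variants to narrow the gap between the replicated results and those originally reported.
Highlight the energy and time costs of reproducible research in music source separation.
Provide an open, runnable implementation and pre-trained models to support the community.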

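PROPOSED METHOD

Replicate the architecture and training pipeline of the original Band-Split RNN (BSRNN) as closely as possible on MUSDB18-HQ.
Propose and implement variants (stereo modeling, alternative layers, self-attention, multi-head mechanisms) to investigate the performance gap.
Train small and large model configurations under limited compute; adjust training hyperparameters to match the effective learning rate.
Evaluate on MUSDB18-HQ with utterance-level signal-to-distortion ratio (uSDR) and chunk-level SDR (cSDR); estimate and report energy consumption with CodeCarbon and Green Algorithms.
Release code and pre-trained models publicly to enable reproducible research and further experiments.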
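
The two evaluation metrics named above can be illustrated with a minimal sketch. It assumes the plain energy-ratio definition of SDR and non-overlapping chunks; the function names and the chunk length are hypothetical, and the cSDR reported in MSS papers is usually computed with a dedicated evaluation toolkit, so it may differ in detail from this simplified version.

```python
import math
from statistics import median


def sdr_db(reference, estimate, eps=1e-10):
    """Plain signal-to-distortion ratio in dB between two equal-length sequences."""
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    signal = sum(r * r for r in reference)
    error = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    # eps keeps the ratio finite for silent references or perfect estimates.
    return 10.0 * math.log10((signal + eps) / (error + eps))


def usdr_db(reference, estimate):
    """Utterance-level SDR: a single ratio computed over the whole track."""
    return sdr_db(reference, estimate)


def chunk_sdr_db(reference, estimate, chunk_len):
    """Simplified chunk-level SDR: median of SDRs over non-overlapping chunks."""
    scores = [
        sdr_db(reference[i:i + chunk_len], estimate[i:i + chunk_len])
        for i in range(0, len(reference) - chunk_len + 1, chunk_len)
    ]
    return median(scores)


# Toy usage: the estimate is the reference attenuated by 10 %.
reference = [math.sin(0.01 * n) for n in range(4000)]
estimate = [0.9 * r for r in reference]
print(usdr_db(reference, estimate), chunk_sdr_db(reference, estimate, 1000))
```

Because uSDR pools the whole track while the chunk-level score takes a median, the two can rank models differently when errors are concentrated in a few passages.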

Figure 1: Overview of BSRNN and its variants (A). Blocks with dashed contours denote variants not in the original architecture. The residual network (B) in the sequence / band modules is based on either RNNs, as in the original model, or dilated convolutions.

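EXPERIMENTAL RESULTS

Research Questions

RQ1: What are the core reproducibility obstacles when replicating BSRNN on MUSDB18-HQ?
RQ2: How do architectural and training variants affect MSS performance and reproducibility costs?
RQ3: Can targeted variants narrow the gap between the replicated results and the original BSRNN performance?
RQ4: What are the energy and time implications of pursuing reproducible MSS research?
RQ5: Does providing an open, runnable pipeline improve reproducibility and community adoption?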

Key Findings

Model	Vocals uSDR (dB)	Bass uSDR (dB)	Drums uSDR (dB)	Other uSDR (dB)	Average uSDR (dB)	Parameters (M)	Energy (CodeCarbon, kWh)	Energy (Green Algorithms, kWh)
Base model: N=64, R=8	7.7	6.1	9.7	4.8	7.1	32.3	127	168
Accumulating gradients	8.0	5.8	9.6	4.9	7.1	-	129	170
Monitoring with the loss	7.5	6.4	9.3	4.8	7.1	-	120	159
Loss domain: time	7.9	6.1	9.4	4.9	7.2	-	116	153
Loss domain: STFT	7.9	6.4	9.6	4.9	7.2	-	131	173
STFT: window=4096, hop=1024	7.3	5.9	8.7	4.4	6.6	37.1	58	92
Masker factor μ=2	7.9	6.8	9.4	4.4	7.1	20.6	110	151
Large model: N=128, R=12	9.2	7.3	10.3	5.8	8.2	146.7	230	321
Large model with patience=30	9.5	7.8	10.3	6.3	8.4	-	354	495
Stereo Naive	7.7	6.6	8.4	4.0	6.7	37.1	78	122
Naive, with μ=8	7.9	6.1	8.7	4.3	6.7	81.1	87	140
TAC with TanH	7.6	6.0	9.6	4.3	6.8	34.7	117	154
TAC with PReLU	7.9	6.5	10.0	4.7	7.3	34.7	128	167
BSCNN	7.3	5.9	9.0	4.2	6.6	29.7	113	153
Attention: Na=1, Ea=8	7.7	7.4	10.4	4.8	7.6	33.0	151	199
Attention: Na=2, Ea=16	8.2	7.7	10.4	4.9	7.8	33.2	157	224
Multi-head: H=2	7.6	5.5	9.1	4.0	6.6	22.0	91	137
Silent target (instead of all sources)	7.9	6.6	9.5	4.4	7.1	32.3	110	146
No SAD; UMX-like augmentations	8.2	6.9	9.5	5.3	7.5	-	135	179
Optimized models: with TAC	10.1	9.1	10.9	6.7	9.2	149.9	426	593
+ TAC	10.2	10.2	11.3	6.9	9.6	164.1	508	711
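
The table reports two energy figures per run, one from CodeCarbon and one from Green Algorithms. As a rough illustration of how a calculator-style estimate can be formed, the sketch below multiplies runtime by an assumed hardware power draw and a data-center overhead factor (PUE). The function name, all hardware numbers, and the default memory power value are assumptions for illustration; none of them are taken from the paper, and the actual tools should be consulted for their exact parameters.

```python
def estimate_energy_kwh(hours, n_cores, core_power_w, core_usage,
                        memory_gb, pue, memory_power_w_per_gb=0.3725):
    """Estimate energy in kWh as t * (n_c * P_c * u_c + n_m * P_m) * PUE / 1000.

    hours                  runtime t in hours
    n_cores, core_power_w  number of compute units n_c and power per unit P_c (W)
    core_usage             average utilisation u_c, between 0 and 1
    memory_gb              allocated memory n_m in GB
    memory_power_w_per_gb  power per GB of memory P_m (assumed default)
    pue                    power usage effectiveness of the facility
    """
    power_w = n_cores * core_power_w * core_usage + memory_gb * memory_power_w_per_gb
    return hours * power_w * pue / 1000.0


# Hypothetical run: 100 hours on one 250 W GPU at full load, 32 GB RAM, PUE 1.5.
energy = estimate_energy_kwh(hours=100, n_cores=1, core_power_w=250.0,
                             core_usage=1.0, memory_gb=32, pue=1.5)
print(f"{energy:.1f} kWh")
```

An estimate of this kind depends entirely on the assumed power figures, which is one reason two tools can report different totals for the same training run.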

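Reproducing the original BSRNN results is challenging; several variants achieve notable performance gains, in some respects beyond the originally reported results.
Stereo modeling, self-attention, and careful data-generation choices have a marked effect on MSS performance and resource usage.
The optimized variants, which combine TAC, attention, and a larger model, exceed the base model in validation uSDR, at a higher energy cost.
Different inference and evaluation pipelines (e.g., segment size and overlap-add strategy) can shift test scores by up to about 0.3 dB on the vocals track.
Publicly releasing code and pre-trained models lowers the barrier to replication and promotes more energy-efficient, transparent MSS research.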
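
The sensitivity to segment size and overlap-add strategy mentioned above is easier to see with a concrete sketch of segment-wise inference. The version below assumes a triangular window and weight normalisation; the function name, the window choice, and the `model` callable are hypothetical and do not describe the pipeline used in the paper.

```python
def overlap_add_inference(signal, model, segment_len, hop):
    """Run `model` on overlapping segments and recombine them by overlap-add.

    `model` is any callable mapping a list of samples to a list of the same length.
    """
    if not 0 < hop <= segment_len:
        raise ValueError("hop must satisfy 0 < hop <= segment_len")
    n = len(signal)
    # Triangular window, strictly positive so the normalisation never divides by zero.
    window = [1.0 - abs(2.0 * (i + 0.5) / segment_len - 1.0)
              for i in range(segment_len)]
    output = [0.0] * n
    weight = [0.0] * n
    start = 0
    while start < n:
        segment = signal[start:start + segment_len]  # last segment may be shorter
        estimate = model(segment)
        for i, value in enumerate(estimate):
            output[start + i] += window[i] * value
            weight[start + i] += window[i]
        if start + segment_len >= n:
            break
        start += hop
    # Divide by the accumulated window weights to undo the overlap gain.
    return [o / w for o, w in zip(output, weight)]


# Toy usage with a placeholder "model" that halves the amplitude.
mixture = [float(i % 7) for i in range(50)]
separated = overlap_add_inference(mixture, lambda seg: [0.5 * x for x in seg],
                                  segment_len=16, hop=8)
```

With a real separation network, each segment is processed with a different temporal context, so changing `segment_len`, `hop`, or the window changes the blended output and therefore the measured SDR.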

Figure 2: Validation uSDR over epochs for the base model on the vocals track, with a patience of 10 (left) or 30 (right). Each color corresponds to a different run, and the dashed lines correspond to each run’s best uSDR.


This review was generated by AI and reviewed by human editors.