QUICK REVIEW

[Paper Review] The Costs of Reproducibility in Music Separation Research: a Replication of Band-Split RNN

Paul Magron, Romain Serizel|arXiv (Cornell University)|Mar 10, 2026

Music and Audio Processing | Citations: 0

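ONE-LINE SUMMARY

This paper replicates the Band-Split RNN (BSRNN) model for music source separation, analyzes what the replication cost, and releases code and pre-trained models to promote transparent, energy-aware research.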

ABSTRACT

Music source separation is the task of isolating the instrumental tracks from a music song. Despite its spectacular recent progress, the trend towards more complex architectures and training protocols exacerbates reproducibility issues. The band-split recurrent neural networks (BSRNN) model is promising in this regard, since it yields close to state-of-the-art results on public datasets, and requires reasonable resources for training. Unfortunately, it is not straightforward to reproduce since its full code is not available. In this paper, we attempt to replicate BSRNN as closely as possible to the original paper through extensive experiments, which allows us to conduct a critical reflection on this reproducibility issue. Our contributions are three-fold. First, this study yields several insights on the model design and training pipeline, which sheds light on potential future improvements. In particular, since we were unsuccessful in reproducing the original results, we explore additional variants that ultimately yield an optimized BSRNN model, whose performance largely improves that of the original. Second, we discuss reproducibility issues from both methodological and practical perspectives. We notably underline how substantial time and energy costs could have been saved upon availability of the full pipeline. Third, our code and pre-trained models are released publicly to foster reproducible research. We hope that this study will contribute to spread awareness on the importance of reproducible research in the music separation community, and help promoting more transparent and sustainable practices.

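Motivation and Objectives

Assess the reproducibility challenges of replicating BSRNN and related music source separation (MSS) models on MUSDB18-HQ.
Identify the design, training, and data generation factors that affect the replicated performance.
Propose and evaluate variants that close the gap between the replicated results and the originally reported ones.
Highlight the energy and time costs of reproducible research in music source separation.
Provide an open, runnable implementation and pre-trained models to support the community.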

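Proposed Method

Replicate the original Band-Split RNN (BSRNN) architecture and training pipeline on MUSDB18-HQ as closely as possible.
Propose and implement variants (stereo modeling, alternative layers, self-attention, multi-head mechanisms) to investigate the performance gap.
Train small and large model configurations under constrained compute, adjusting the training hyperparameters to match the effective learning rate.
Evaluate on MUSDB18-HQ with utterance SDR (uSDR) and chunk SDR (cSDR), and report energy consumption using CodeCarbon and Green Algorithms estimates.
Release the code and pre-trained models publicly to enable reproducible research and further experiments.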
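
To make the two evaluation metrics named above concrete, here is a minimal NumPy sketch of an utterance-level and a chunk-level SDR. It rests on assumptions that do not come from the review: uSDR is taken as a plain energy ratio over the whole track, and cSDR as the median of the same ratio over 1-second chunks. Published cSDR figures are normally produced with the museval toolkit, which applies BSS Eval filtering that this sketch leaves out, so the function names and defaults below are illustrative only.

```python
import numpy as np


def sdr_db(reference, estimate, eps=1e-8):
    """Plain signal-to-distortion ratio in dB over all samples and channels."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    signal_energy = np.sum(reference ** 2)
    error_energy = np.sum((reference - estimate) ** 2)
    return 10.0 * np.log10((signal_energy + eps) / (error_energy + eps))


def usdr_db(reference, estimate):
    """Utterance-level SDR: a single score for the whole track."""
    return float(sdr_db(reference, estimate))


def csdr_db(reference, estimate, sample_rate=44100, chunk_seconds=1.0):
    """Chunk-level SDR: median of per-chunk scores.

    Simplified stand-in for the museval computation (assumption): no
    BSS Eval distortion filters are applied.
    """
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    chunk = int(sample_rate * chunk_seconds)
    scores = []
    for start in range(0, reference.shape[-1] - chunk + 1, chunk):
        ref_chunk = reference[..., start:start + chunk]
        if np.sum(ref_chunk ** 2) == 0.0:
            continue  # skip silent reference chunks, where the ratio is undefined
        scores.append(sdr_db(ref_chunk, estimate[..., start:start + chunk]))
    return float(np.median(scores)) if scores else float("nan")
```

Scores for a dataset are then aggregated across tracks. Which aggregate is used (mean or median) should be checked against the paper and its released code.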

Figure 1: Overview of BSRNN and its variants (A). Blocs with dashed line contours denote variants not in the original architecture. The residual network (B) in the sequence / band modules are based on either RNNs, as in the original model, or dilated convolutions.

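Experimental Results

Research Questions

RQ1: What are the main reproducibility barriers when replicating BSRNN on MUSDB18-HQ?
RQ2: How do changes to the architecture and training affect MSS performance and the cost of reproduction?
RQ3: Can targeted variants close the gap between the replicated results and the original BSRNN performance?
RQ4: What are the energy and time implications of pursuing reproducible MSS research?
RQ5: Does providing an open, runnable pipeline improve reproducibility and community adoption?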

Key Findings

Model	Vocals uSDR (dB)	Bass uSDR (dB)	Drums uSDR (dB)	Other uSDR (dB)	Average uSDR (dB)	Parameters (M)	Energy (codecarbon, kWh)	Energy (green algo., kWh)
Base model: N=64, R=8	7.7	6.1	9.7	4.8	7.1	32.3	127	168
Accumulating gradients	8.0	5.8	9.6	4.9	7.1	-	129	170
Monitoring with the loss	7.5	6.4	9.3	4.8	7.1	-	120	159
Loss domain: time	7.9	6.1	9.4	4.9	7.2	-	116	153
Loss domain: STFT	7.9	6.4	9.6	4.9	7.2	-	131	173
STFT: window=4096, hop=1024	7.3	5.9	8.7	4.4	6.6	37.1	58	92
Masker factor μ=2	7.9	6.8	9.4	4.4	7.1	20.6	110	151
Large model: N=128, R=12	9.2	7.3	10.3	5.8	8.2	146.7	230	321
Large model with patience=30	9.5	7.8	10.3	6.3	8.4	-	354	495
Stereo Naive	7.7	6.6	8.4	4.0	6.7	37.1	78	122
Naive, with μ=8	7.9	6.1	8.7	4.3	6.7	81.1	87	140
TAC with TanH	7.6	6.0	9.6	4.3	6.8	34.7	117	154
TAC with PReLU	7.9	6.5	10.0	4.7	7.3	34.7	128	167
BSCNN	7.3	5.9	9.0	4.2	6.6	29.7	113	153
Attention: Na=1, Ea=8	7.7	7.4	10.4	4.8	7.6	33.0	151	199
Attention: Na=2, Ea=16	8.2	7.7	10.4	4.9	7.8	33.2	157	224
Multi-head: H=2	7.6	5.5	9.1	4.0	6.6	22.0	91	137
Silent target (instead of all sources)	7.9	6.6	9.5	4.4	7.1	32.3	110	146
No SAD; UMX-like augmentations	8.2	6.9	9.5	5.3	7.5	-	135	179
Optimized models: with TAC	10.1	9.1	10.9	6.7	9.2	149.9	426	593
+ TAC	10.2	10.2	11.3	6.9	9.6	164.1	508	711
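
One way to read the energy columns is to ask how many extra kilowatt-hours each variant spends per dB of average uSDR gained over the base model. The short sketch below does that arithmetic for three rows copied from the table above (CodeCarbon column). The "kWh per dB" ratio is an illustrative quantity introduced here, not a metric reported by the authors, and it treats each row as a single comparable training run.

```python
# (average uSDR in dB, CodeCarbon energy in kWh), copied from the table above
BASE = (7.1, 127)
VARIANTS = {
    "Large model: N=128, R=12": (8.2, 230),
    "Large model with patience=30": (8.4, 354),
    "Attention: Na=2, Ea=16": (7.8, 157),
}


def extra_kwh_per_db(base, variant):
    """Extra energy spent per dB of average uSDR gained over the base model."""
    gain_db = variant[0] - base[0]
    extra_kwh = variant[1] - base[1]
    if gain_db <= 0:
        return float("inf")  # no gain: the extra energy bought nothing
    return extra_kwh / gain_db


for name, row in VARIANTS.items():
    print(f"{name}: {extra_kwh_per_db(BASE, row):.0f} kWh per dB gained")
```

On these rows the attention variant is the cheapest gain per dB, while extending the patience of the large model is the most expensive.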

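Reproducing the original BSRNN results proved difficult, although several variants improved performance beyond the originally reported figures.
Stereo modeling, self-attention, and careful data generation choices have a notable effect on MSS performance and resource usage.
Optimized variants combining TAC, attention, and larger models reached higher validation uSDR than the base model, at a higher energy cost.
The inference and evaluation pipeline (e.g., differences in segment length and overlap-add strategy) can shift vocal test scores by up to about 0.3 dB.
Releasing the code and pre-trained models lowers the barrier to reproducibility and promotes more energy-aware, transparent MSS research.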
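
The sensitivity to segment length and overlap-add strategy noted above is easier to see with a concrete inference loop. The following is a minimal sketch of one common strategy, windowed overlap-add with weight normalization. It is not the authors' pipeline: `separate_fn` is a hypothetical stand-in for a trained separator, and the Hann taper, `segment_len`, and `hop_len` are assumptions chosen for illustration.

```python
import numpy as np


def separate_with_overlap_add(mixture, separate_fn, segment_len, hop_len):
    """Apply separate_fn to overlapping segments and cross-fade the outputs.

    mixture: array of shape (..., n_samples).
    separate_fn: hypothetical callable mapping a segment to an estimate
        of the same shape (stand-in for a trained model).
    """
    if not 0 < hop_len <= segment_len:
        raise ValueError("hop_len must be in (0, segment_len]")
    mixture = np.asarray(mixture, dtype=np.float64)
    n_samples = mixture.shape[-1]
    # Hann taper without its zero endpoints, so every weight is positive.
    window = np.hanning(segment_len + 2)[1:-1]
    output = np.zeros_like(mixture)
    weight = np.zeros(n_samples)
    start = 0
    while True:
        end = min(start + segment_len, n_samples)
        taper = window[: end - start]
        output[..., start:end] += separate_fn(mixture[..., start:end]) * taper
        weight[start:end] += taper
        if end == n_samples:
            break
        start += hop_len
    # Normalize by the accumulated taper so overlapping regions are averaged.
    return output / weight
```

Changing `segment_len` or `hop_len` changes how much context the model sees per segment and how its boundary errors are averaged, which is the kind of pipeline detail that can move test scores by a few tenths of a dB.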

Figure 2: Validation uSDR over epochs for the base model on the vocals track, with a patience of 10 (left) or 30 (right). Each color corresponds to a different run, and the dashed lines correspond to each run’s best uSDR.
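
The patience setting compared in Figure 2 can be sketched as a simple stopping rule on a higher-is-better validation metric. The function below is an illustrative assumption about how such a rule typically works, not the authors' training code, and the validation curve is made up for the example.

```python
def best_epoch_with_patience(val_usdr_per_epoch, patience):
    """Return (stop_epoch, best_epoch, best_score) for a higher-is-better metric.

    Training stops once `patience` epochs pass without a new best score.
    """
    best_score = float("-inf")
    best_epoch = -1
    for epoch, score in enumerate(val_usdr_per_epoch):
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch, best_score
    return len(val_usdr_per_epoch) - 1, best_epoch, best_score


# Hypothetical validation curve (not data from the paper): a plateau, then a late gain.
curve = [5.0, 6.0, 6.5, 6.4, 6.3, 6.4, 6.6, 6.7]
print(best_epoch_with_patience(curve, patience=3))  # stops on the plateau
print(best_epoch_with_patience(curve, patience=5))  # survives it and sees the late gain
```

A larger patience lets a run survive a plateau and pick up late improvements, at the price of more epochs and therefore more energy, which matches the trade-off visible between the two large-model rows of the table.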


This review was generated by AI and checked by a human editor.