QUICK REVIEW

[Paper Review] Music Source Separation in the Waveform Domain

Alexandre Défossez, Nicolas Usunier|arXiv (Cornell University)|Nov 27, 2019

Speech and Audio Processing|References: 50|Citations: 184

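One-Line Summary

This paper compares waveform-domain architectures for music source separation and introduces Demucs, a model that combines a U-Net structure with a bidirectional LSTM. With data augmentation, Demucs outperforms both spectrogram-based methods and Conv-Tasnet on MusDB, improving SDR as well as naturalness.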

ABSTRACT

Source separation for music is the task of isolating contributions, or stems, from different instruments recorded individually and arranged together to form a song. Such components include voice, bass, drums and any other accompaniments. Contrarily to many audio synthesis tasks where the best performances are achieved by models that directly generate the waveform, the state-of-the-art in source separation for music is to compute masks on the magnitude spectrum. In this paper, we compare two waveform domain architectures. We first adapt Conv-Tasnet, initially developed for speech source separation, to the task of music source separation. While Conv-Tasnet beats many existing spectrogram-domain methods, it suffers from significant artifacts, as shown by human evaluations. We propose instead Demucs, a novel waveform-to-waveform model, with a U-Net structure and bidirectional LSTM. Experiments on the MusDB dataset show that, with proper data augmentation, Demucs beats all existing state-of-the-art architectures, including Conv-Tasnet, with 6.3 SDR on average, (and up to 6.8 with 150 extra training songs, even surpassing the IRM oracle for the bass source). Using recent development in model quantization, Demucs can be compressed down to 120MB without any loss of accuracy. We also provide human evaluations, showing that Demucs benefit from a large advantage in terms of the naturalness of the audio. However, it suffers from some bleeding, especially between the vocals and other source.

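Motivation and Objectives

Motivate waveform-domain approaches to music source separation as a way to move beyond spectrogram masking.
Adapt Conv-Tasnet to 44.1 kHz stereo music, evaluate it, and identify its artifacts.
Introduce Demucs, a new waveform-to-waveform architecture, and evaluate its performance against state-of-the-art methods.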

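Proposed Method

Adapt the Conv-Tasnet architecture to 44.1 kHz stereo music, with adjusted encoder/decoder settings.
Define a regression loss (L1) on the reconstructed sources instead of SI-SNR.
Develop Demucs, a U-Net encoder-decoder with a bidirectional LSTM between the two halves, using wide transposed convolutions and gated linear units.
Apply data augmentation, including pitch/tempo shifting, to improve generalization.
Compare the waveform-domain models against spectrogram-domain baselines on the MusDB dataset.
Assess the naturalness and artifact levels of the generated audio through human evaluations.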
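
The difference between the two training objectives named above can be illustrated with a minimal sketch. This is not the authors' code: the function names `l1_loss` and `si_snr`, the plain-list waveform representation, and the `eps` constant are all assumptions made for illustration. The sketch shows that an L1 regression loss penalizes a wrongly scaled estimate, while SI-SNR is by construction insensitive to scale.

```python
import math


def l1_loss(estimates, references):
    """Mean absolute error over all sources and samples.

    Hypothetical helper: each argument is a list of sources,
    and each source is a list of float samples.
    """
    total, count = 0.0, 0
    for est, ref in zip(estimates, references):
        for e, r in zip(est, ref):
            total += abs(e - r)
            count += 1
    return total / count


def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB for a single mono source."""
    # Remove the mean of each signal first.
    est_mean = sum(estimate) / len(estimate)
    ref_mean = sum(reference) / len(reference)
    est = [e - est_mean for e in estimate]
    ref = [r - ref_mean for r in reference]
    # Project the estimate onto the reference to get the target component.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref) + eps
    scale = dot / ref_energy
    target = [scale * r for r in ref]
    # Everything not explained by the scaled reference counts as noise.
    noise = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(n * n for n in noise)
    return 10 * math.log10((target_energy + eps) / (noise_energy + eps))
```

Halving the amplitude of an estimate leaves `si_snr` essentially unchanged but increases `l1_loss`, so a model trained with the L1 objective is also pushed toward the correct output level.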

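Experimental Results

Research Questions

RQ1: Can waveform-domain architectures achieve higher SDR than spectrogram-domain methods on MusDB?
RQ2: Do artifacts limit Conv-Tasnet's music separation performance, and can a waveform-to-waveform model mitigate them?
RQ3: With data augmentation, does the Demucs architecture beat state-of-the-art spectrogram-domain methods and Conv-Tasnet?
RQ4: How does pitch/tempo shift augmentation affect the performance of Demucs and Conv-Tasnet?
RQ5: How does Demucs perform in terms of naturalness and bleeding between sources, as judged by human evaluators?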

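Key Findings

Demucs reaches 6.3 SDR on average on MusDB without extra training data, surpassing the best existing method (6.0 SDR).
With 150 extra training songs, Demucs reaches up to 6.8 SDR and surpasses the IRM oracle on the bass source (7.6 SDR vs. 7.1 for IRM).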
Conv-Tasnet is a strong waveform-domain model, but it produces artifacts and hollow instrument attacks, which are less pronounced in Demucs.
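Data augmentation with pitch/tempo shifting gives Demucs a 0.4 SDR improvement but is less effective for Conv-Tasnet.
Human evaluations show a large advantage for Demucs in naturalness, although some bleeding occurs between the vocals and the other source.
With quantization, Demucs can be compressed to about 120MB without loss of accuracy.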
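
For readers unfamiliar with the SDR figures quoted above, the following sketch shows the basic idea of a signal-to-distortion ratio in dB and of averaging it across sources. It is a simplified illustration under stated assumptions: the helper names `sdr_db` and `mean_sdr` are hypothetical, waveforms are plain lists of floats, and the published MusDB numbers come from a more elaborate evaluation protocol that this sketch does not reproduce.

```python
import math


def sdr_db(reference, estimate, eps=1e-10):
    """Plain signal-to-distortion ratio in dB (simplified, hypothetical helper)."""
    # Energy of the true source signal.
    signal = sum(r * r for r in reference)
    # Energy of the residual between the true source and the estimate.
    error = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10 * math.log10((signal + eps) / (error + eps))


def mean_sdr(per_source_sdr):
    """Average SDR over a dict mapping source name to SDR in dB."""
    return sum(per_source_sdr.values()) / len(per_source_sdr)
```

Because the scale is logarithmic, a 10 dB gain corresponds to a tenfold reduction in error energy relative to the signal, which is why differences of a few tenths of a dB between models are treated as meaningful in the comparisons above.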

This review was generated by AI and checked by a human editor.