QUICK REVIEW

[Paper Review] Music Source Separation in the Waveform Domain

Alexandre Défossez, Nicolas Usunier|arXiv (Cornell University)|Nov 27, 2019

Speech and Audio Processing | References: 50 | Cited by: 184

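One-Sentence Summary

This paper compares waveform-domain architectures for music source separation and introduces Demucs, a waveform-to-waveform model with a U-Net structure and a bidirectional LSTM. With data augmentation, Demucs outperforms both spectrogram-based methods and Conv-Tasnet on MusDB, achieving higher SDR and more natural-sounding audio.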

ABSTRACT

Source separation for music is the task of isolating contributions, or stems, from different instruments recorded individually and arranged together to form a song. Such components include voice, bass, drums and any other accompaniments. Contrarily to many audio synthesis tasks where the best performances are achieved by models that directly generate the waveform, the state-of-the-art in source separation for music is to compute masks on the magnitude spectrum. In this paper, we compare two waveform domain architectures. We first adapt Conv-Tasnet, initially developed for speech source separation, to the task of music source separation. While Conv-Tasnet beats many existing spectrogram-domain methods, it suffers from significant artifacts, as shown by human evaluations. We propose instead Demucs, a novel waveform-to-waveform model, with a U-Net structure and bidirectional LSTM. Experiments on the MusDB dataset show that, with proper data augmentation, Demucs beats all existing state-of-the-art architectures, including Conv-Tasnet, with 6.3 SDR on average, (and up to 6.8 with 150 extra training songs, even surpassing the IRM oracle for the bass source). Using recent development in model quantization, Demucs can be compressed down to 120MB without any loss of accuracy. We also provide human evaluations, showing that Demucs benefit from a large advantage in terms of the naturalness of the audio. However, it suffers from some bleeding, especially between the vocals and other source.

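Motivation and Objectives

Advance waveform-domain methods for music source separation beyond spectrogram masking.
Adapt Conv-Tasnet to 44.1 kHz stereo music, evaluate it, and identify its artifacts.
Introduce Demucs, a novel waveform-to-waveform architecture, and evaluate its performance against state-of-the-art methods.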

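Proposed Method

Adapt the Conv-Tasnet architecture to 44.1 kHz stereo music by adjusting the encoder/decoder settings.
Train source reconstruction with a regression loss (L1) rather than SI-SNR.
Develop Demucs, a U-Net encoder-decoder with a bidirectional LSTM between the encoder and the decoder, using wide transposed convolutions and gated linear units (GLUs).
Apply data augmentation, including pitch/tempo shifting, to improve generalization.
Compare the waveform-domain models against spectrogram-domain baselines on the MusDB dataset.
Assess the naturalness and artifact level of the generated audio through human evaluations.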
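
One practical difference between the two training objectives mentioned above is that SI-SNR ignores the overall volume of an estimate, while an L1 regression loss penalizes it. The sketch below illustrates this with a toy sine wave in pure Python. The helper names (`l1_loss`, `si_snr`) and the test signals are illustrative assumptions, and `si_snr` follows a common textbook definition rather than the authors' training code.

```python
import math

def l1_loss(estimate, target):
    """Mean absolute error between two equal-length waveforms."""
    return sum(abs(e - t) for e, t in zip(estimate, target)) / len(target)

def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB (common definition, with mean removal)."""
    n = len(target)
    mean_e = sum(estimate) / n
    mean_t = sum(target) / n
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]
    # Project the estimate onto the target to isolate the "signal" part.
    dot = sum(a * b for a, b in zip(e, t))
    t_energy = sum(b * b for b in t) + eps
    proj = [dot / t_energy * b for b in t]
    # Whatever is left after the projection counts as noise.
    noise = [a - p for a, p in zip(e, proj)]
    proj_energy = sum(p * p for p in proj)
    noise_energy = sum(x * x for x in noise) + eps
    return 10 * math.log10(proj_energy / noise_energy + eps)

# Toy target: 5 periods of a sine wave over 100 samples (hypothetical data).
target = [math.sin(2 * math.pi * 5 * i / 100) for i in range(100)]
# Imperfect estimate: the target plus a small deterministic perturbation.
estimate = [t + 0.05 * math.cos(2 * math.pi * 13 * i / 100)
            for i, t in enumerate(target)]
# The same estimate played back at half volume.
half_volume = [0.5 * x for x in estimate]

print("SI-SNR full / half volume:",
      si_snr(estimate, target), si_snr(half_volume, target))
print("L1     full / half volume:",
      l1_loss(estimate, target), l1_loss(half_volume, target))
```

In this toy setting the two SI-SNR values are essentially identical, whereas the L1 loss grows sharply for the half-volume estimate, so an L1-trained model is pushed to get the level of each source right as well as its shape.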

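Experimental Results

Research Questions

RQ1: Can waveform-domain architectures achieve a higher SDR on MusDB than spectrogram-domain methods?
RQ2: Do artifacts limit Conv-Tasnet's performance on music separation, and can a waveform-to-waveform model mitigate them?
RQ3: With data augmentation, does the Demucs architecture outperform state-of-the-art spectrogram-domain methods and Conv-Tasnet?
RQ4: How much does pitch/tempo shift augmentation affect the performance of Demucs and Conv-Tasnet?
RQ5: According to human evaluations, how does Demucs perform in terms of naturalness and bleeding between sources?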

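Key Findings

Demucs reaches 6.3 SDR on average on MusDB without extra training data, surpassing the best existing method (6.0 SDR).
With 150 extra training songs, Demucs reaches 6.8 SDR on average and surpasses the IRM oracle on the bass source (7.6 SDR versus 7.1 for IRM).
Conv-Tasnet is strong among waveform models, but it produces artifacts and hollow instrument attacks that are less pronounced in Demucs.
Pitch/tempo shift augmentation yields a 0.4 SDR gain for Demucs but brings a smaller benefit to Conv-Tasnet.
In human evaluations, Demucs shows a clear advantage in naturalness, although some bleeding remains between the vocals and the other sources.
Demucs can be compressed to about 120MB through quantization without any loss of accuracy.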
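
To give a feel for the SDR figures quoted above, here is a minimal sketch of a simplified signal-to-distortion ratio: the energy of the reference source divided by the energy of the estimation error, expressed in dB. This is an assumption made for illustration only. MusDB results are normally reported with the BSS Eval toolkit, which computes SDR differently (for instance over short windows and then aggregated), so `simple_sdr` is a hypothetical helper and not the metric implementation behind the numbers in the paper.

```python
import math

def simple_sdr(estimate, target, eps=1e-10):
    """Simplified SDR in dB: reference energy over error energy."""
    signal_energy = sum(t * t for t in target)
    error_energy = sum((t - e) ** 2 for e, t in zip(estimate, target))
    return 10 * math.log10((signal_energy + eps) / (error_energy + eps))

# Hypothetical reference source: 3 periods of a sine wave over 200 samples.
reference = [math.sin(2 * math.pi * 3 * i / 200) for i in range(200)]

# An estimate whose error has 10% of the reference amplitude.
close_estimate = [0.9 * x for x in reference]
# An estimate whose error has 50% of the reference amplitude.
rough_estimate = [0.5 * x for x in reference]

print("close estimate:", simple_sdr(close_estimate, reference), "dB")
print("rough estimate:", simple_sdr(rough_estimate, reference), "dB")
```

Because the scale is logarithmic, a gain of a few tenths of a dB (such as 6.0 to 6.3) corresponds to a modest but measurable reduction in error energy relative to the source energy.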

This review was generated by AI and reviewed by a human editor.