[Paper Review] Universal Source Separation with Weakly Labelled Data
This work trains a universal source separation system using only weakly labelled data from AudioSet, enabling separation of hundreds of sound classes without clean sources. It achieves strong SDR improvements across multiple datasets and introduces hierarchical, query-based separation conditioned on anchor segments.
Universal source separation (USS) is a fundamental research task for computational auditory scene analysis, which aims to separate mono recordings into individual source tracks. Three challenges remain open in audio source separation. First, previous audio source separation systems mainly focus on separating one or a limited number of specific sources, and there is little research on building a unified system that can separate arbitrary sources with a single model. Second, most previous systems require clean source data to train a separator, while clean source data are scarce. Third, there is a lack of USS systems that can automatically detect and separate active sound classes at a hierarchical level. To use large-scale weakly labelled or unlabelled audio data for audio source separation, we propose a universal audio source separation framework containing: 1) an audio tagging model trained on weakly labelled data as a query net; and 2) a conditional source separation model that takes query net outputs as conditions to separate arbitrary sound sources. We investigate various query nets, source separation models, and training strategies and propose a hierarchical USS strategy to automatically detect and separate sound classes from the AudioSet ontology. By solely leveraging the weakly labelled AudioSet, our USS system is successful in separating a wide variety of sound classes, covering sound event separation, music source separation, and speech enhancement. The USS system achieves an average signal-to-distortion ratio improvement (SDRi) of 5.57 dB over 527 sound classes of AudioSet; 10.57 dB on the DCASE 2018 Task 2 dataset; 8.12 dB on the MUSDB18 dataset; 7.28 dB on the Slakh2100 dataset; and an SSNR of 9.00 dB on the voicebank-demand dataset. We release the source code at https://github.com/bytedance/uss
Motivation & Objective
- Motivate universal source separation (USS) capable of handling arbitrary sources with a single model.
- Overcome reliance on clean source data by leveraging large-scale weakly labelled data (AudioSet).
- Automatically detect and separate active sound classes via a hierarchical, ontology-aware approach.
- Develop a query-based separation framework where conditioning signals guide separation.
- Investigate the impact of various query nets, anchor mining strategies, and training schemes on USS performance.
Proposed method
- Propose a four-step USS pipeline that uses weakly labelled data: sampling, anchor segment mining, audio tagging to produce query embeddings, and mixture-based training of a conditional separator.
- Use anchor segments mined via pretrained or finetuned audio tagging models (PANNs or HTS-AT) to create short, likely-active segments for training.
- Employ a ResUNet-based source separator conditioned by FiLM-modulated embeddings derived from query nets (hard one-hot, soft probabilities, latent embeddings, or learnable embeddings).
- Train end-to-end with L1 loss on waveforms and apply energy-based data augmentation to balance anchor pair energies.
- Adopt a hierarchical AudioSet ontology to perform automatic, level-wise active sound detection and separation, enabling scalable USS across levels of granularity.
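The training-side steps listed above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`mine_anchor`, `balance_energy`, `film`, `l1_loss`) are hypothetical, anchor mining is reduced to picking the window with the highest mean tagging probability, and `film` is the generic scale-and-shift form of feature-wise linear modulation, which may differ from the exact variant used in the paper's ResUNet.

```python
import numpy as np

def mine_anchor(frame_probs, anchor_len):
    """Return (start, end) of the anchor_len-frame window whose mean
    tagging probability for the target class is highest (assumed rule)."""
    frame_probs = np.asarray(frame_probs, dtype=float)
    if anchor_len < 1 or anchor_len > len(frame_probs):
        raise ValueError("anchor_len must be in [1, len(frame_probs)]")
    # Sliding-window sums computed from a cumulative sum.
    csum = np.concatenate(([0.0], np.cumsum(frame_probs)))
    window_sums = csum[anchor_len:] - csum[:-anchor_len]
    start = int(np.argmax(window_sums))
    return start, start + anchor_len

def balance_energy(s1, s2, eps=1e-10):
    """Scale s2 so that its energy matches the energy of s1."""
    e1 = np.sum(s1 ** 2)
    e2 = np.sum(s2 ** 2)
    return s2 * np.sqrt(e1 / max(e2, eps))

def film(features, cond, w_gamma, w_beta):
    """Generic FiLM: per-channel scale and shift predicted from cond.

    features: (channels, time), cond: (embed_dim,),
    w_gamma and w_beta: (channels, embed_dim).
    """
    gamma = w_gamma @ cond
    beta = w_beta @ cond
    return gamma[:, None] * features + beta[:, None]

def l1_loss(estimate, target):
    """Mean absolute error between two waveforms."""
    return float(np.mean(np.abs(estimate - target)))

# Toy usage: mine an anchor, then build an energy-balanced training mixture.
rng = np.random.default_rng(0)
probs = np.array([0.1, 0.2, 0.9, 0.8, 0.7, 0.1])
start, end = mine_anchor(probs, anchor_len=3)

s1 = rng.standard_normal(1000)          # anchor segment of the target class
s2 = 0.1 * rng.standard_normal(1000)    # quieter anchor from another clip
mixture = s1 + balance_energy(s1, s2)   # separator input; s1 is the L1 target
```

In an actual training loop the separator would receive `mixture` together with the query embedding of `s1`, and `l1_loss` would compare its output against `s1`. Matching the energies first keeps either anchor from dominating the mixture.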

Experimental results
Research questions
- RQ1: Can USS be trained solely with weakly labelled data to separate hundreds of sound classes?
- RQ2: How effective are anchor-segment mining and different query embeddings in guiding separation?
- RQ3: Does hierarchical ontology-based detection enable automatic, scalable USS across AudioSet levels?
- RQ4: What are the SDRi gains on diverse datasets when training only with AudioSet?
- RQ5: How do data augmentation and energy balancing affect separation performance?
Key findings
- The USS system trained only on AudioSet achieves SDR improvements (SDRi) across multiple datasets: 5.57 dB over 527 AudioSet classes; 10.57 dB on DCASE 2018 Task 2; 8.12 dB on MUSDB18; 7.28 dB on Slakh2100; and 9.00 dB SSNR on voicebank-demand.
- Anchor segment mining with sound event detection (SED) models localizes target events within weakly labelled clips, which makes training possible without clean sources.
- Query embeddings derived from audio tagging models (hard/soft/latent/learnable) effectively condition the separator, with FiLM-based integration into a ResUNet backbone.
- Hierarchical ontology grouping allows automatic detection and separation at different levels of AudioSet, reducing the need for predefined target lists.
- The framework demonstrates broad applicability to sound event separation, music source separation, and speech enhancement without resorting to clean-source data.
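For readers unfamiliar with the headline metric, SDRi is the SDR of the separated output minus the SDR of the unprocessed mixture, both measured against the reference source. The sketch below assumes the plain energy-ratio definition of SDR; the function names are illustrative, and evaluation toolkits used for datasets such as MUSDB18 may compute SDR variants that differ from it.

```python
import math

def sdr(reference, estimate, eps=1e-10):
    """Signal-to-distortion ratio in dB (plain energy-ratio definition)."""
    signal = sum(r * r for r in reference)
    noise = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10.0 * math.log10(max(signal, eps) / max(noise, eps))

def sdri(reference, estimate, mixture):
    """SDR improvement: gain of the estimate over the raw mixture."""
    return sdr(reference, estimate) - sdr(reference, mixture)

# Toy signals: the mixture adds an interferer with the same energy as the
# reference, and the estimate leaves only a small residual error.
reference = [1.0, -1.0, 1.0, -1.0]
mixture = [2.0, 0.0, 2.0, 0.0]
estimate = [1.1, -0.9, 1.1, -0.9]

improvement = sdri(reference, estimate, mixture)  # roughly 20 dB here
```

Because the mixture SDR is subtracted, a positive SDRi means the separator brought the output closer to the reference than the mixture already was.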

This review was created by AI and reviewed by human editors.