[Paper Review] Sound Event Detection and Separation: a Benchmark on Desed Synthetic Soundscapes
The paper benchmarks state-of-the-art sound event detection (SED) systems on DESED synthetic soundscapes, analyzing time localization, reverberation, and non-target events, and evaluates the impact of sound separation as a preprocessing step.
We propose a benchmark of state-of-the-art sound event detection systems (SED). We designed synthetic evaluation sets to focus on specific sound event detection challenges. We analyze the performance of the submissions to DCASE 2021 task 4 depending on time related modifications (time position of an event and length of clips) and we study the impact of non-target sound events and reverberation. We show that the localization in time of sound events is still a problem for SED systems. We also show that reverberation and non-target sound events are severely degrading the performance of the SED systems. In the latter case, sound separation seems like a promising solution.
Motivation & Objective
- Motivate robust SED for real-world, multi-event environments with weakly labeled training data.
- Assess how synthetic DESED soundscapes can reveal specific SED challenges (timing, overlap, reverberation).
- Evaluate the impact of sound separation as a preprocessing step on SED performance under challenging conditions.
Proposed method
- Use synthetic evaluation sets designed to isolate SED challenges (timing, duration, overlap, reverberation).
- Benchmark submissions to DCASE 2021 task 4 on synthetic evaluation sets and official real-evaluation data.
- Analyze robustness to non-target events and reverberation, with and without sound separation (SSep) preprocessing.
- Employ event-based F-score with 200 ms onset collar and a flexible offset collar for evaluation.
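The collar-based matching behind the event-based F-score can be sketched as follows. This is a minimal illustration, not the benchmark's evaluation code: the `Event` class and both function names are hypothetical, the matching is a simple greedy pass, and the "flexible" offset collar is assumed to be the larger of 200 ms and 20% of the reference event's duration, which the review itself does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical container for one labeled sound event (times in seconds)."""
    label: str
    onset: float
    offset: float


def matches(ref, est, onset_collar=0.2, offset_collar=0.2, offset_ratio=0.2):
    """Return True if `est` matches `ref` within the collars.

    Assumption: the flexible offset tolerance is
    max(offset_collar, offset_ratio * reference duration).
    """
    if ref.label != est.label:
        return False
    onset_ok = abs(ref.onset - est.onset) <= onset_collar
    tolerance = max(offset_collar, offset_ratio * (ref.offset - ref.onset))
    offset_ok = abs(ref.offset - est.offset) <= tolerance
    return onset_ok and offset_ok


def event_based_f_score(refs, ests, **collars):
    """Greedy one-to-one matching of estimated events to reference events."""
    unmatched = list(ests)
    true_positives = 0
    for ref in refs:
        for est in unmatched:
            if matches(ref, est, **collars):
                unmatched.remove(est)  # each estimate may match only once
                true_positives += 1
                break
    false_positives = len(ests) - true_positives
    false_negatives = len(refs) - true_positives
    denominator = 2 * true_positives + false_positives + false_negatives
    return 2 * true_positives / denominator if denominator else 0.0


# Illustrative events only; these are not data from the paper.
refs = [Event("Speech", 1.0, 2.0), Event("Dog", 3.0, 13.0)]
ests = [Event("Speech", 1.1, 2.1), Event("Dog", 3.5, 13.0)]
print(event_based_f_score(refs, ests))  # prints 0.5
```

In this toy case the long "Dog" event is rejected because its onset is 500 ms late, even though the estimate overlaps the reference for 9.5 of its 10 seconds. The onset collar stays fixed regardless of event length, which is one reason collar-based scores can be harsh on long events.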
Experimental results
Research questions
- RQ1: How does time localization within a clip affect SED performance, especially for long events?
- RQ2: What is the impact of reverberation and non-target events on SED performance, and can SSep mitigate these effects?
- RQ3: Do clip length (10 s vs 60 s) and event density affect detection robustness?
- RQ4: Can SSep preprocessing improve robustness to non-target events without harming baseline SED performance?
- RQ5: What are the limitations of current collar-based evaluation metrics in long-event scenarios?
Key findings
- Reverberation degrades SED performance by about 15% on average in F-score.
- Performance degrades on 60 s clips compared to the synthetic reference set (ref); several systems show substantial drops in recall, which points to segmentation and temporal localization issues.
- The position of an event within a clip has only a minor impact on short events, but performance on long events degrades when the event occurs toward the end of the clip, suggesting windowing or post-processing biases.
- Systems with SSep show reduced degradation due to non-target events (about 12.5% vs 19% without SSep in F-score across TNTSNR conditions).
- SSep does not consistently improve performance when no non-target events are present (TNTSNR_inf).
- Longer clips (60 s) generally pose more difficulty for SED systems, primarily due to lower recall and potential threshold adaptations.
This review was created by AI and reviewed by human editors.