QUICK REVIEW

[Paper Review] Sound Event Detection and Separation: a Benchmark on Desed Synthetic Soundscapes

Nicolas Turpault, Romain Serizel|arXiv (Cornell University)|Nov 2, 2020

Music and Audio Processing | References: 30 | Cited by: 30

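One-Sentence Summary

This paper benchmarks state-of-the-art sound event detection (SED) systems on DESED synthetic soundscapes, analyzes the effects of temporal localization, reverberation, and non-target events, and evaluates the impact of using sound separation as a preprocessing step.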

ABSTRACT

We propose a benchmark of state-of-the-art sound event detection systems (SED). We designed synthetic evaluation sets to focus on specific sound event detection challenges. We analyze the performance of the submissions to DCASE 2021 task 4 depending on time related modifications (time position of an event and length of clips) and we study the impact of non-target sound events and reverberation. We show that the localization in time of sound events is still a problem for SED systems. We also show that reverberation and non-target sound events are severely degrading the performance of the SED systems. In the latter case, sound separation seems like a promising solution.

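Research Motivation and Objectives

Motivate robust SED in real-world, multi-event environments when only weakly supervised training data is available.
Assess how synthetic DESED soundscapes expose specific SED challenges (timing, overlap, reverberation).
Evaluate how sound separation, used as a preprocessing step, affects SED performance under challenging conditions.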

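Proposed Method

Use synthetic evaluation sets designed to isolate individual SED challenges (timing, duration, overlap, reverberation).
Benchmark the submissions to DCASE 2020 Task 4 on the synthetic evaluation sets and on the official real-world evaluation data.
Analyze robustness to non-target events and reverberation with and without sound separation (SSep) preprocessing.
Score with the event-based F-score, using a 200 ms onset collar and a flexible offset collar.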
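
As an illustration of the collar-based scoring mentioned above, here is a minimal sketch of an event-based F-score. It is not the benchmark's evaluation code. The function name `event_f_score`, the greedy one-to-one matching, and the offset tolerance of max(200 ms, 20% of the reference event length) are assumptions made for this example; the summary only says the offset collar is "flexible".

```python
from typing import List, Tuple

# An event is (label, onset in seconds, offset in seconds).
Event = Tuple[str, float, float]


def event_f_score(
    reference: List[Event],
    estimated: List[Event],
    onset_collar: float = 0.2,   # 200 ms onset collar, as stated in the summary
    offset_collar: float = 0.2,  # assumed minimum offset tolerance
    offset_ratio: float = 0.2,   # assumed: tolerance grows with event length
) -> float:
    """Event-based F-score with greedy one-to-one matching (illustrative only)."""
    matched = set()  # indices of reference events already paired
    tp = 0
    for label, onset, offset in estimated:
        for i, (r_label, r_onset, r_offset) in enumerate(reference):
            if i in matched or r_label != label:
                continue
            # Flexible offset collar: longer events get a wider tolerance.
            tolerance = max(offset_collar, offset_ratio * (r_offset - r_onset))
            onset_ok = abs(onset - r_onset) <= onset_collar
            offset_ok = abs(offset - r_offset) <= tolerance
            if onset_ok and offset_ok:
                matched.add(i)
                tp += 1
                break
    fp = len(estimated) - tp  # predictions with no matching reference event
    fn = len(reference) - tp  # reference events that were missed
    denom = 2 * tp + fp + fn
    # Convention chosen here: return 0.0 when there is nothing to score.
    return 2 * tp / denom if denom else 0.0


reference = [("Speech", 1.0, 3.0), ("Dog", 5.0, 15.0)]
estimated = [("Speech", 1.1, 3.1), ("Dog", 5.1, 13.5), ("Cat", 0.0, 1.0)]
score = event_f_score(reference, estimated)
strict_score = event_f_score(reference, estimated, offset_ratio=0.0)
```

In this toy case the 10 s "Dog" event is accepted despite an offset error of 1.5 s, because the assumed tolerance scales to 2 s. With `offset_ratio=0.0` the same prediction is rejected, which shows how sensitive a collar-based score can be for long events.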

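Experimental Results

Research Questions

RQ1: How does an event's position in time within a clip affect SED performance, especially for longer events?
RQ2: How do reverberation and non-target events affect SED performance, and can SSep mitigate these effects?
RQ3: Do clip length (10 s vs. 60 s) and event density affect detection robustness?
RQ4: Can SSep preprocessing improve robustness to non-target events without harming baseline SED performance when no non-target events are present?
RQ5: What are the limitations of the current collar-based evaluation metrics for long events?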

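Key Findings

Reverberation lowers SED performance by about 15% F-score on average.
Performance drops on 60 s clips relative to the synthetic reference set (ref); several systems lose recall markedly, which points to segmentation or temporal-localization problems.
Position within the clip has little effect on short events but a larger effect on long events that occur at the end of a clip, suggesting a windowing or post-processing bias.
Systems with SSep degrade less in the presence of non-target events (under the TNTSNR conditions, about 12.5% F-score versus 19% without SSep).
When no non-target events are present (TNTSNR_inf), SSep does not always improve performance.
Longer clips (60 s) are generally harder for SED systems, mainly through lower recall, possibly related to threshold tuning.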
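
To make the TNTSNR conditions more concrete, the sketch below mixes a target signal with a non-target signal at a chosen signal-to-noise ratio. It assumes that TNTSNR denotes the target-to-non-target SNR in dB, which this summary does not define, and that TNTSNR_inf corresponds to the absence of non-target events. The helper names `non_target_gain` and `mix_at_snr` are hypothetical, and this is not the procedure used to generate the DESED soundscapes.

```python
import math
from typing import List


def rms(signal: List[float]) -> float:
    """Root-mean-square level of a signal."""
    return math.sqrt(sum(s * s for s in signal) / len(signal))


def non_target_gain(target: List[float], non_target: List[float], snr_db: float) -> float:
    """Gain for the non-target signal so that the target/non-target level ratio equals snr_db."""
    if math.isinf(snr_db) and snr_db > 0:
        return 0.0  # assumed meaning of TNTSNR_inf: no non-target events
    noise_level = rms(non_target)
    if noise_level == 0.0:
        return 0.0  # silent non-target signal: nothing to scale
    return rms(target) / (noise_level * 10 ** (snr_db / 20))


def mix_at_snr(target: List[float], non_target: List[float], snr_db: float) -> List[float]:
    """Add the scaled non-target signal to the target, sample by sample."""
    gain = non_target_gain(target, non_target, snr_db)
    return [t + gain * n for t, n in zip(target, non_target)]


target = [1.0, -1.0, 1.0, -1.0]      # toy target event, RMS 1.0
non_target = [0.5, -0.5, 0.5, 0.5]   # toy non-target event, RMS 0.5
gain_0db = non_target_gain(target, non_target, 0.0)
gain_20db = non_target_gain(target, non_target, 20.0)
clean_mix = mix_at_snr(target, non_target, math.inf)
```

A lower SNR gives the non-target signal a larger gain, so it masks the target more strongly; an infinite SNR leaves the target unchanged.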


This review was generated by AI and reviewed by human editors.