QUICK REVIEW

[Paper Review] Event-Independent Network for Polyphonic Sound Event Localization and Detection

Yin Cao, Turab Iqbal|arXiv (Cornell University)|Sep 30, 2020

Music and Audio Processing | References: 22 | Cited by: 24

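One-Sentence Summary

This paper proposes an end-to-end, event-independent neural network for polyphonic sound event localization and detection (SELD) from first-order Ambisonics (FOA) input. The method makes track-wise predictions trained with frame-level permutation invariant training (tPIT) and adds an event activity detection (EAD) head that combines sound event detection (SED) and direction-of-arrival (DoA) features. On the DCASE 2020 Task 3 dataset it substantially outperforms the earlier two-stage baseline method.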

ABSTRACT

Polyphonic sound event localization and detection is not only detecting what sound events are happening but localizing corresponding sound sources. This series of tasks was first introduced in DCASE 2019 Task 3. In 2020, the sound event localization and detection task introduces additional challenges in moving sound sources and overlapping-event cases, which include two events of the same type with two different direction-of-arrival (DoA) angles. In this paper, a novel event-independent network for polyphonic sound event localization and detection is proposed. Unlike the two-stage method we proposed in DCASE 2019 Task 3, this new network is fully end-to-end. Inputs to the network are first-order Ambisonics (FOA) time-domain signals, which are then fed into a 1-D convolutional layer to extract acoustic features. The network is then split into two parallel branches. The first branch is for sound event detection (SED), and the second branch is for DoA estimation. There are three types of predictions from the network, SED predictions, DoA predictions, and event activity detection (EAD) predictions that are used to combine the SED and DoA features for on-set and off-set estimation. All of these predictions have the format of two tracks indicating that there are at most two overlapping events. Within each track, there could be at most one event happening. This architecture introduces a problem of track permutation. To address this problem, a frame-level permutation invariant training method is used. Experimental results show that the proposed method can detect polyphonic sound events and their corresponding DoAs. Its performance on the Task 3 dataset is greatly increased as compared with that of the baseline method.

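Motivation and Objectives

Address the limitations of two-stage methods in detecting overlapping sound events of the same type that arrive from different directions.
Develop an end-to-end framework that jointly predicts sound events, DoA, and event activity to improve temporal and spatial localization.
Resolve the track permutation ambiguity in multi-track prediction by introducing frame-level permutation invariant training (tPIT).
Improve onset and offset estimation with an event activity detection (EAD) head that fuses SED and DoA features.
Build an architecture that can be extended to handle more than two overlapping events.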

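Proposed Method

The network processes first-order Ambisonics (FOA) time-domain signals with a 1-D convolutional layer to extract acoustic features.
The feature stream splits into two parallel branches: one for sound event detection (SED) and one for direction-of-arrival (DoA) estimation.
For each frame the model outputs three kinds of predictions: SED, DoA, and event activity detection (EAD). Each is formatted as two tracks, so at most two events can overlap, and each track holds at most one event.
Frame-level permutation invariant training (tPIT) resolves the track permutation ambiguity: for each frame, the loss is evaluated under every possible track permutation and the permutation with the lowest loss is used for backpropagation.
The EAD head fuses feature embeddings from the SED and DoA branches to predict whether an event is present and to improve onset and offset estimation.
EAD predictions are binarized with a threshold of 0.5, and the SED and EAD outputs together act as a mask for selecting active tracks.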
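
The frame-level permutation selection described above can be illustrated with a minimal sketch. It is not the authors' implementation: it covers only an SED-style term, and the use of binary cross-entropy, the function names (`bce`, `track_loss`, `frame_level_pit_loss`), and the nested-list layout `[frames][tracks][classes]` are all assumptions made for illustration. A real training setup would compute this on tensors so that gradients flow through the selected permutation.

```python
import math
from itertools import permutations


def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one probability p and one 0/1 target y."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def track_loss(pred, target):
    """Mean BCE over the classes of a single track at a single frame."""
    return sum(bce(p, y) for p, y in zip(pred, target)) / len(pred)


def frame_level_pit_loss(preds, targets):
    """Pick, independently for every frame, the track permutation with the lowest loss.

    preds and targets are nested lists shaped [frames][tracks][classes]
    (hypothetical layout). Returns the mean of the per-frame minimum losses
    and the permutation chosen at each frame.
    """
    total, chosen = 0.0, []
    for frame_pred, frame_tgt in zip(preds, targets):
        n_tracks = len(frame_pred)
        best_loss, best_perm = None, None
        # Try every assignment of target tracks to predicted tracks.
        for perm in permutations(range(n_tracks)):
            loss = sum(
                track_loss(frame_pred[t], frame_tgt[perm[t]])
                for t in range(n_tracks)
            ) / n_tracks
            if best_loss is None or loss < best_loss:
                best_loss, best_perm = loss, perm
        total += best_loss
        chosen.append(best_perm)
    return total / len(preds), chosen


# Toy input: 2 frames, 2 tracks, 3 classes. In the second frame the
# predicted tracks are swapped relative to the target ordering.
preds = [
    [[0.9, 0.1, 0.1], [0.1, 0.8, 0.1]],
    [[0.1, 0.8, 0.1], [0.9, 0.1, 0.1]],
]
targets = [
    [[1, 0, 0], [0, 1, 0]],
    [[1, 0, 0], [0, 1, 0]],
]
loss, perms = frame_level_pit_loss(preds, targets)
print(loss, perms)
```

In this toy input the minimum-loss permutation for the second frame is the swapped one, so the swapped tracks are not penalized. With two tracks there are only two permutations per frame; the exhaustive search grows factorially with the number of tracks.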

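Experimental Results

Research Questions

RQ1: Can an end-to-end, event-independent network effectively detect and localize multiple overlapping sound events, including events of the same type arriving from different directions?
RQ2: How does frame-level permutation invariant training (tPIT) improve performance in multi-track SELD, where track assignment is ambiguous?
RQ3: How much does the event activity detection (EAD) head improve onset and offset prediction accuracy in polyphonic SELD?
RQ4: How does joint modeling of SED and DoA features through EAD compare with a one-way dependency between SED and DoA?
RQ5: Can the proposed architecture be extended to handle more than two overlapping events?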

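Key Findings

The proposed event-independent end-to-end system, combining tPIT and EAD, outperforms all baseline methods on the DCASE 2020 Task 3 dataset, including the two-stage method from DCASE 2019.
In the ablation experiments, removing both EAD and tPIT gives the worst performance, indicating that both are essential for the best results.
The "Track-Wise 3" variant, which uses both SED and EAD predictions as masks, outperforms "Track-Wise 2", which uses SED alone, demonstrating the value of EAD for temporal consistency and track binding.
Although there is a trade-off between localization recall (LR_CD) and localization error (LE_CD), the proposed "Event-Ind" method achieves the best overall balance across the metrics.
Compared with the baseline, the model achieves the highest F1 score and the lowest error rate, which supports the joint optimization and tPIT strategy.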
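
The combined SED and EAD masking used by the "Track-Wise 3" variant can be sketched roughly as follows. This is an illustrative assumption, not the paper's code: the source states only the 0.5 threshold on EAD, so applying the same threshold to the SED probability, combining the two masks with a logical AND, the function name `select_active_tracks`, and the (azimuth, elevation) tuple format for DoA are all hypothetical choices.

```python
def select_active_tracks(sed_probs, ead_probs, doa, threshold=0.5):
    """Keep a track at a frame only if both its SED and EAD outputs are active.

    sed_probs: [frames][tracks][classes] class probabilities (hypothetical layout)
    ead_probs: [frames][tracks] event activity probabilities
    doa:       [frames][tracks] direction estimates, here (azimuth, elevation)
    Returns a list of (frame, track, class_index, doa) tuples.
    """
    events = []
    for f, (sed_f, ead_f, doa_f) in enumerate(zip(sed_probs, ead_probs, doa)):
        for t, (cls_probs, ead_p, direction) in enumerate(zip(sed_f, ead_f, doa_f)):
            # Each track holds at most one event, so take the most likely class.
            best_cls = max(range(len(cls_probs)), key=cls_probs.__getitem__)
            sed_active = cls_probs[best_cls] > threshold  # assumed SED threshold
            ead_active = ead_p > threshold                # 0.5 EAD threshold from the text
            if sed_active and ead_active:
                events.append((f, t, best_cls, direction))
    return events


# Toy input: 2 frames, 2 tracks, 3 classes.
sed = [
    [[0.9, 0.05, 0.05], [0.2, 0.7, 0.1]],
    [[0.8, 0.1, 0.1], [0.3, 0.6, 0.1]],
]
ead = [
    [0.9, 0.3],   # track 1 is rejected by EAD in frame 0
    [0.95, 0.8],
]
doa = [
    [(30.0, 10.0), (-60.0, 0.0)],
    [(31.0, 10.0), (-58.0, 0.0)],
]
events = select_active_tracks(sed, ead, doa)
print(events)
```

In frame 0 the second track passes the SED check but fails the EAD check, so only the first track is reported there. This is the sort of case where adding the EAD mask changes the output relative to masking with SED alone.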


This review was generated by AI and reviewed by human editors.