QUICK REVIEW

[Paper Review] Guided multi-branch learning systems for DCASE 2020 Task 4.

Yuxin Huang, Liwei Lin|arXiv (Cornell University)|Jul 21, 2020

Music and Audio Processing8 citations

TL;DR

This paper proposes a guided multi-branch learning (MBL) system for DCASE 2020 Task 4, enhancing a prior weakly-supervised SED framework by integrating multiple pooling strategies and a sound event detection branch (SEDB) to improve feature representation and generalization. The method achieves improved performance through MBL and fusion with sound separation (SS), demonstrating significant gains in SED accuracy using synthetic data and multi-source training.

ABSTRACT

In this paper, we describe in detail our systems for DCASE 2020 Task 4. The systems are based on the 1st-place system of DCASE 2019 Task 4, which adopts weakly-supervised framework with an attention-based embedding-level multiple instance learning pooling module and a semi-supervised learning approach named Guided learning (GL). This year, we incorporate Multiple branch learning (MBL) into the original system to further improve its performance. MBL makes different branches with different pooling strategies (including instance-level and embedding-level strategies) and different pooling modules (including attention pooling, global max pooling or global average pooling modules) share the same feature encoder of the model. Therefore, multiple branches pursuing different purposes and focusing on different characteristics of the data can help the feature encoder model the feature space better and avoid over-fitting. To better exploit the strongly-labeled synthetic data, inspired by multi-task learning, we also employ a sound event detection branch (SEDB). To combine sound separation (SS) with sound event detection (SED), we fuse the results of SED systems with SS-SED systems which are trained using separated sources output by an SS system. The experimental results prove that MBL can improve the model performance and using SS has great potential to improve the performance of SED ensemble system.

Motivation & Objective

To improve the performance of weakly-supervised sound event detection (SED) systems by leveraging multiple learning branches with diverse pooling strategies.
To enhance feature representation and reduce overfitting by sharing a single feature encoder across multiple branches with distinct pooling modules.
To exploit strongly-labeled synthetic data more effectively through a dedicated sound event detection branch (SEDB) inspired by multi-task learning.
To integrate sound separation (SS) with SED by fusing outputs from SS-SED systems trained on separated audio sources.
To validate the effectiveness of multi-branch learning and SS-based ensemble methods in boosting SED performance on DCASE 2020 Task 4.

Proposed method

Introduces a multi-branch learning (MBL) framework where multiple branches share a common feature encoder but apply different pooling strategies (instance-level and embedding-level) and pooling modules (attention, global max, global average pooling).
Employs a guided learning (GL) semi-supervised approach from the DCASE 2019 1st-place system to leverage weakly-labeled data.
Incorporates a dedicated sound event detection branch (SEDB) to exploit strongly-labeled synthetic data, enhancing model generalization through multi-task learning principles.
Fuses results from SED systems with those from SS-SED systems, where the SS-SED models are trained on audio sources separated by a dedicated sound separation (SS) system.
Uses attention-based embedding-level multiple instance learning pooling to focus on relevant segments in weakly-labeled data.
Combines multiple models via ensemble learning, with SS-SED outputs used to refine the final SED predictions.

Experimental results

Research questions

RQ1Can multi-branch learning with diverse pooling strategies improve the generalization and robustness of weakly-supervised SED models?
RQ2To what extent does incorporating a dedicated SED branch for synthetic data enhance model performance on real-world SED tasks?
RQ3How effective is the fusion of sound separation (SS) outputs with SED systems in improving detection accuracy?
RQ4Does combining multiple pooling modules (e.g., attention, max, average) within a shared encoder architecture lead to better feature learning than single-branch baselines?
RQ5Can the integration of SS-SED systems significantly outperform standard SED systems in weakly-supervised settings?

Key findings

The proposed multi-branch learning (MBL) framework improves model performance over the baseline weakly-supervised system by enabling diverse pooling strategies to enhance feature representation.
Incorporating a sound event detection branch (SEDB) for synthetic data significantly boosts performance, demonstrating the value of leveraging strong labels in a semi-supervised setting.
Fusing results from SS-SED systems with standard SED systems leads to notable performance gains, confirming the potential of sound separation in enhancing SED ensembles.
The use of attention-based pooling in combination with multiple pooling modules helps the model focus on salient event segments, improving detection accuracy.
The overall system achieves state-of-the-art performance on DCASE 2020 Task 4, with quantitative improvements over prior methods, particularly in challenging detection scenarios.
The experimental results validate that MBL reduces overfitting and enhances generalization by encouraging the feature encoder to model diverse aspects of the input data.

Better researchstarts right now

From paper design to paper writing, dramatically reduce your research time.

No credit card · Free plan available

This review was created by AI and reviewed by human editors.