QUICK REVIEW

[Paper Review] SONYC-UST-V2: An Urban Sound Tagging Dataset with Spatiotemporal Context

Mark Cartwright, Jason Cramer|arXiv (Cornell University)|Sep 11, 2020

Music and Audio Processing | References: 12 | Citations: 30

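One-line summary

SONYC-UST-V2 provides 18,510 annotated 10-second urban audio recordings for multilabel sound tagging, each with spatiotemporal context (STC) metadata (sensor location and timestamp), together with baseline experiments that use this STC information.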

ABSTRACT

We present SONYC-UST-V2, a dataset for urban sound tagging with spatiotemporal information. This dataset is aimed for the development and evaluation of machine listening systems for real-world urban noise monitoring. While datasets of urban recordings are available, this dataset provides the opportunity to investigate how spatiotemporal metadata can aid in the prediction of urban sound tags. SONYC-UST-V2 consists of 18510 audio recordings from the "Sounds of New York City" (SONYC) acoustic sensor network, including the timestamp of audio acquisition and location of the sensor. The dataset contains annotations by volunteers from the Zooniverse citizen science platform, as well as a two-stage verification with our team. In this article, we describe our data collection procedure and propose evaluation metrics for multilabel classification of urban sound tags. We report the results of a simple baseline model that exploits spatiotemporal information.

Motivation and objectives

Present SONYC-UST-V2, a large urban sound tagging dataset with spatiotemporal context (STC).
Provide metadata linking each recording to NYC sensor location and acquisition hour to enable context-aware modeling.
Describe data collection, the annotation workflow (crowdsourced, then verified), and dataset splits designed around sensor disjointness and temporal displacement.
Propose evaluation metrics for multilabel tagging at coarse and fine taxonomy levels and establish a baseline incorporating STC.

Proposed method

Multi-label neural baseline using OpenL3 audio embeddings for content features.
Concatenate audio embeddings with spatial (latitude, longitude) and temporal (hour of day, day of week, week of year) context.
Train a multilayer perceptron with a single hidden layer and AutoPool for frame aggregation.
Handle incomplete fine-level annotations during evaluation by coarsening the affected fine tags to their parent coarse tag.
Compare baseline performance with and without spatiotemporal context (STC).
Use minority vote aggregation for training labels in the presence of crowdsourced weak annotations.
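
The three data-handling steps above (minority-vote label aggregation, STC feature concatenation, and AutoPool aggregation) can be sketched in plain Python. This is a minimal illustration, not the authors' baseline code. The function names, the one-hot encoding of the temporal fields, the zero-based indices, and the 53-slot week vector are all assumptions made for this sketch. The AutoPool formula follows the usual softmax-weighted definition, where alpha = 0 gives mean pooling and a large alpha approaches max pooling.

```python
import math


def minority_vote(annotations):
    """Aggregate crowdsourced labels: a tag is positive if ANY annotator marked it.

    `annotations` is a list of per-annotator binary tag vectors for one clip.
    """
    n_tags = len(annotations[0])
    return [int(any(a[t] for a in annotations)) for t in range(n_tags)]


def one_hot(index, size):
    """Return a one-hot list of length `size` with a 1.0 at `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def stc_features(embedding, latitude, longitude, hour, day, week):
    """Concatenate an audio embedding with spatiotemporal context.

    Assumed (hypothetical) encoding: raw latitude/longitude, then one-hot
    hour of day (0-23), day of week (0-6), and week of year (0-52).
    """
    return (
        list(embedding)
        + [latitude, longitude]
        + one_hot(hour, 24)
        + one_hot(day, 7)
        + one_hot(week, 53)
    )


def autopool(frame_probs, alpha):
    """Softmax-weighted pooling of frame-level probabilities for one tag."""
    # Subtract the maximum exponent for numerical stability.
    shift = max(alpha * p for p in frame_probs)
    weights = [math.exp(alpha * p - shift) for p in frame_probs]
    total = sum(weights)
    return sum(p * w for p, w in zip(frame_probs, weights)) / total
```

In the described baseline the pooling parameter would be learned jointly with the MLP; here it is passed in explicitly so the limiting cases (mean and max) are easy to check by hand.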

Experimental results

Research questions

RQ1: Can spatiotemporal context improve urban sound tagging performance on real-world sensor data?
RQ2: How do coarse and fine tag predictions differ in performance when leveraging STC?
RQ3: What are effective strategies to train and evaluate multilabel urban sound tagging with noisy crowdsourced annotations?
RQ4: How do verified vs. crowdsourced annotations influence model performance and generalization across sensors?
RQ5: What is the impact of dataset disjointness by sensor and temporal displacement on generalization?

Key findings

SONYC-UST-V2 contains 18,510 annotated recordings from 56 sensors (2016–2019).
Annotations include 23 fine-grained tags and 8 coarse tags, with spatiotemporal metadata (block-level location and hourly timestamps).
A simple baseline using STC shows limited improvement over non-STC in this setup, highlighting the need for more sophisticated STC methods.
Two-tier evaluation (coarse and fine) uses macro-AUPRC, micro-AUPRC, and LWLRAP metrics, with a coarsening approach for uncertain fine labels.
The dataset includes both crowdsourced and verified annotations, enabling exploration of annotation aggregation and reliability modeling.
The dataset design keeps training and validation sensors disjoint and uses a temporally displaced test set, so that generalization across sensors and over time can be assessed.
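
To make the difference between the macro- and micro-averaged AUPRC metrics mentioned above concrete, here is a small self-contained sketch. It uses average precision as a step-wise estimate of the area under the precision-recall curve. The official evaluation code may integrate the curve differently, and this sketch does not treat tied scores specially. All function names are hypothetical.

```python
def average_precision(y_true, y_score):
    """Step-wise estimate of the area under the precision-recall curve."""
    n_pos = sum(y_true)
    if n_pos == 0:
        return float("nan")  # undefined when a tag has no positive examples
    # Rank items from highest to lowest score.
    order = sorted(range(len(y_score)), key=lambda i: -y_score[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i]:
            hits += 1
            total += hits / rank  # precision at each recalled positive
    return total / n_pos


def macro_auprc(Y_true, Y_score):
    """Compute AP per tag, then average: every tag counts equally."""
    n_tags = len(Y_true[0])
    per_tag = [
        average_precision([row[t] for row in Y_true], [row[t] for row in Y_score])
        for t in range(n_tags)
    ]
    return sum(per_tag) / n_tags


def micro_auprc(Y_true, Y_score):
    """Pool all (clip, tag) decisions, then compute one AP: frequent tags dominate."""
    flat_true = [v for row in Y_true for v in row]
    flat_score = [v for row in Y_score for v in row]
    return average_precision(flat_true, flat_score)


# Three clips, two tags (rows are clips, columns are tags).
Y_true = [[1, 0], [0, 1], [1, 0]]
Y_score = [[0.9, 0.2], [0.8, 0.7], [0.3, 0.1]]
print(macro_auprc(Y_true, Y_score))  # 11/12, about 0.917
print(micro_auprc(Y_true, Y_score))  # 29/36, about 0.806
```

The macro average treats rare and common tags alike, which matters for an imbalanced urban sound taxonomy, while the micro average reflects overall decision quality across all clip-tag pairs.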

This review was generated by AI and checked by a human editor.