QUICK REVIEW

[Paper Review] Synergy between human and machine approaches to sound/scene recognition and processing: An overview of ICASSP special session

Laurie M. Heller, Benjamin Elizalde|arXiv (Cornell University)|Feb 20, 2023

Music and Audio Processing | Cited by 10

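One-Sentence Summary

This paper summarizes the hybrid human-machine approaches presented at an ICASSP 2023 special session, reviewing how human knowledge of semantics, perception, and cognition informs machine listening, and vice versa.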

ABSTRACT

Machine Listening, as usually formalized, attempts to perform a task that is, from our perspective, fundamentally human-performable, and performed by humans. Current automated models of Machine Listening vary from purely data-driven approaches to approaches imitating human systems. In recent years, the most promising approaches have been hybrid in that they have used data-driven approaches informed by models of the perceptual, cognitive, and semantic processes of the human system. Not only does the guidance provided by models of human perception and domain knowledge enable better, and more generalizable Machine Listening, in the converse, the lessons learned from these models may be used to verify or improve our models of human perception themselves. This paper summarizes advances in the development of such hybrid approaches, ranging from Machine Listening models that are informed by models of peripheral (human) auditory processes, to those that employ or derive semantic information encoded in relations between sounds. The research described herein was presented in a special session on "Synergy between human and machine approaches to sound/scene recognition and processing" at the 2023 ICASSP meeting.

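Research Motivation and Objectives

Motivate how human perception, cognition, and semantics can guide machine listening to improve generalization and efficiency.
Survey hybrid data-driven approaches that incorporate human-centered knowledge into speech and sound understanding.
Explore how machine listening can inform and validate research on human hearing.
Identify the advantages and challenges of using semantic and perceptual information in sound understanding.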

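Proposed Method

Synthesizes research findings on hybrid approaches to sound/scene recognition from the ICASSP 2023 special session.
Discusses two hybrid paradigms: improving machine understanding through semantic/perceptual analysis, and evaluating human hearing through data-driven models.
Gives examples where semantic embeddings and ontologies benefit models (e.g., SemDNN, ontology-based GCN), and where perceptual metrics guide synthesis and evaluation.
Highlights studies that compare machine and human listening in tasks such as speaker verification and sound matching.
Concludes with the advantages of hybrid approaches and directions for future work.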

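Experimental Results

Research Questions

RQ1: What are the advantages and limitations of incorporating human semantic and perceptual knowledge into machine listening models?
RQ2: How do semantic embeddings and ontologies improve sound event recognition and synthesis tasks?
RQ3: To what extent can data-driven models predict or evaluate human hearing outcomes?
RQ4: What challenges arise when perception-based metrics are used to drive improvements in machine listening?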

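Main Findings

Hybrid approaches that exploit semantic information can improve semantic labeling of sounds and dissimilarity prediction.
Ontologies in neural networks do not always improve weakly labeled sound event classification.
Perceptual analysis can align machine speaker embeddings with human judgments in one-shot verification.
Perceptual loss functions can guide audio synthesis to better match target sounds without increasing training time.
Data-driven models can identify key factors in human speech perception and highlight the importance of mid-frequency hearing.
Computational models of auditory perception can support evaluation and personalization in HRTF and spatial audio scenarios.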
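
To make the first finding concrete, the sketch below shows one generic way a semantic embedding can yield a dissimilarity score between sound labels: cosine dissimilarity between label vectors. This is an illustrative assumption, not the method of any paper in the session. The three-dimensional vectors, the `EMBEDDINGS` table, and both function names are hypothetical and invented for this example; a real system would use learned word or audio embeddings.

```python
import math

# Hypothetical toy label embeddings, invented for illustration only.
# Real systems would use learned word or audio embeddings instead.
EMBEDDINGS = {
    "dog bark": [0.9, 0.1, 0.0],
    "wolf howl": [0.8, 0.3, 0.1],
    "car horn": [0.0, 0.2, 0.9],
}


def cosine_dissimilarity(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def rank_by_dissimilarity(query, embeddings):
    """Return the other labels sorted from most to least similar to `query`."""
    scores = [
        (label, cosine_dissimilarity(embeddings[query], vec))
        for label, vec in embeddings.items()
        if label != query
    ]
    return sorted(scores, key=lambda pair: pair[1])


if __name__ == "__main__":
    for label, score in rank_by_dissimilarity("dog bark", EMBEDDINGS):
        print(f"{label}: {score:.3f}")
```

With these toy vectors, "wolf howl" ranks as less dissimilar to "dog bark" than "car horn" does, which is the kind of ordering a semantically informed model would be expected to reproduce against human dissimilarity ratings.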

This review was generated by AI and reviewed by human editors.