QUICK REVIEW

[Paper Review] SoundSpaces 2.0: A Simulation Platform for Visual-Acoustic Learning

Changan Chen, Carl Schissler|arXiv (Cornell University)|Jun 16, 2022

Speech and Audio Processing | Cited by 22

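One-Sentence Summary

SoundSpaces 2.0 is a geometry-based audio rendering platform that renders realistic acoustics on the fly for arbitrary 3D environments, with continuous spatial sampling, configurable materials and microphones, and sim2real evaluation on audio-visual tasks.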

ABSTRACT

We introduce SoundSpaces 2.0, a platform for on-the-fly geometry-based audio rendering for 3D environments. Given a 3D mesh of a real-world environment, SoundSpaces can generate highly realistic acoustics for arbitrary sounds captured from arbitrary microphone locations. Together with existing 3D visual assets, it supports an array of audio-visual research tasks, such as audio-visual navigation, mapping, source localization and separation, and acoustic matching. Compared to existing resources, SoundSpaces 2.0 has the advantages of allowing continuous spatial sampling, generalization to novel environments, and configurable microphone and material properties. To our knowledge, this is the first geometry-based acoustic simulation that offers high fidelity and realism while also being fast enough to use for embodied learning. We showcase the simulator's properties and benchmark its performance against real-world audio measurements. In addition, we demonstrate two downstream tasks -- embodied navigation and far-field automatic speech recognition -- and highlight sim2real performance for the latter. SoundSpaces 2.0 is publicly available to facilitate wider research for perceptual systems that can both see and hear.

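Motivation and Goals

Enable on-the-fly, geometry-based audio rendering that matches the visual environment.
Generalize audio simulation to arbitrary 3D meshes and novel environments.
Provide configurable microphone setups and material properties for realistic acoustics.
Evaluate realism against real-world measurements, and evaluate sim2real performance on downstream tasks (audio-visual navigation, far-field ASR).
Release a large-scale visual-acoustic dataset (SoundSpaces-PanoIR) to support audio-visual research on perceptual systems.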

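Proposed Method

Audio propagation based on bidirectional path tracing, which computes the room impulse response (RIR) for a given source position, receiver position, and scene geometry.
Frequency-dependent rendering over configurable frequency bands, combining energy-time histograms with spherical harmonics to represent the directional energy distribution.
Spatialization of the received signal into binaural (via HRTF) or omnidirectional formats.
Acoustic continuity modeling for continuously moving sources and listeners, with crossfading between adjacent observations.
Configurable simulation: sample rate, frequency bands, number of rays, diffraction/reflection/transmission, microphone type, and loadable HRTFs.
Material modeling with 29 built-in acoustic materials and frequency-dependent absorption, scattering, and transmission, plus air absorption and distance-dependent attenuation.
Two rendering modes (high-speed and high-quality) that trade efficiency against maximum fidelity through the number of rays and the reuse of previously computed IRs; parallel multithreading improves performance.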
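To make the role of the RIR and of crossfading concrete, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not the simulator's implementation: the function names `convolve` and `crossfade_render` are hypothetical, the crossfade uses a simple linear ramp over the whole rendered chunk, and a real pipeline would normally use FFT-based convolution from NumPy or SciPy rather than nested loops.

```python
def convolve(signal, ir):
    """Direct-form linear convolution of a dry signal with an impulse response.

    The output has len(signal) + len(ir) - 1 samples.
    """
    out = [0.0] * (len(signal) + len(ir) - 1)
    for i, s in enumerate(signal):
        if s == 0.0:
            continue  # skip silent samples, they contribute nothing
        for j, h in enumerate(ir):
            out[i + j] += s * h
    return out


def crossfade_render(signal, ir_prev, ir_curr):
    """Render one audio chunk while the IR changes from ir_prev to ir_curr.

    Assumption for this sketch: the chunk is convolved with both IRs and the
    two results are blended with a linear ramp, so the output starts with the
    old acoustics and ends with the new ones.
    """
    wet_prev = convolve(signal, ir_prev)
    wet_curr = convolve(signal, ir_curr)

    # Zero-pad the shorter result so both have the same length.
    n = max(len(wet_prev), len(wet_curr))
    wet_prev += [0.0] * (n - len(wet_prev))
    wet_curr += [0.0] * (n - len(wet_curr))

    out = []
    for k in range(n):
        w = k / (n - 1) if n > 1 else 1.0  # ramp weight from 0 to 1
        out.append((1.0 - w) * wet_prev[k] + w * wet_curr[k])
    return out


if __name__ == "__main__":
    dry = [1.0, 0.0, 0.0, 0.0]      # a single click
    ir_a = [1.0, 0.5]               # IR at the previous agent pose (made up)
    ir_b = [0.5, 0.25]              # IR at the current agent pose (made up)
    print(crossfade_render(dry, ir_a, ir_b))
```

Blending the two convolved signals, rather than switching IRs abruptly, avoids audible clicks when the agent moves between positions, which is the motivation for acoustic continuity described above.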

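Experimental Results

Research Questions

RQ1: How accurately does SoundSpaces 2.0's audio-visual simulation match real-world measurements?
RQ2: How well do machine learning models trained in SoundSpaces 2.0 generalize to real data (sim2real), particularly for tasks such as continuous audio-visual navigation and far-field ASR?
RQ3: Does acoustic randomization improve sim2real generalization on downstream audio-visual tasks?
RQ4: Can SoundSpaces 2.0 render audio for arbitrary new environments and support continuous spatial sampling beyond a discrete grid?
RQ5: How do continuous (as opposed to discrete) acoustics affect audio-visual navigation performance and realism?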

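Key Findings

SoundSpaces 2.0 agrees with real measurements more closely than its predecessor SoundSpaces, with a particularly large gain in direct-to-reverberant ratio (DRR) accuracy (mean DRR error drops from 11.0 dB to 0.98 dB).
In the speed-accuracy trade-off, high-speed rendering is about 8x faster than high-quality rendering with a single thread and about 33x faster with 5 threads, at an RT60 error of about 9.5% versus 0.0% for high-quality mode; downstream navigation performance remains competitive.
Continuous acoustics make audio-visual navigation more realistic: agents trained with SoundSpaces 2.0 outperform baselines that use discrete space or lack acoustic continuity, which shows the importance of joint spatial and acoustic continuity.
For far-field ASR, fine-tuning on SoundSpaces 2.0 IRs yields a lower word error rate (WER) than the baseline (for example, 12.48% WER with SoundSpaces 2.0 versus 29.10% for the pretrained model, though this remains higher than fine-tuning on some real IRs); acoustic randomization further reduces WER to 12.04%.
The authors release SoundSpaces-PanoIR: ten million panoramic image-IR pairs covering 750 environments (Gibson, Matterport3D, HM3D), to support visual-acoustic learning.
SoundSpaces 2.0 generalizes to arbitrary meshes (Gibson, HM3D, Ego4D, Matterport3D, Replica) and supports configurable microphone arrays and materials, making it usable for a broader range of research.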
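The findings above are reported in terms of RT60 and DRR. As a rough illustration of how these two metrics can be computed from an impulse response, here is a pure-Python sketch. It is not the paper's evaluation code: the function names are hypothetical, RT60 is estimated by fitting a line to the Schroeder decay curve between -5 dB and -25 dB and extrapolating to 60 dB, and the 2.5 ms window around the strongest peak that counts as "direct" sound is an assumed convention.

```python
import math


def rt60_seconds(ir, sample_rate, start_db=-5.0, end_db=-25.0):
    """Estimate RT60 from the Schroeder energy decay curve of an IR."""
    # Schroeder backward integration: energy remaining from each sample onward.
    edc = [0.0] * len(ir)
    acc = 0.0
    for i in range(len(ir) - 1, -1, -1):
        acc += ir[i] * ir[i]
        edc[i] = acc
    if not edc or edc[0] <= 0.0:
        raise ValueError("impulse response has no energy")

    # Decay curve in dB relative to the total energy (zero-energy tail dropped).
    edc_db = [10.0 * math.log10(e / edc[0]) for e in edc if e > 0.0]

    # Keep only the points inside the fitting range.
    pts = [(i / sample_rate, d) for i, d in enumerate(edc_db)
           if end_db <= d <= start_db]
    if len(pts) < 2:
        raise ValueError("not enough decay range to fit a slope")

    # Least-squares line fit: decay slope in dB per second.
    n = len(pts)
    mean_t = sum(t for t, _ in pts) / n
    mean_d = sum(d for _, d in pts) / n
    slope = (sum((t - mean_t) * (d - mean_d) for t, d in pts)
             / sum((t - mean_t) ** 2 for t, _ in pts))
    return -60.0 / slope


def drr_db(ir, sample_rate, direct_window_ms=2.5):
    """Direct-to-reverberant ratio in dB.

    Assumption: energy within +/- direct_window_ms of the strongest peak is
    direct sound, and everything after that window is reverberant.
    """
    peak = max(range(len(ir)), key=lambda i: abs(ir[i]))
    half = int(sample_rate * direct_window_ms / 1000.0)
    lo, hi = max(0, peak - half), min(len(ir), peak + half + 1)
    direct = sum(x * x for x in ir[lo:hi])
    reverb = sum(x * x for x in ir[hi:])
    if reverb <= 0.0:
        raise ValueError("no reverberant energy after the direct window")
    return 10.0 * math.log10(direct / reverb)


if __name__ == "__main__":
    fs = 1000
    true_rt60 = 0.5
    # Synthetic exponentially decaying IR whose energy drops 60 dB in 0.5 s.
    rate = math.log(1e6) / (2.0 * true_rt60)
    ir = [math.exp(-rate * k / fs) for k in range(2 * fs)]
    print(rt60_seconds(ir, fs), drr_db(ir, fs))
```

Comparing these values between a simulated IR and a measured IR from the same room gives the kind of relative RT60 error and absolute DRR error (in dB) quoted in the findings.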


This review was generated by AI and reviewed by a human editor.