QUICK REVIEW

[Paper Review] Cross-Task Transfer for Multimodal Aerial Scene Recognition

Di Hu, Xuhong Li|arXiv (Cornell University)|May 18, 2020

Speech and Audio Processing | Cited by 6

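One-sentence summary

This paper proposes a cross-task transfer learning approach from audio event recognition to aerial scene recognition, built on ADVANCE, a new multimodal dataset that pairs aerial images with geotagged sounds. By exploiting sound cues that co-occur with particular land-cover types, the authors significantly improve aerial scene classification through audiovisual knowledge distillation and contrastive learning, and report state-of-the-art results on the ADVANCE dataset.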

ABSTRACT

Aerial scene recognition is a fundamental task in remote sensing and has recently received increased interest. While the visual information from overhead images with powerful models and efficient algorithms yields considerable performance on scene recognition, it still suffers from the variation of ground objects, lighting conditions etc. Inspired by the multi-channel perception theory in cognition science, in this paper, for improving the performance on the aerial scene recognition, we explore a novel audiovisual aerial scene recognition task using both images and sounds as input. Based on an observation that some specific sound events are more likely to be heard at a given geographic location, we propose to exploit the knowledge from the sound events to improve the performance on the aerial scene recognition. For this purpose, we have constructed a new dataset named AuDio Visual Aerial sceNe reCognition datasEt (ADVANCE). With the help of this dataset, we evaluate three proposed approaches for transferring the sound event knowledge to the aerial scene recognition task in a multimodal learning framework, and show the benefit of exploiting the audio information for the aerial scene recognition. The source code is publicly available for reproducibility purposes.

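Research motivation and objectives

Address the limitations of image-only aerial scene recognition under varying lighting and ground-object conditions.
Investigate whether audio event information can improve the robustness and accuracy of aerial scene classification.
Develop a multimodal learning framework that transfers knowledge from audio event recognition to aerial scene recognition.
Build a new benchmark dataset, ADVANCE, to support research on audiovisual aerial scene recognition.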

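Proposed method

Introduces a new audiovisual aerial scene recognition task that combines overhead images with geotagged sound events.
Builds the ADVANCE dataset, which contains paired aerial images and corresponding sound recordings from different geographic regions.
Designs three cross-task transfer learning approaches: audio-guided knowledge distillation, contrastive learning with audio supervision, and early fusion of image and audio features.
Trains multimodal models on image and audio inputs, using the audio signal to guide feature learning in the visual modality.
Uses sound event embeddings as a weak supervision signal to improve visual representation learning under low-resource or visually challenging conditions.
Releases the source code publicly to ensure reproducibility and encourage community adoption.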
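
The transfer strategies listed above are named in this summary but not specified in detail. The following is a minimal sketch, not the authors' implementation, assuming a standard temperature-softened KL distillation loss (an audio-event "teacher" guiding an image "student") and plain feature concatenation for early fusion. All function names, the temperature value, and the toy vectors are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert raw scores to a probability distribution (numerically stable)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Assumed form: KL between softened teacher and student distributions.

    teacher_logits: sound-event scores from an audio model (hypothetical).
    student_logits: sound-event scores predicted from the image branch.
    """
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return kl_divergence(teacher, student)


def early_fusion(image_features, audio_features):
    """Assumed form of early fusion: concatenate the two feature vectors."""
    return list(image_features) + list(audio_features)


if __name__ == "__main__":
    # Toy scores over three hypothetical sound-event classes.
    audio_teacher = [2.0, 0.5, -1.0]
    image_student = [1.0, 1.0, 0.0]
    print("distillation loss:", distillation_loss(audio_teacher, image_student))
    print("fused vector:", early_fusion([0.1, 0.2], [0.3]))
```

In a training loop, a term like `distillation_loss` would typically be added to the ordinary scene-classification loss with a weighting factor; that weight is a tuning choice this summary does not specify.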

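Experimental results

Research questions

RQ1: Can audio event information improve aerial scene recognition accuracy under varying visual conditions?
RQ2: How effective is cross-task transfer from audio event recognition to aerial scene recognition?
RQ3: Which multimodal learning strategy is most effective for fusing image and audio signals in aerial scene classification?
RQ4: Does adding audio signals reduce sensitivity to visual domain shift, such as changes in lighting and ground objects?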

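Main findings

The proposed audiovisual learning framework significantly outperforms image-only baselines on the ADVANCE dataset.
Audio-guided knowledge distillation yields the most consistent gains across scene categories.
Contrastive learning with audio supervision improves feature generalization, particularly under low-visibility conditions.
Adding audio signals significantly reduces error rates in visually cluttered scenes such as urban areas.
The ADVANCE dataset opens a new direction for multimodal remote sensing research and provides a strong benchmark for future work.
The public release of code and data supports reproducibility and accelerates progress in audiovisual scene understanding.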
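
The contrastive objective mentioned in the findings is not defined in this summary. As an illustration only, here is a minimal sketch assuming an InfoNCE-style loss in which each image embedding should be most similar to the audio embedding recorded at the same location. The function names, the temperature, and the toy embeddings are hypothetical and are not taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce_loss(image_embeddings, audio_embeddings, temperature=0.1):
    """Assumed InfoNCE-style loss over a batch of paired embeddings.

    Pair i (image i, audio i) is the positive; every other audio clip in
    the batch acts as a negative for image i.
    """
    n = len(image_embeddings)
    total = 0.0
    for i in range(n):
        logits = [
            cosine_similarity(image_embeddings[i], audio_embeddings[j]) / temperature
            for j in range(n)
        ]
        peak = max(logits)
        # Log-sum-exp computed stably, then subtract the positive logit.
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / n


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    matched_audio = [[1.0, 0.0], [0.0, 1.0]]
    swapped_audio = [[0.0, 1.0], [1.0, 0.0]]
    print("matched pairs:", info_nce_loss(images, matched_audio))
    print("swapped pairs:", info_nce_loss(images, swapped_audio))
```

The loss is small when each image lines up with its own audio clip and large when the pairs are swapped, which is the behavior a contrastive audiovisual objective is meant to encourage.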

This review was generated by AI and checked by a human editor.