QUICK REVIEW

[Paper review] HRFuser: A Multi-resolution Sensor Fusion Architecture for 2D Object Detection

Tim Broedermann, Christos Sakaridis|arXiv (Cornell University)|Jun 30, 2022

Advanced Neural Network Applications | Cited by 3

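One-sentence summary

HRFuser is a modular, multi-resolution sensor fusion architecture for 2D object detection that fuses camera, lidar, radar, and gated camera inputs through a novel multi-window cross-attention (MWCA) block. The model preserves high-resolution features throughout the network and achieves state-of-the-art performance on the nuScenes and DENSE datasets, while fusing one additional modality adds only 9.7% FLOPs and 1.9% parameters.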

ABSTRACT

Besides standard cameras, autonomous vehicles typically include multiple additional sensors, such as lidars and radars, which help acquire richer information for perceiving the content of the driving scene. While several recent works focus on fusing certain pairs of sensors - such as camera with lidar or radar - by using architectural components specific to the examined setting, a generic and modular sensor fusion architecture is missing from the literature. In this work, we propose HRFuser, a modular architecture for multi-modal 2D object detection. It fuses multiple sensors in a multi-resolution fashion and scales to an arbitrary number of input modalities. The design of HRFuser is based on state-of-the-art high-resolution networks for image-only dense prediction and incorporates a novel multi-window cross-attention block as the means to perform fusion of multiple modalities at multiple resolutions. We demonstrate via extensive experiments on nuScenes and the adverse conditions DENSE datasets that our model effectively leverages complementary features from additional modalities, substantially improving upon camera-only performance and consistently outperforming state-of-the-art 3D and 2D fusion methods evaluated on 2D object detection metrics. The source code is publicly available.

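Research motivation and goals

Address the lack of a generic, modular sensor fusion architecture for multi-modal 2D object detection in autonomous driving.
Improve robustness in adverse weather, where camera-only models fail because of poor visibility and missing depth information.
Enable scalable fusion of an arbitrary number of sensors (such as lidar, radar, and gated camera) without designing dedicated components for each modality.
Maintain high-resolution feature representations throughout the network to preserve the fine spatial detail that dense prediction tasks need.
Develop an efficient fusion mechanism that reduces the noise introduced by low-quality sensors such as radar while exploiting complementary features from all modalities.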

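Proposed method

HRFuser extends the high-resolution network paradigm to multi-modal input by keeping high-resolution features in the main camera branch and adding a lightweight high-resolution branch for each secondary modality.
The core fusion mechanism is the multi-window cross-attention (MWCA) block, which applies cross-attention within non-overlapping spatial windows, avoiding the quadratic complexity of global attention and enabling efficient multi-resolution fusion.
Fusion is performed at multiple feature levels and resolutions of the camera backbone, giving a hierarchical, multi-scale integration of multi-modal features.
Each secondary modality is processed by a lightweight, modality-specific encoder and then fused with the camera features through MWCA.
The architecture is modular: adding a new sensor only requires one new lightweight branch and its MWCA blocks, with no redesign of the overall architecture.
The model is trained end-to-end with a standard 2D detection head (such as CenterNet) on top of the multi-modal features, with a loss function optimized for detection performance.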
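The windowed cross-attention idea described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: it assumes a single head, identity query/key/value projections, a residual connection, and one secondary modality. The names `window_cross_attention` and `attention_pairs` are hypothetical and chosen only for this illustration.

```python
import math


def window_cross_attention(cam, aux, win):
    """Fuse `aux` into `cam` with cross-attention restricted to windows.

    cam, aux: H x W grids of C-dimensional feature vectors (nested lists).
    Each camera position (query) attends only to the auxiliary positions
    (keys/values) inside the same non-overlapping win x win window.
    ASSUMPTION: identity projections and a single head, for illustration.
    """
    H, W, C = len(cam), len(cam[0]), len(cam[0][0])
    assert H % win == 0 and W % win == 0, "grid must tile evenly into windows"
    out = [[None] * W for _ in range(H)]
    for top in range(0, H, win):
        for left in range(0, W, win):
            # All positions belonging to the current window.
            cells = [(r, c)
                     for r in range(top, top + win)
                     for c in range(left, left + win)]
            for r, c in cells:
                q = cam[r][c]
                # Scaled dot-product scores against keys in the same window.
                scores = [sum(qi * ki for qi, ki in zip(q, aux[kr][kc]))
                          / math.sqrt(C) for kr, kc in cells]
                m = max(scores)  # subtract the max for numerical stability
                weights = [math.exp(s - m) for s in scores]
                total = sum(weights)
                # Softmax-weighted sum of the auxiliary values.
                attended = [sum(w * aux[kr][kc][d]
                                for w, (kr, kc) in zip(weights, cells)) / total
                            for d in range(C)]
                # Residual connection keeps the camera features dominant.
                out[r][c] = [qd + ad for qd, ad in zip(q, attended)]
    return out


def attention_pairs(H, W, win=None):
    """Number of query-key score entries: global if win is None, else windowed."""
    n = H * W
    return n * n if win is None else n * win * win
```

For a 64 x 64 feature map, `attention_pairs(64, 64)` gives 16777216 score entries for global attention, while `attention_pairs(64, 64, win=8)` gives 262144, a 64-fold reduction. This is the sense in which restricting attention to windows sidesteps the quadratic cost in the number of positions.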

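Experimental results

Research questions

RQ1: Can a generic, modular sensor fusion architecture effectively improve 2D object detection across multiple sensor modalities and under adverse conditions?
RQ2: Does multi-resolution, multi-level fusion with a novel attention mechanism outperform existing early, late, or intermediate fusion strategies for 2D detection?
RQ3: To what extent can noisy sensors such as radar improve detection when they are fused with high-resolution camera features through an efficient attention mechanism?
RQ4: How does computational cost change as more sensors are added, and can the model maintain real-time inference efficiency?
RQ5: Under extreme conditions that lack 3D annotations (such as dense fog), can the model generalize while relying only on 2D annotations?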

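Key findings

On the nuScenes test set with all four modalities (RGB, lidar, radar, gated camera), HRFuser reaches 90.15% AP, significantly outperforming the camera-only HRFormer-T (26.5% AP) and BEVFusion (31.5% AP).
On the dense fog split of the DENSE dataset, HRFuser reaches 89.62% AP, far ahead of the camera-only HRFormer-T (78.68% AP) and other state-of-the-art 3D fusion methods evaluated in 2D.
Adding a single modality (such as lidar or radar) increases FLOPs by only 9.7% and parameters by only 1.9%, which demonstrates high computational efficiency.
On the DENSE dataset, the MWCA block improves AP by 1.7% over standard cross-attention (CA) and by 2.0% over PVTv2-Li-CA, showing that it is effective at filtering noise and attending to the relevant features.
Qualitative results show that HRFuser detects occluded or distant vehicles that HRFormer-T misses, especially in dense fog and snowfall, indicating stronger robustness.
Ablation studies confirm that multi-resolution, multi-level fusion combined with MWCA is essential: removing the block lowers performance on nuScenes by more than 1.5 AP points.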
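To make the reported per-modality overhead concrete, here is a back-of-the-envelope sketch in Python. It assumes that every additional modality adds the same fixed fraction of the camera-only baseline (9.7% FLOPs, 1.9% parameters). That linear scaling is an assumption made here for illustration, not a result stated in the review, and `relative_cost` is a hypothetical helper.

```python
FLOPS_INCREASE = 0.097  # reported FLOPs overhead for one added modality
PARAM_INCREASE = 0.019  # reported parameter overhead for one added modality


def relative_cost(extra_modalities, increase_per_modality):
    """Cost relative to the camera-only baseline (1.0 means baseline cost).

    ASSUMPTION: each extra modality adds the same fixed fraction of the
    baseline, i.e. the overhead grows linearly with the number of sensors.
    """
    if extra_modalities < 0:
        raise ValueError("extra_modalities must be non-negative")
    return 1.0 + extra_modalities * increase_per_modality


for n in range(4):
    flops = relative_cost(n, FLOPS_INCREASE)
    params = relative_cost(n, PARAM_INCREASE)
    print(f"{n} extra modalities: {flops:.3f}x FLOPs, {params:.3f}x parameters")
```

Under this assumption, three secondary modalities would cost roughly 1.291x the baseline FLOPs and 1.057x the baseline parameters. The actual figures depend on each branch's input resolution and encoder, so they should be taken from the paper's own tables.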


This review was generated by AI and checked by a human editor.