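[Paper Walkthrough] Learning to Detect Objects with a 1 Megapixel Event Camera
This paper presents an object detector for a high-resolution 1 Mpx event camera built on a recurrent ConvLSTM architecture, releases a large-scale 1 Mpx automotive detection dataset, and reaches parity with frame-based detectors without reconstructing gray-level images.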
Event cameras encode visual information with high temporal precision, low data-rate, and high-dynamic range. Thanks to these characteristics, event cameras are particularly suited for scenarios with high motion, challenging lighting conditions and requiring low latency. However, due to the novelty of the field, the performance of event-based systems on many vision tasks is still lower compared to conventional frame-based solutions. The main reasons for this performance gap are: the lower spatial resolution of event sensors, compared to frame cameras; the lack of large-scale training datasets; the absence of well established deep learning architectures for event-based processing. In this paper, we address all these problems in the context of an event-based object detection task. First, we publicly release the first high-resolution large-scale dataset for object detection. The dataset contains more than 14 hours recordings of a 1 megapixel event camera, in automotive scenarios, together with 25M bounding boxes of cars, pedestrians, and two-wheelers, labeled at high frequency. Second, we introduce a novel recurrent architecture for event-based detection and a temporal consistency loss for better-behaved training. The ability to compactly represent the sequence of events into the internal memory of the model is essential to achieve high accuracy. Our model outperforms by a large margin feed-forward event-based architectures. Moreover, our method does not require any reconstruction of intensity images from events, showing that training directly from raw events is possible, more efficient, and more accurate than passing through an intermediate intensity image. Experiments on the dataset introduced in this work, for which events and gray level images are available, show performance on par with that of highly tuned and studied frame-based detectors.
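Motivation and Objectives
- Release the first large-scale, high-resolution (1 megapixel) event-based object detection dataset, recorded in automotive scenarios and containing 25M bounding boxes.
- Develop a recurrent architecture with memory that detects objects directly from raw events, without reconstructing intensity frames.
- Introduce a temporal consistency loss to make localization more stable over time.
- Show that event-based detection can match frame-based detectors on a large-scale task.
- Provide ablation studies and benchmark against state-of-the-art event-based and frame-based detectors.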
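Proposed Method
- Preprocess the events in each time interval into a dense tensor H_k of shape C x M x N.
- Extract features from H_k with a feed-forward CNN that uses Squeeze-and-Excitation blocks.
- Add ConvLSTM layers to form a spatio-temporal detector with memory.
- Attach an SSD-style regression/classification head to the multi-scale features produced by the recurrent layers.
- Train with a loss that combines a regression term L_r (smooth L1), a classification term L_c (softmax focal loss), and a temporal consistency term L_t, for which two regression heads predict B_k and B'_{k+1}.
- Optionally, pair the recurrent feature extractor with other detector families (for example, RetinaNet).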
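The first step, turning a raw event stream into one dense tensor H_k per time interval, can be illustrated with a small sketch. It assumes the simplest possible encoding, a per-polarity event count (so C = 2), with M as the image height and N as the width; the summary above does not say which encoding the authors use. The event tuple layout and the names `events_to_histogram` and `slice_stream` are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple

# Assumed event layout: (x, y, polarity in {0, 1}, timestamp in seconds).
Event = Tuple[int, int, int, float]
Tensor3D = List[List[List[int]]]  # indexed as [channel][row][column]


def events_to_histogram(events: Sequence[Event], t_start: float, t_end: float,
                        height: int, width: int) -> Tensor3D:
    """Count the events in [t_start, t_end) into a 2 x height x width tensor.

    Channel 0 holds negative-polarity counts, channel 1 positive-polarity counts.
    This count histogram is an assumption, one of several possible encodings.
    """
    hist = [[[0] * width for _ in range(height)] for _ in range(2)]
    for x, y, polarity, t in events:
        inside_window = t_start <= t < t_end
        inside_sensor = 0 <= x < width and 0 <= y < height
        if inside_window and inside_sensor:
            hist[polarity][y][x] += 1
    return hist


def slice_stream(events: Sequence[Event], delta_t: float, num_bins: int,
                 height: int, width: int) -> List[Tensor3D]:
    """Build the sequence H_0 ... H_{num_bins-1}, one tensor per interval of length delta_t."""
    return [
        events_to_histogram(events, k * delta_t, (k + 1) * delta_t, height, width)
        for k in range(num_bins)
    ]


if __name__ == "__main__":
    demo_events = [(1, 0, 1, 0.01), (1, 0, 1, 0.02), (0, 1, 0, 0.06)]
    tensors = slice_stream(demo_events, delta_t=0.05, num_bins=2, height=2, width=3)
    print(tensors[0][1][0][1], tensors[1][0][1][0])
```

A recurrent detector would consume these tensors in order, carrying its internal state from H_k to H_{k+1}. A real pipeline would use an array library rather than nested lists; plain lists keep the sketch dependency-free.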
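Experimental Results
Research Questions
- RQ1: Can a high-resolution (1 Mpx) event camera support robust detection in driving scenes with pedestrians and vehicles, without reconstructing gray-level frames?
- RQ2: Compared with feed-forward approaches, does a memory-based recurrent architecture improve detection accuracy and temporal consistency on event streams?
- RQ3: How does the temporal consistency loss affect localization accuracy over time?
- RQ4: How does the method perform against state-of-the-art event-based and frame-based detectors on a large-scale automotive dataset?
- RQ5: Is there a large-scale automatic labeling protocol that can produce a dataset usable for event-based object detection?
Key Findings
- The authors publicly release the first large-scale 1 megapixel event camera detection dataset, with 14.65 hours of driving data and 25 million bounding boxes.
- RED, a recurrent ConvLSTM-based detector with a multi-scale SSD-style head, achieves state-of-the-art performance among event-based methods on the 1 Mpx dataset.
- Direct event-based detection (without intensity reconstruction) matches the accuracy of frame-based detectors on the 1 Mpx dataset and outperforms several event-based baselines.
- The proposed temporal consistency loss (L_t) raises mAP by about 2 percentage points and mAP_75 by about 4 percentage points, and makes IoU more stable over time.
- Memory carried in a non-zero internal state is essential to performance: removing the memory costs about 12 percentage points.
- RED beats alternatives such as Events-RetinaNet and E2Vid-RetinaNet in both accuracy and speed, running 21x faster than E2Vid-RetinaNet on the 1 Mpx dataset.
- The model generalizes well to night sequences and to different camera types, which suggests that the event-based representation is robust to changes in lighting and sensor.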
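To make the temporal consistency term behind the mAP gains more concrete, here is a minimal sketch of how such a loss could be computed. It assumes that the auxiliary head's prediction B'_{k+1}, made at step k, is compared with the reference boxes of step k+1 using the same smooth L1 penalty as the main regression term, and that predictions are already matched one-to-one with their targets. The box encoding (four numbers per box), the `beta` threshold, the unweighted sum, and all function names are assumptions for illustration, not details taken from the paper.

```python
from typing import Sequence

Box = Sequence[float]  # assumed encoding: four regression values per box


def smooth_l1(pred: Box, target: Box, beta: float = 1.0) -> float:
    """Smooth L1 penalty summed over box coordinates (quadratic below beta, linear above)."""
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        total += 0.5 * diff * diff / beta if diff < beta else diff - 0.5 * beta
    return total


def mean_box_loss(preds: Sequence[Box], targets: Sequence[Box]) -> float:
    """Average smooth L1 over already-matched (prediction, target) pairs."""
    if not preds:
        return 0.0
    return sum(smooth_l1(p, t) for p, t in zip(preds, targets)) / len(preds)


def regression_with_consistency(cur_preds: Sequence[Box], cur_targets: Sequence[Box],
                                next_preds: Sequence[Box], next_targets: Sequence[Box]) -> float:
    """Return L_r + L_t under the assumptions stated above.

    L_r compares B_k with the targets of step k; L_t compares B'_{k+1},
    predicted at step k, with the targets of step k+1.
    """
    l_r = mean_box_loss(cur_preds, cur_targets)
    l_t = mean_box_loss(next_preds, next_targets)
    return l_r + l_t


if __name__ == "__main__":
    b_k = [[0.0, 0.0, 0.0, 0.0]]
    gt_k = [[0.0, 0.0, 0.0, 0.0]]
    b_next = [[0.0, 0.0, 0.0, 0.0]]
    gt_next = [[0.5, 0.0, 0.0, 2.0]]
    print(regression_with_consistency(b_k, gt_k, b_next, gt_next))
```

The intuition is that asking the network to anticipate the next step's boxes from its current state pushes the recurrent memory to encode motion, which is consistent with the reported improvement in IoU stability over time.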
This walkthrough was generated by AI and reviewed by a human editor.