QUICK REVIEW

[Paper Review] Learning to Detect Objects with a 1 Megapixel Event Camera

Etienne Pérot, Pierre de Tournemire|arXiv (Cornell University)|Sep 28, 2020

Advanced Memory and Neural Computing|References: 65|Citations: 142

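One-line summary

This paper proposes an object detector for high-resolution 1 Mpx event cameras, built on a recurrent ConvLSTM architecture that works without reconstructing grayscale images. It also releases a large-scale 1 Mpx automotive detection dataset and reports performance on par with frame-based detectors.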

ABSTRACT

Event cameras encode visual information with high temporal precision, low data-rate, and high-dynamic range. Thanks to these characteristics, event cameras are particularly suited for scenarios with high motion, challenging lighting conditions and requiring low latency. However, due to the novelty of the field, the performance of event-based systems on many vision tasks is still lower compared to conventional frame-based solutions. The main reasons for this performance gap are: the lower spatial resolution of event sensors, compared to frame cameras; the lack of large-scale training datasets; the absence of well established deep learning architectures for event-based processing. In this paper, we address all these problems in the context of an event-based object detection task. First, we publicly release the first high-resolution large-scale dataset for object detection. The dataset contains more than 14 hours recordings of a 1 megapixel event camera, in automotive scenarios, together with 25M bounding boxes of cars, pedestrians, and two-wheelers, labeled at high frequency. Second, we introduce a novel recurrent architecture for event-based detection and a temporal consistency loss for better-behaved training. The ability to compactly represent the sequence of events into the internal memory of the model is essential to achieve high accuracy. Our model outperforms by a large margin feed-forward event-based architectures. Moreover, our method does not require any reconstruction of intensity images from events, showing that training directly from raw events is possible, more efficient, and more accurate than passing through an intermediate intensity image. Experiments on the dataset introduced in this work, for which events and gray level images are available, show performance on par with that of highly tuned and studied frame-based detectors.

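Motivation and objectives

Release the first large-scale, high-resolution (1 megapixel) dataset for event-based object detection, recorded in automotive scenarios and annotated with 25M bounding boxes.
Develop a recurrent architecture with memory that detects objects directly from raw events, without reconstructing intensity frames.
Introduce a temporal consistency loss that improves localization stability.
Show that event-based detection can match frame-based detectors on a large-scale task.
Provide ablations and benchmarks against state-of-the-art event-based and frame-based detectors.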

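Proposed method

Preprocess the events of each time interval into a dense tensor map H_k of shape C x M x N.
Extract features from H_k with a feed-forward CNN built from Squeeze-and-Excitation blocks.
Add ConvLSTM layers to form a spatio-temporal detector with memory.
Attach single-shot detector (SSD) style regression and classification heads to the multi-scale features from the recurrent layers.
Train with a loss that combines regression L_r (smooth L1), classification L_c (softmax focal loss), and a temporal consistency loss L_t (dual regression heads predicting B_k and B'_{k+1}).
Optionally, combine the recurrent feature extractor with other detector families (e.g., RetinaNet).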
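
The first step above, turning the events of one time interval into a dense tensor H_k, can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes a two-channel per-polarity event histogram as the representation (so C = 2), and the event tuple layout `(x, y, polarity, timestamp)` and the function name `events_to_histogram` are hypothetical.

```python
from typing import List, Tuple

# Hypothetical event layout: (x, y, polarity, timestamp), polarity in {0, 1}.
Event = Tuple[int, int, int, int]


def events_to_histogram(
    events: List[Event], t_start: int, t_end: int, height: int, width: int
) -> List[List[List[int]]]:
    """Count events per polarity and pixel within [t_start, t_end).

    Returns a nested list indexed as hist[c][row][col], i.e. a C x M x N
    tensor with C = 2 polarity channels (an assumed representation).
    """
    hist = [[[0] * width for _ in range(height)] for _ in range(2)]
    for x, y, p, t in events:
        inside_window = t_start <= t < t_end
        inside_sensor = 0 <= x < width and 0 <= y < height
        if inside_window and inside_sensor:
            hist[p][y][x] += 1
    return hist


# Four toy events; the last one falls outside the time window.
events = [(0, 0, 1, 10), (0, 0, 1, 20), (2, 1, 0, 30), (1, 1, 1, 60)]
h_k = events_to_histogram(events, t_start=0, t_end=50, height=2, width=3)
print(h_k[1][0][0])  # → 2
print(h_k[0][1][2])  # → 1
```

In a real pipeline the same counting would be done with array operations on a 1 Mpx grid; the nested lists here only keep the example dependency-free.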

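Experimental results

Research questions

RQ1: Can a high-resolution (1 Mpx) event camera support robust object detection in automotive scenarios without reconstructing grayscale images?
RQ2: Does a memory-based recurrent architecture improve detection accuracy and temporal consistency on event sequences compared with feed-forward approaches?
RQ3: How does the temporal consistency loss affect localization accuracy over time?
RQ4: How does the proposed method perform against state-of-the-art event-based and frame-based detectors on a large-scale automotive dataset?
RQ5: Can a large-scale automatic labeling protocol produce a practical dataset for event-based object detection?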

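Key findings

The authors release the first large-scale 1 megapixel event camera detection dataset, containing 14.65 hours of driving data and 25M bounding boxes.
The recurrent ConvLSTM-based detector with multi-scale SSD-style heads (RED) achieves state-of-the-art performance among event-based methods on the 1 Mpx dataset.
Direct event-based detection (no intensity reconstruction) matches the accuracy of frame-based detectors on the 1 Mpx dataset and outperforms several event-based baselines.
The proposed temporal consistency loss (L_t) improves mAP by about 2 points and mAP_75 by about 4 points, and makes IoU more stable over time.
Memory (a non-zero internal state) matters: removing it lowers performance by about 12 points.
RED outperforms Events-RetinaNet and E2Vid-RetinaNet in both accuracy and speed, running 21x faster than E2Vid-RetinaNet on the 1 Mpx dataset.
The model generalizes to night sequences and to different camera types, indicating that the event-based representation is robust to lighting and sensor variation.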
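
To make the role of L_t concrete, the sketch below combines the three loss terms named in this review for a single box. Several details are assumptions rather than facts from the paper: boxes are plain 4-number lists, the smooth L1 threshold `beta` is 1.0, the focal parameter `gamma` is 2.0, the three terms are summed with equal weight, and L_t is taken to be a smooth L1 penalty between the box B'_{k+1} predicted at step k and the ground-truth box at step k+1. All function names are hypothetical.

```python
import math
from typing import Sequence


def smooth_l1(diff: float, beta: float = 1.0) -> float:
    """Quadratic near zero, linear beyond beta (assumed beta = 1.0)."""
    a = abs(diff)
    return 0.5 * a * a / beta if a < beta else a - 0.5 * beta


def box_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Sum of smooth L1 over the box coordinates."""
    return sum(smooth_l1(p - t) for p, t in zip(pred, target))


def softmax_focal_loss(logits: Sequence[float], label: int, gamma: float = 2.0) -> float:
    """Focal loss on the softmax probability of the true class (assumed gamma = 2.0)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    p_t = exps[label] / sum(exps)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def total_loss(box_k, box_next_pred, gt_k, gt_next, logits, label) -> float:
    l_r = box_loss(box_k, gt_k)              # regression on B_k
    l_c = softmax_focal_loss(logits, label)  # classification
    l_t = box_loss(box_next_pred, gt_next)   # assumed form of L_t on B'_{k+1}
    return l_r + l_c + l_t                   # equal weights are an assumption


loss = total_loss(
    box_k=[0.5, 0.0, 0.0, 0.0], box_next_pred=[2.0, 0.0, 0.0, 0.0],
    gt_k=[0.0, 0.0, 0.0, 0.0], gt_next=[0.0, 0.0, 0.0, 0.0],
    logits=[0.0, 0.0], label=0,
)
print(round(loss, 4))
```

Under these assumptions, the second regression head is penalized when its forecast for the next step drifts from where the object actually is, which is one plausible reading of why the reviewed results show steadier IoU over time.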

This review was generated by AI and checked by a human editor.