QUICK REVIEW

[Paper Review] Learning to Detect Objects with a 1 Megapixel Event Camera

Etienne Pérot, Pierre de Tournemire|arXiv (Cornell University)|2020. 09. 28.

Advanced Memory and Neural Computing | References: 65 | Citations: 142

One-Line Summary

This paper presents a recurrent ConvLSTM-based architecture for object detection with a high-resolution 1 Mpx event camera that performs on par with frame-based detectors without reconstructing grayscale frames, and it releases a large 1 Mpx automotive detection dataset.

ABSTRACT

Event cameras encode visual information with high temporal precision, low data-rate, and high-dynamic range. Thanks to these characteristics, event cameras are particularly suited for scenarios with high motion, challenging lighting conditions and requiring low latency. However, due to the novelty of the field, the performance of event-based systems on many vision tasks is still lower compared to conventional frame-based solutions. The main reasons for this performance gap are: the lower spatial resolution of event sensors, compared to frame cameras; the lack of large-scale training datasets; the absence of well established deep learning architectures for event-based processing. In this paper, we address all these problems in the context of an event-based object detection task. First, we publicly release the first high-resolution large-scale dataset for object detection. The dataset contains more than 14 hours recordings of a 1 megapixel event camera, in automotive scenarios, together with 25M bounding boxes of cars, pedestrians, and two-wheelers, labeled at high frequency. Second, we introduce a novel recurrent architecture for event-based detection and a temporal consistency loss for better-behaved training. The ability to compactly represent the sequence of events into the internal memory of the model is essential to achieve high accuracy. Our model outperforms by a large margin feed-forward event-based architectures. Moreover, our method does not require any reconstruction of intensity images from events, showing that training directly from raw events is possible, more efficient, and more accurate than passing through an intermediate intensity image. Experiments on the dataset introduced in this work, for which events and gray level images are available, show performance on par with that of highly tuned and studied frame-based detectors.

Motivation and Objectives

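Release a large-scale, high-resolution (1 megapixel) dataset for event-based object detection, recorded in automotive scenarios and annotated with 25 million bounding boxes.
Develop a memory-enabled recurrent architecture that detects objects from raw events without reconstructing intensity frames.
Introduce a temporal consistency loss to improve the temporal stability of localization.
Demonstrate that event-based detectors can match frame-based detectors on a large-scale task.
Provide ablation studies and benchmark against state-of-the-art event-based and frame-based detectors.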

Proposed Method

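For each time interval, preprocess the events into a dense tensor map H_k of shape C x M x N.
Extract features from H_k with a feed-forward CNN equipped with Squeeze-and-Excitation blocks.
Introduce ConvLSTM layers to form a memory-enabled spatio-temporal detector.
Attach SSD-style regression/classification heads to the multi-scale features produced by the recurrent layers.
Train with a loss that combines a regression term L_r (smooth L1), a classification term L_c (softmax focal loss), and a temporal consistency term L_t (two regression heads predict B_k and B'_{k+1}).
Where needed, consider an extension that pairs the recurrent feature extractor with other detector families (e.g., RetinaNet).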
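
The first preprocessing step above can be illustrated with a minimal sketch. It assumes a simple per-polarity event histogram, so C = 2, with M as the number of rows and N as the number of columns. The review only states the C x M x N shape, so the choice of representation, the function name `events_to_histogram`, and the `(x, y, polarity)` tuple layout are all assumptions made for illustration, not the authors' code.

```python
def events_to_histogram(events, height, width):
    """Count events per polarity and pixel for one time window.

    `events` is an iterable of (x, y, polarity) tuples, polarity in {0, 1}.
    Returns a nested list of shape [2][height][width], i.e. C x M x N with
    C = 2 (an assumption; the review does not specify the channels).
    """
    hist = [[[0] * width for _ in range(height)] for _ in range(2)]
    for x, y, polarity in events:
        hist[polarity][y][x] += 1
    return hist


# Four toy events inside one time window on a 3 x 4 sensor.
window = [(0, 0, 1), (1, 2, 0), (1, 2, 0), (3, 1, 1)]
H_k = events_to_histogram(window, height=3, width=4)
print(H_k[0][2][1], H_k[1][0][0])  # prints "2 1"
```

Each dense tensor built this way would be passed to the feed-forward CNN, while the ConvLSTM state carries information across consecutive windows.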

Experimental Results

Research Questions

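RQ1: Can a high-resolution (1 Mpx) event camera be used for robust object detection in automotive scenarios without reconstructing grayscale frames?
RQ2: Does a memory-based recurrent architecture improve detection accuracy and temporal consistency on event streams compared with feed-forward approaches?
RQ3: How does the temporal consistency loss affect localization accuracy over time?
RQ4: How does the proposed method perform on the large-scale automotive dataset compared with state-of-the-art event-based and frame-based detectors?
RQ5: Can a large-scale automatic labeling protocol produce a usable dataset for event-based object detection?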

Key Results

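The authors release the first large-scale 1 megapixel event camera detection dataset, containing 14.65 hours of driving data and 25 million bounding boxes.
RED, a recurrent ConvLSTM-based detector with multi-scale SSD-style heads, achieves state-of-the-art performance among event-based methods on the 1 Mpx dataset.
Direct event-based detection, without intensity reconstruction, matches frame-based detectors in accuracy on the 1 Mpx dataset and outperforms several event-based baselines.
The proposed temporal consistency loss (L_t) improves mAP by about 2 percentage points and mAP_75 by about 4 percentage points, and increases IoU stability over time.
Memory (a non-zero internal state) plays an important role: removing it reduces performance by about 12 percentage points.
RED outperforms alternatives such as E2Vid-RetinaNet and Events-RetinaNet in both accuracy and speed, running 21 times faster than E2Vid-RetinaNet on the 1 Mpx dataset.
The model generalizes to night sequences and across different camera types, showing that the event-based representation is robust to lighting and sensor variation.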
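
To make the mAP_75 and IoU figures above concrete, here is a small sketch of box overlap. It assumes mAP_75 follows the common COCO-style convention in which a prediction counts as a match only when its IoU with a ground-truth box is at least 0.75. The review does not define the metric, and the helper names `iou` and `is_match` and the `(x1, y1, x2, y2)` box layout are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped to zero when boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_match(pred, gt, threshold=0.75):
    """Assumed matching rule: IoU must reach the threshold."""
    return iou(pred, gt) >= threshold


gt = (0, 0, 10, 10)
tight = (1, 0, 11, 10)   # shifted by 1 pixel: IoU = 90 / 110
loose = (3, 0, 13, 10)   # shifted by 3 pixels: IoU = 70 / 130
print(is_match(tight, gt), is_match(loose, gt), is_match(loose, gt, 0.5))
```

Under this assumption, a stricter threshold rewards precise localization, which is consistent with the larger gain reported for mAP_75 than for mAP when the temporal consistency loss is added.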

This review was generated by AI and reviewed by a human editor.