QUICK REVIEW

[Paper Review] End-to-End Multi-View Fusion for 3D Object Detection in LiDAR Point Clouds

Yin Zhou, Pei Sun | arXiv (Cornell University) | October 15, 2019

Advanced Neural Network Applications | Citations: 174

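One-Line Summary

Proposes an end-to-end multi-view fusion (MVF) framework that uses dynamic voxelization to combine the bird's-eye view (BEV) and the perspective view, improving 3D object detection from LiDAR and achieving higher accuracy than single-view baselines on the Waymo and KITTI datasets.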

ABSTRACT

Recent work on 3D object detection advocates point cloud voxelization in birds-eye view, where objects preserve their physical dimensions and are naturally separable. When represented in this view, however, point clouds are sparse and have highly variable point density, which may cause detectors difficulties in detecting distant or small objects (pedestrians, traffic signs, etc.). On the other hand, perspective view provides dense observations, which could allow more favorable feature encoding for such cases. In this paper, we aim to synergize the birds-eye view and the perspective view and propose a novel end-to-end multi-view fusion (MVF) algorithm, which can effectively learn to utilize the complementary information from both. Specifically, we introduce dynamic voxelization, which has four merits compared to existing voxelization methods, i) removing the need of pre-allocating a tensor with fixed size; ii) overcoming the information loss due to stochastic point/voxel dropout; iii) yielding deterministic voxel embeddings and more stable detection outcomes; iv) establishing the bi-directional relationship between points and voxels, which potentially lays a natural foundation for cross-view feature fusion. By employing dynamic voxelization, the proposed feature fusion architecture enables each point to learn to fuse context information from different views. MVF operates on points and can be naturally extended to other approaches using LiDAR point clouds. We evaluate our MVF model extensively on the newly released Waymo Open Dataset and on the KITTI dataset and demonstrate that it significantly improves detection accuracy over the comparable single-view PointPillars baseline.

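Motivation and Objectives

Motivate improving 3D object detection by exploiting the complementary information in the BEV and perspective views of the same LiDAR point cloud.
Develop an end-to-end MVF architecture that operates at the point level to achieve effective cross-view feature fusion.
Introduce dynamic voxelization, which preserves all points and enables deterministic voxel embeddings.
Show that MVF with dynamic voxelization outperforms single-view baselines on the Waymo Open Dataset and KITTI.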

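Proposed Method

Each LiDAR point is embedded into a high-dimensional feature space. Dynamic voxelization is applied in the BEV (Cartesian coordinates) and in the perspective view (spherical coordinates) to build a bi-directional point-voxel mapping.
View-dependent features are computed with a separate FC layer per view, and voxel-level information is aggregated by max pooling.
Point features drawn from (i) the BEV voxel features, (ii) the perspective voxel features, and (iii) the point's own features are fused into an enhanced point embedding.
Voxel feature maps are processed with a convolution tower to capture context while preserving resolution.
Training uses the same loss as SECOND and PointPillars, with focal loss for classification and SmoothL1 for regression; optimization uses Adam with cosine learning-rate decay.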
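
The voxelization, pooling, and fusion steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the random weights, the feature width of 16, the voxel sizes, the synthetic points, and the helper names (`dynamic_voxelize`, `max_pool_per_voxel`, `to_spherical`, `fc_relu`, `multi_view_fuse`) are all hypothetical, and the convolution tower is omitted, so pooled voxel features are gathered straight back to their points.

```python
import numpy as np


def dynamic_voxelize(coords, voxel_size, origin):
    """Assign every point to a voxel without dropping points or voxels.

    Returns the unique integer voxel coordinates and, for each point, the row
    index of its voxel. Together these give the point<->voxel mapping.
    """
    grid = np.floor((coords - origin) / voxel_size).astype(np.int64)
    voxel_ids, point_to_voxel = np.unique(grid, axis=0, return_inverse=True)
    return voxel_ids, point_to_voxel.reshape(-1)


def max_pool_per_voxel(point_feats, point_to_voxel, num_voxels):
    """Channel-wise max over all points that fall into the same voxel."""
    pooled = np.full((num_voxels, point_feats.shape[1]), -np.inf)
    np.maximum.at(pooled, point_to_voxel, point_feats)
    return pooled


def to_spherical(xyz):
    """Convert Cartesian points to (azimuth, inclination, range)."""
    r = np.linalg.norm(xyz, axis=1)
    azimuth = np.arctan2(xyz[:, 1], xyz[:, 0])
    inclination = np.arccos(np.clip(xyz[:, 2] / np.maximum(r, 1e-9), -1.0, 1.0))
    return np.stack([azimuth, inclination, r], axis=1)


def fc_relu(x, weight):
    """Stand-in for a learned fully connected layer followed by ReLU."""
    return np.maximum(x @ weight, 0.0)


def multi_view_fuse(points, rng, width=16):
    # Shared per-point embedding (random weights, for illustration only).
    point_feats = fc_relu(points, rng.normal(size=(3, width)))
    fused = [point_feats]
    views = [
        (points[:, :2], np.array([0.5, 0.5])),                   # BEV: x, y in metres
        (to_spherical(points)[:, :2], np.deg2rad([1.0, 1.0])),   # perspective: angles
    ]
    for coords, size in views:
        _, p2v = dynamic_voxelize(coords, size, coords.min(axis=0))
        view_feats = fc_relu(point_feats, rng.normal(size=(width, width)))
        voxel_feats = max_pool_per_voxel(view_feats, p2v, p2v.max() + 1)
        fused.append(voxel_feats[p2v])  # gather each voxel feature back to its points
    # Per point: own features + BEV voxel features + perspective voxel features.
    return np.concatenate(fused, axis=1)


rng = np.random.default_rng(0)
points = rng.uniform(-10.0, 10.0, size=(1000, 3))  # synthetic stand-in for a LiDAR sweep
fused = multi_view_fuse(points, rng)
print(fused.shape)
```

Because `point_to_voxel` is kept alongside the voxel list, the same index array serves both directions: scattering point features into voxels for pooling, and gathering voxel features back to points for fusion.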

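Experimental Results

Research Questions

RQ1: Can dual-view (BEV and perspective) representations of the same LiDAR point cloud provide complementary context that improves 3D object detection?
RQ2: Is dynamic voxelization better than conventional hard voxelization at preserving information and stabilizing detection?
RQ3: How does MVF compare with single-view baselines on large-scale and standard benchmarks (Waymo Open Dataset and KITTI) for vehicle and pedestrian detection?
RQ4: Can the MVF approach generalize to LiDAR-based detectors other than the baseline used?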

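Key Results

MVF with dynamic voxelization consistently improves detection accuracy over the HV+SV (hard voxelization, single view) and DV+SV (dynamic voxelization, single view) baselines on the Waymo vehicle and pedestrian tasks.
Dynamic voxelization preserves all points and voxels, yielding deterministic voxel embeddings and reducing information loss.
Combining the BEV and perspective views provides complementary context, with larger gains at longer range and for small or occluded objects such as pedestrians.
On the Waymo dataset, MVF achieves higher BEV and 3D AP than HV+SV and DV+SV across the distance ranges (0-30 m, 30-50 m, and beyond 50 m).
On KITTI, MVF is competitive in 3D car detection and outperforms HV+SV and DV+SV on the easy, moderate, and hard settings.
MVF shows favorable latency characteristics relative to the baseline methods, enabling real-time inference.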
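
To make the contrast between hard and dynamic voxelization concrete, the following sketch compares how many points each scheme keeps. It is an illustration under assumptions rather than the paper's code: the buffer limits (`max_voxels=50`, `max_points=5`), the 1 m grid, the synthetic 2D points, and the function names `hard_voxelize` and `dynamic_assign` are hypothetical, and the random visiting order is one simple way to model stochastic point/voxel dropout.

```python
import numpy as np


def hard_voxelize(grid, max_voxels, max_points, rng):
    """Fixed-capacity voxelization: returns a mask of the points that survive."""
    kept = np.zeros(len(grid), dtype=bool)
    counts = {}
    for i in rng.permutation(len(grid)):  # random order models stochastic dropout
        key = tuple(grid[i])
        if key not in counts:
            if len(counts) >= max_voxels:
                continue  # voxel buffer is full: this point's voxel is dropped
            counts[key] = 0
        if counts[key] < max_points:
            counts[key] += 1
            kept[i] = True  # otherwise the voxel is full and the point is dropped
    return kept


def dynamic_assign(grid):
    """Dynamic voxelization: every point keeps an index into the voxel list."""
    voxel_ids, point_to_voxel = np.unique(grid, axis=0, return_inverse=True)
    return voxel_ids, point_to_voxel.reshape(-1)


points = np.random.default_rng(0).normal(scale=5.0, size=(2000, 2))
grid = np.floor(points / 1.0).astype(np.int64)  # 1 m cells

kept_a = hard_voxelize(grid, max_voxels=50, max_points=5, rng=np.random.default_rng(1))
kept_b = hard_voxelize(grid, max_voxels=50, max_points=5, rng=np.random.default_rng(2))
voxel_ids, p2v = dynamic_assign(grid)

print("hard voxelization keeps", int(kept_a.sum()), "of", len(points), "points")
print("dynamic voxelization keeps", len(p2v), "of", len(points), "points")
```

In this toy setup the hard scheme can keep at most `max_voxels * max_points` points and its output depends on the sampling order, whereas the dynamic assignment retains every point and is identical on every call.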

This review was generated by AI and reviewed by a human editor.