QUICK REVIEW

[Paper Review] TorchSparse: Efficient Point Cloud Inference Engine

Haotian Tang, Zhijian Liu|arXiv (Cornell University)|2022. 04. 21.

Advanced Neural Network Applications | Citations: 40

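One-Line Summary

TorchSparse accelerates sparse convolution on 3D point clouds on GPUs by improving computation regularity and reducing data movement, achieving end-to-end speedups of up to about 1.6x over MinkowskiEngine and SpConv across multiple models and datasets.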

ABSTRACT

Deep learning on point clouds has received increased attention thanks to its wide applications in AR/VR and autonomous driving. These applications require low latency and high accuracy to provide real-time user experience and ensure user safety. Unlike conventional dense workloads, the sparse and irregular nature of point clouds poses severe challenges to running sparse CNNs efficiently on the general-purpose hardware. Furthermore, existing sparse acceleration techniques for 2D images do not translate to 3D point clouds. In this paper, we introduce TorchSparse, a high-performance point cloud inference engine that accelerates the sparse convolution computation on GPUs. TorchSparse directly optimizes the two bottlenecks of sparse convolution: irregular computation and data movement. It applies adaptive matrix multiplication grouping to trade computation for better regularity, achieving 1.4-1.5x speedup for matrix multiplication. It also optimizes the data movement by adopting vectorized, quantized and fused locality-aware memory access, reducing the memory movement cost by 2.7x. Evaluated on seven representative models across three benchmark datasets, TorchSparse achieves 1.6x and 1.5x measured end-to-end speedup over the state-of-the-art MinkowskiEngine and SpConv, respectively.

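Motivation and Goals

Enable real-time inference on 3D point clouds for AR/VR and autonomous driving.
Address the irregular computation and data movement bottlenecks of sparse convolution on GPUs.
Propose a system optimized with adaptive batching and locality-aware data access to improve performance.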

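Proposed Method

Introduces adaptive matrix multiplication grouping, which trades extra FLOPs for better regularity and improves GPU utilization.
Applies quantized, vectorized, and fused memory access to reduce data movement, and optimizes scatter/gather.
Fuses the mapping operations and uses locality-aware ordering to maximize data reuse during inference.
Implements a CUDA backend for fast sparse convolution inference while providing a PyTorch-like API.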
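
The gather, matrix multiplication, and scatter steps that these optimizations target can be illustrated with a small reference sketch. This is plain Python written for readability, not TorchSparse's CUDA implementation or its API; the function name `sparse_conv_reference` and the dictionary layout of `maps` and `weights` are hypothetical choices made for this example. It only shows why data movement (gather and scatter) and irregular workloads (one matmul per kernel offset, each of a different size) appear in sparse convolution.

```python
def sparse_conv_reference(in_feats, weights, maps, num_out):
    """Reference gather -> matmul -> scatter dataflow (illustrative only).

    in_feats: list of input feature vectors, each of length C_in
    weights:  dict mapping kernel offset -> C_in x C_out matrix (list of rows)
    maps:     dict mapping kernel offset -> list of (in_idx, out_idx) pairs
    num_out:  number of output points
    """
    c_out = len(next(iter(weights.values()))[0])
    out = [[0.0] * c_out for _ in range(num_out)]
    for offset, pairs in maps.items():
        w = weights[offset]
        # Gather: collect the input rows touched by this kernel offset.
        gathered = [in_feats[i] for i, _ in pairs]
        # Matmul: one dense product per offset; its size varies with len(pairs).
        partial = [
            [sum(row[k] * w[k][j] for k in range(len(row))) for j in range(c_out)]
            for row in gathered
        ]
        # Scatter: accumulate partial results into their output rows.
        for (_, o), vec in zip(pairs, partial):
            for j in range(c_out):
                out[o][j] += vec[j]
    return out


if __name__ == "__main__":
    feats = [[1.0, 2.0], [3.0, 4.0]]
    w = {0: [[1.0], [0.0]], 1: [[0.0], [1.0]]}
    m = {0: [(0, 0), (1, 1)], 1: [(1, 0)]}
    print(sparse_conv_reference(feats, w, m, num_out=2))
```

In this toy setup the two offsets have workloads of size 2 and 1, which hints at the irregularity that grouping is meant to smooth out, while the gather and scatter loops are the memory traffic that vectorized, quantized, and fused access aims to reduce.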

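Experimental Results

Research Questions

RQ1: Can adaptive matrix multiplication batching improve the utilization and speed of sparse convolution on GPUs?
RQ2: How much do data movement optimizations (vectorized scatter/gather, locality-aware access) reduce the runtime of sparse convolution?
RQ3: How does TorchSparse compare with state-of-the-art sparse engines (MinkowskiEngine, SpConv) on common 3D point cloud benchmarks?
RQ4: Is the proposed method robust across datasets (SemanticKITTI, nuScenes, Waymo) and models (MinkUNet, CenterPoint)?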

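Key Results

TorchSparse delivers end-to-end speedups of 1.6x over MinkowskiEngine and 1.5x over SpConv on the evaluated models and datasets.
Adaptive matrix multiplication grouping yields a 1.4-1.5x speedup for matrix multiplication and an overall improvement of 1.6x to 2.0x across configurations.
Data movement optimizations (vectorized scatter/gather, FP16 quantization with locality-aware memory access, and fused kernels) substantially reduce DRAM accesses and yield speedups of up to about 1.9x in data movement.
Fusing the mapping operations with locality-aware input/output access greatly reduces mapping overhead, contributing to an overall gain of up to about 2.3x on detectors.
TorchSparse achieves real-time inference (≥10 FPS) across several GPUs for many MinkUNet/CenterPoint configurations, including 3-frame models on the nuScenes and Waymo datasets.
Adaptive, dataset- and hardware-specific tuning of the grouping parameters (ε, S) improves matrix multiplication performance by up to about 1.5x and raises overall throughput across configurations.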
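
To make the FLOPs-for-regularity trade-off concrete, here is a minimal sketch of one possible grouping heuristic. It is not the paper's algorithm: the greedy strategy and the function name `group_workloads` are assumptions made for illustration, `epsilon` is treated here as a tolerance on the fraction of padded (redundant) computation, which is an assumed reading of ε, and the second parameter S is not modeled. Each group is imagined as being padded to its largest member so that it could run as a single batched matrix multiplication.

```python
def group_workloads(sizes, epsilon):
    """Greedily group consecutive per-offset workloads (illustrative heuristic).

    sizes:   number of (input, output) pairs for each kernel offset
    epsilon: assumed tolerance on the wasted fraction after padding every
             member of a group to the size of the group's largest member
    Returns a list of groups, each a list of workload indices.
    """
    groups = []
    current = []  # (index, size) pairs in the group being built
    for idx, size in enumerate(sizes):
        candidate = current + [(idx, size)]
        padded = max(s for _, s in candidate) * len(candidate)
        real = sum(s for _, s in candidate)
        redundancy = 1.0 - real / padded if padded else 0.0
        if current and redundancy > epsilon:
            # Adding this workload would waste too much; start a new group.
            groups.append(current)
            current = [(idx, size)]
        else:
            current = candidate
    if current:
        groups.append(current)
    return [[i for i, _ in g] for g in groups]


if __name__ == "__main__":
    workloads = [100, 98, 50, 52, 10]
    print(group_workloads(workloads, epsilon=0.1))  # -> [[0, 1], [2, 3], [4]]
    print(group_workloads(workloads, epsilon=0.0))  # -> [[0], [1], [2], [3], [4]]
```

A larger `epsilon` merges more workloads into fewer, more regular batches at the cost of extra padded computation, and a value of 0 falls back to one matmul per offset. That dependence on the workload distribution is one reason such a parameter would plausibly be tuned per dataset and per GPU.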


This review was generated by AI and reviewed by a human editor.