QUICK REVIEW

[Paper Review] PSTNet: Point Spatio-Temporal Convolution on Point Cloud Sequences

Hehe Fan, Xin Yu|arXiv (Cornell University)|2022. 05. 27.

Human Pose and Action Recognition|References: 58|Citations: 69

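One-Line Summary

PSTNet introduces a point-based spatio-temporal convolution that disentangles space and time in dynamic point clouds, and stacks it into a hierarchical network for 3D action recognition and 4D semantic segmentation.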

ABSTRACT

Point cloud sequences are irregular and unordered in the spatial dimension while exhibiting regularities and order in the temporal dimension. Therefore, existing grid based convolutions for conventional video processing cannot be directly applied to spatio-temporal modeling of raw point cloud sequences. In this paper, we propose a point spatio-temporal (PST) convolution to achieve informative representations of point cloud sequences. The proposed PST convolution first disentangles space and time in point cloud sequences. Then, a spatial convolution is employed to capture the local structure of points in the 3D space, and a temporal convolution is used to model the dynamics of the spatial regions along the time dimension. Furthermore, we incorporate the proposed PST convolution into a deep network, namely PSTNet, to extract features of point cloud sequences in a hierarchical manner. Extensive experiments on widely-used 3D action recognition and 4D semantic segmentation datasets demonstrate the effectiveness of PSTNet to model point cloud sequences.
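The decoupling described in the abstract can be written compactly. The notation below is an assumption made for illustration and is not taken from the text above: F_t(p) is the feature at position p in frame t, r is a spatial search radius, l is an odd temporal kernel size, S(δ) is a spatial kernel that depends only on the displacement δ, and T_k is a temporal kernel indexed by frame offset k.

```latex
% Sketch of a space-time decoupled convolution (assumed notation).
% Inner sum: spatial convolution over neighbors within radius r in frame t+k.
% Outer sum: temporal convolution over the l frames centered on frame t.
F'_t(p) \;=\; \sum_{k=-\lfloor l/2 \rfloor}^{\lfloor l/2 \rfloor} T_k
        \sum_{\lVert \delta \rVert \le r} S(\delta)\, F_{t+k}(p + \delta)
```

Because S does not depend on k, the same spatial kernel is shared across frames, and only T_k models how each spatial region changes over time.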

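Motivation and Objectives

Model dynamic, irregular point clouds directly, without voxelization or point tracking.
Propose a PST convolution that separates spatial structure from temporal dynamics in point cloud sequences.
Build PSTNet architectures for sequence-level classification and point-level prediction.
Demonstrate effectiveness on 3D action recognition and 4D semantic segmentation benchmarks.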

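Proposed Method

Disentangle space and time in point cloud sequences and define the PST convolution on that decomposition.
Perform spatial convolution over local 3D neighborhoods using a learned displacement-based kernel function f(delta; theta).
Apply temporal convolution across a local window of frames to capture dynamics.
Construct point tubes from temporal anchor frames and spatial anchors chosen by farthest point sampling (FPS), so that the spatio-temporal convolution can be applied.
Introduce a PST transposed convolution that upsamples and interpolates features for dense point-level prediction.
Stack multiple PST layers (and transposed layers) into PSTNet architectures for action recognition and semantic segmentation.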
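The steps listed above can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes a linear form for the displacement kernel (theta times delta), a simple ball query, and that frames falling outside the clip contribute nothing. All function and parameter names (`farthest_point_sampling`, `spatial_conv`, `pst_conv`, `theta`, `w_s`, `w_t`) are hypothetical.

```python
import numpy as np

def farthest_point_sampling(xyz, m):
    """Return m indices into xyz (N, 3), each the farthest from those already chosen."""
    chosen = [0]  # start from the first point for determinism
    dist = np.linalg.norm(xyz - xyz[0], axis=1)
    for _ in range(m - 1):
        idx = int(np.argmax(dist))
        chosen.append(idx)
        dist = np.minimum(dist, np.linalg.norm(xyz - xyz[idx], axis=1))
    return np.array(chosen)

def spatial_conv(anchors, xyz, feat, r, theta, w_s):
    """Spatial step: for each anchor, sum kernel-weighted neighbor features.

    anchors (M, 3), xyz (N, 3), feat (N, C_in),
    theta (C_mid, 3), w_s (C_mid, C_in) -> output (M, C_mid).
    The kernel f(delta; theta) = theta @ delta is an assumed linear form.
    """
    out = np.zeros((anchors.shape[0], theta.shape[0]))
    for i, a in enumerate(anchors):
        delta = xyz - a                              # displacements to the anchor
        mask = np.linalg.norm(delta, axis=1) <= r    # ball query with radius r
        if not mask.any():
            continue                                 # empty neighborhood stays zero
        kernel = delta[mask] @ theta.T               # (n, C_mid)
        out[i] = (kernel * (feat[mask] @ w_s.T)).sum(axis=0)
    return out

def pst_conv(seq_xyz, seq_feat, t, l, m, r, theta, w_s, w_t):
    """Decoupled spatio-temporal step around anchor frame t.

    Spatial anchors are sampled in frame t and reused in the neighboring
    frames (a point tube); w_t has shape (l, C_out, C_mid).
    """
    assert l % 2 == 1, "this sketch assumes an odd temporal kernel size"
    anchors = seq_xyz[t][farthest_point_sampling(seq_xyz[t], m)]
    out = np.zeros((m, w_t.shape[1]))
    half = l // 2
    for k in range(-half, half + 1):
        s = t + k
        if s < 0 or s >= len(seq_xyz):
            continue  # assumed padding choice: out-of-clip frames add nothing
        spatial = spatial_conv(anchors, seq_xyz[s], seq_feat[s], r, theta, w_s)
        out += spatial @ w_t[k + half].T             # temporal kernel for offset k
    return anchors, out

# Usage on random data: 5 frames, 64 points each, 4 input channels.
rng = np.random.default_rng(0)
seq_xyz = [rng.normal(size=(64, 3)) for _ in range(5)]
seq_feat = [rng.normal(size=(64, 4)) for _ in range(5)]
theta = rng.normal(size=(8, 3))
w_s = rng.normal(size=(8, 4))
w_t = rng.normal(size=(3, 16, 8))
anchors, out = pst_conv(seq_xyz, seq_feat, t=2, l=3, m=16, r=0.8,
                        theta=theta, w_s=w_s, w_t=w_t)
print(anchors.shape, out.shape)  # (16, 3) (16, 16)
```

Each call subsamples the points spatially (64 down to 16 anchors here) while widening the temporal context, which is how stacked layers could form a hierarchy. A real network would learn `theta`, `w_s`, and `w_t` and batch the neighbor search on a GPU.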

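Experimental Results

Research Questions

RQ1: Can separating spatial structure from temporal dynamics improve learning on dynamic point clouds?
RQ2: Does PSTNet deliver higher accuracy and efficiency than prior methods on 3D action recognition and 4D semantic segmentation?
RQ3: How do the temporal kernel size and the spatial radius affect performance on point cloud sequence tasks?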

Key Results

Method	Input	Frames	Accuracy (%)
Vieira et al.	depth	20	78.20
Kläser et al.	depth	18	81.43
Actionlet	skeleton	all	88.21
PointNet++	point	1	61.61
MeteorNet	point	4	78.11
MeteorNet	point	8	81.14
MeteorNet	point	12	86.53
MeteorNet	point	16	88.21
MeteorNet	point	24	88.50
PSTNet (ours)	point	4	81.14
PSTNet (ours)	point	8	83.50
PSTNet (ours)	point	12	87.88
PSTNet (ours)	point	16	89.90
PSTNet (ours)	point	24	91.20

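PSTNet achieves state-of-the-art accuracy on MSR-Action3D, outperforming MeteorNet at every listed clip length from 4 to 24 frames and reaching 91.20% with 24 frames.
On NTU RGB+D 60/120, PSTNet shows strong gains over skeleton-, depth-, and voxel-based baselines.
On Synthia 4D, PSTNet with temporal modeling (l=3) for 4D semantic segmentation outperforms the baselines while using fewer parameters than some competing methods.
Ablation studies indicate that longer clips and a suitable temporal kernel size improve action recognition, whereas the spatial radius trades off local-structure capture against discriminative power.
Visualizations show that PSTNet activates more strongly on moving regions, suggesting effective motion modeling.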
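The frame-matched comparison in the MSR-Action3D table can be checked with a few lines of arithmetic. The sketch below copies the point-based rows for MeteorNet and PSTNet from that table and computes the accuracy gain at each clip length. The variable names are illustrative only.

```python
# Accuracy (%) by number of frames, copied from the results table.
meteornet = {4: 78.11, 8: 81.14, 12: 86.53, 16: 88.21, 24: 88.50}
pstnet = {4: 81.14, 8: 83.50, 12: 87.88, 16: 89.90, 24: 91.20}

# Gain of PSTNet over MeteorNet at the same clip length, in percentage points.
gains = {frames: round(pstnet[frames] - meteornet[frames], 2) for frames in meteornet}

for frames, gain in gains.items():
    print(f"{frames:>2} frames: +{gain:.2f} points")
```

The gains range from 1.35 points (12 frames) to 3.03 points (4 frames), so PSTNet leads at every clip length in the table.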

This review was generated by AI and reviewed by a human editor.