QUICK REVIEW

[Paper Review] Transformer-Based Model for Monocular Visual Odometry: A Video Understanding Approach

André O. Françani, Marcos R. O. A. Máximo | arXiv (Cornell University) | 2023-05-10

Advanced Vision and Imaging | Citations: 8

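One-Line Summary

This paper presents TSformer-VO, an end-to-end Transformer-based architecture that treats monocular visual odometry (VO) as a video understanding task. It estimates 6-DoF camera poses from clips and achieves results on KITTI that are competitive with geometry-based and other deep learning-based VO methods.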

ABSTRACT

Estimating the camera's pose given images from a single camera is a traditional task in mobile robots and autonomous vehicles. This problem is called monocular visual odometry and often relies on geometric approaches that require considerable engineering effort for a specific scenario. Deep learning methods have been shown to be generalizable after proper training and with a large amount of available data. Transformer-based architectures have dominated the state-of-the-art in natural language processing and computer vision tasks, such as image and video understanding. In this work, we deal with the monocular visual odometry as a video understanding task to estimate the 6 degrees of freedom of a camera's pose. We contribute by presenting the TSformer-VO model based on spatio-temporal self-attention mechanisms to extract features from clips and estimate the motions in an end-to-end manner. Our approach achieved competitive state-of-the-art performance compared with geometry-based and deep learning-based methods on the KITTI visual odometry dataset, outperforming the DeepVO implementation highly accepted in the visual odometry community. The code is publicly available at https://github.com/aofrancani/TSformer-VO.

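Motivation and Objectives

Motivate end-to-end monocular VO without hand-engineered geometric modules by using Transformer-based video understanding.
Develop TSformer-VO to extract spatio-temporal features from image clips and regress 6-DoF poses.
Demonstrate performance on KITTI that is competitive with geometry-based and deep learning-based VO methods.
Share the code and pretrained models to promote reproducibility and community adoption.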

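Proposed Method

Treats monocular VO as a video understanding task and estimates 6-DoF poses from frame sequences.
Uses divided space-time self-attention, inspired by TimeSformer, to model temporal and spatial dependencies efficiently.
Applies an MSE regression loss to predict the relative poses for a clip of $N_f$ frames (yielding $N_f - 1$ poses).
Preprocessing converts absolute poses into relative transformations and encodes rotations as Euler angles.
Post-processing de-normalizes the outputs, converts the Euler angles back into a rotation representation, and averages the repeated pose estimates from overlapping clips.
Trains end-to-end with supervision on KITTI sequences using a sliding window of $N_f$ frames, and applies a 7-DoF alignment at evaluation time.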
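The pre- and post-processing steps listed above can be illustrated with a small sketch. This is not the authors' code: the function names are hypothetical, the Z-Y-X Euler convention is an assumption (the review does not state which convention the paper uses), and normalization and de-normalization are left out. It shows how a relative motion between two absolute poses might be computed and how repeated estimates from overlapping clips might be averaged per frame transition.

```python
import math
from collections import defaultdict


def transpose(R):
    """Transpose a 3x3 matrix given as nested lists."""
    return [list(col) for col in zip(*R)]


def matmul(A, B):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def relative_pose(R_a, t_a, R_b, t_b):
    """Motion from frame a to frame b, expressed in frame a (T_a^{-1} T_b)."""
    R_at = transpose(R_a)
    R_rel = matmul(R_at, R_b)
    diff = [tb - ta for ta, tb in zip(t_a, t_b)]
    t_rel = [sum(r * d for r, d in zip(row, diff)) for row in R_at]
    return R_rel, t_rel


def euler_zyx(R):
    """(roll, pitch, yaw) for R = Rz(yaw) Ry(pitch) Rx(roll).

    Assumed convention; the clamp guards asin against rounding error.
    """
    roll = math.atan2(R[2][1], R[2][2])
    pitch = -math.asin(max(-1.0, min(1.0, R[2][0])))
    yaw = math.atan2(R[1][0], R[0][0])
    return roll, pitch, yaw


def sliding_clips(num_frames, nf, stride=1):
    """Start indices of every clip of nf consecutive frames."""
    return list(range(0, num_frames - nf + 1, stride))


def average_overlaps(clip_starts, clip_preds):
    """Average repeated estimates of the same frame transition.

    clip_preds[c][j] is the 6-vector predicted by clip c for the motion
    from frame (start + j) to frame (start + j + 1). Averaging Euler
    angles component-wise is only reasonable for small, non-wrapping angles.
    """
    buckets = defaultdict(list)
    for start, preds in zip(clip_starts, clip_preds):
        for j, pose in enumerate(preds):
            buckets[start + j].append(pose)
    return [
        [sum(vals) / len(vals) for vals in zip(*buckets[k])]
        for k in sorted(buckets)
    ]
```

With a stride of one frame, interior transitions are covered by up to $N_f - 1$ clips, so the averaging step acts as a simple smoothing of the per-clip predictions.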

Figure 1: Traditional pipeline for visual odometry.

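Experimental Results

Research Questions

RQ1: Can a Transformer-based video understanding model accurately regress 6-DoF camera poses from monocular video clips?
RQ2: How does divided space-time attention compare with joint space-time attention in accuracy and efficiency for monocular VO?
RQ3: How do clip length ($N_f$) and overlapping windows affect pose estimation and scale drift?
RQ4: How does TSformer-VO perform against baselines (ORB-SLAM2, DeepVO) on KITTI sequences?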

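Key Results

TSformer-VO achieves competitive performance on the KITTI odometry benchmark compared with geometry-based and end-to-end deep learning methods.
Among end-to-end approaches, the TSformer-VO variants outperform DeepVO on most metrics and sequences, highlighting the strength of Transformers for VO.
Divided space-time attention provides computational efficiency while maintaining accuracy compared with the joint attention variant.
Space-time attention visualizations show that the model focuses mainly on static scene regions, ignores moving objects, and favors blob-like regions over keypoints.
Inference time grows with clip length: TSformer-VO-1 ≈ 20.3 ms/clip, TSformer-VO-2 ≈ 28.8 ms/clip, and TSformer-VO-3 ≈ 37.9 ms/clip, which suggests that real-time use is feasible with optimization.
The model naturally handles the scale drift inherent to monocular VO and benefits from end-to-end learning in high-speed scenarios where classical feature-based methods struggle.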
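To put the reported inference times in perspective, the short calculation below converts milliseconds per clip into clips per second. Only the three timings come from the review; the dictionary and function names are illustrative, and treating clips per second as a frame rate would additionally assume one new clip per incoming frame.

```python
# Reported inference times from the review, in milliseconds per clip.
timings_ms = {
    "TSformer-VO-1": 20.3,
    "TSformer-VO-2": 28.8,
    "TSformer-VO-3": 37.9,
}


def clips_per_second(ms_per_clip):
    """Convert a per-clip latency in milliseconds to a throughput."""
    return 1000.0 / ms_per_clip


throughput = {name: clips_per_second(ms) for name, ms in timings_ms.items()}

for name, rate in throughput.items():
    print(f"{name}: {rate:.1f} clips/s")
```

These figures ignore image loading and pre-processing, so they are best read as an upper bound on end-to-end throughput.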

Figure 2: TSformer-VO pipeline. The input clips with $N_{f}$ frames are processed into $N$ patches. Each patch is embedded into tokens and is sent to the sequence of Transformer blocks. A special vector called class token (cls) gathers the information from all patches and passes through the final MLP
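As a rough illustration of the tokenization in Figure 2, the sketch below counts the patches per frame and the attention comparisons per patch under joint and divided space-time attention. The comparison counts ($N F + 1$ for joint, $N + F + 2$ for divided) follow the accounting in the TimeSformer paper. The frame size and patch size in the usage note are assumed example values, not figures taken from this review, and the function names are hypothetical.

```python
def patches_per_frame(height, width, patch):
    """Number of non-overlapping patch x patch tiles in one frame."""
    if height % patch or width % patch:
        raise ValueError("frame size must be divisible by the patch size")
    return (height // patch) * (width // patch)


def comparisons_per_patch(n_patches, n_frames):
    """Attention comparisons per patch: (joint, divided).

    Joint attention compares a patch with every patch in every frame
    plus the class token. Divided attention first attends over time at
    the same spatial location, then over space within the same frame.
    """
    joint = n_patches * n_frames + 1
    divided = n_patches + n_frames + 2
    return joint, divided
```

For example, with an assumed 192x640 frame and 16x16 patches, `patches_per_frame(192, 640, 16)` gives 480 patches per frame, and `comparisons_per_patch(480, 2)` gives 961 comparisons for joint attention versus 484 for divided attention. The gap widens as the clip gets longer.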

This review was generated by AI and checked by a human editor.