QUICK REVIEW

[Paper Review] Transformer-Based Model for Monocular Visual Odometry: A Video Understanding Approach

André O. Françani, Marcos R. O. A. Máximo|arXiv (Cornell University)|May 10, 2023

Advanced Vision and Imaging | Citations: 8

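One-Line Summary

This paper proposes TSformer-VO, an end-to-end Transformer-based architecture that treats monocular visual odometry as a video understanding task and estimates 6-DoF camera poses from clips. On KITTI, it achieves results competitive with geometry-based methods and other deep learning-based VO methods.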

ABSTRACT

Estimating the camera's pose given images from a single camera is a traditional task in mobile robots and autonomous vehicles. This problem is called monocular visual odometry and often relies on geometric approaches that require considerable engineering effort for a specific scenario. Deep learning methods have been shown to be generalizable after proper training and with a large amount of available data. Transformer-based architectures have dominated the state-of-the-art in natural language processing and computer vision tasks, such as image and video understanding. In this work, we deal with the monocular visual odometry as a video understanding task to estimate the 6 degrees of freedom of a camera's pose. We contribute by presenting the TSformer-VO model based on spatio-temporal self-attention mechanisms to extract features from clips and estimate the motions in an end-to-end manner. Our approach achieved competitive state-of-the-art performance compared with geometry-based and deep learning-based methods on the KITTI visual odometry dataset, outperforming the DeepVO implementation highly accepted in the visual odometry community. The code is publicly available at https://github.com/aofrancani/TSformer-VO.

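Motivation and Objectives

Motivate end-to-end monocular VO that leverages video understanding and Transformers instead of relying on handcrafted geometric modules.
Develop TSformer-VO, which extracts spatio-temporal features and regresses 6-DoF poses from image clips.
Show performance on KITTI that is competitive with geometry-based and deep learning-based VO methods.
Share the code and pretrained models to promote reproducibility and community adoption.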

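Proposed Method

Treat monocular VO as a video understanding task and estimate 6-DoF poses from a sequence of frames.
Use divided space-time self-attention, inspired by TimeSformer, to model spatio-temporal dependencies efficiently.
Apply an MSE regression loss to the relative poses predicted for each clip of Nf frames (yielding Nf-1 poses per clip).
Pre-process by converting absolute poses into relative transformations and encoding rotations as Euler angles.
Post-process by denormalizing the outputs, converting the Euler angles back to a rotation representation, and averaging the repeated estimates from overlapping clips.
Train end to end with supervised learning on KITTI sequences; at evaluation time, use a sliding window of Nf frames and 7-DoF alignment.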
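
The pre- and post-processing steps listed above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the ZYX Euler convention, the stride-one sliding window, and every function name (`relative_pose`, `matrix_to_euler_zyx`, `clip_targets`, `average_overlaps`) are assumptions made for the example, and normalization of the targets is omitted.

```python
import math


def transpose3(a):
    """Transpose a 3x3 matrix given as nested lists."""
    return [[a[j][i] for j in range(3)] for i in range(3)]


def matmul3(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def relative_pose(R_a, t_a, R_b, t_b):
    """Pose of frame b expressed in frame a: T_rel = T_a^{-1} T_b."""
    R_at = transpose3(R_a)  # inverse of a rotation is its transpose
    R_rel = matmul3(R_at, R_b)
    d = [t_b[i] - t_a[i] for i in range(3)]
    t_rel = [sum(R_at[i][k] * d[k] for k in range(3)) for i in range(3)]
    return R_rel, t_rel


def matrix_to_euler_zyx(R):
    """Euler angles (roll, pitch, yaw), ASSUMING R = Rz(yaw) Ry(pitch) Rx(roll)."""
    roll = math.atan2(R[2][1], R[2][2])
    pitch = -math.asin(max(-1.0, min(1.0, R[2][0])))  # clamp against round-off
    yaw = math.atan2(R[1][0], R[0][0])
    return roll, pitch, yaw


def clip_targets(poses):
    """Turn Nf absolute poses [(R, t), ...] into Nf-1 six-element targets."""
    targets = []
    for (R_a, t_a), (R_b, t_b) in zip(poses, poses[1:]):
        R_rel, t_rel = relative_pose(R_a, t_a, R_b, t_b)
        targets.append(list(matrix_to_euler_zyx(R_rel)) + t_rel)
    return targets


def average_overlaps(clip_predictions, n_frames):
    """Average repeated motion estimates from overlapping clips.

    ASSUMES clip k covers frames k .. k + n_frames - 1 (stride one), so its
    j-th motion refers to the frame pair (k + j, k + j + 1).
    """
    n_pairs = len(clip_predictions) + n_frames - 2
    sums = [[0.0] * 6 for _ in range(n_pairs)]
    counts = [0] * n_pairs
    for k, clip in enumerate(clip_predictions):
        for j, motion in enumerate(clip):
            counts[k + j] += 1
            for d in range(6):
                sums[k + j][d] += motion[d]
    return [[s / c for s in row] for row, c in zip(sums, counts)]
```

Averaging Euler angles component-wise, as done here, is only a reasonable shortcut when the rotations between consecutive frames are small; a different rotation-averaging scheme may be preferable otherwise.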

Figure 1: Traditional pipeline for visual odometry.

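Experimental Results

Research Questions

RQ1: Can a Transformer-based video understanding model accurately regress 6-DoF camera poses from monocular video clips?
RQ2: How does divided space-time attention compare with joint space-time attention in accuracy and efficiency for monocular VO?
RQ3: How do the clip length (Nf) and overlapping windows affect pose estimation and scale drift?
RQ4: How does TSformer-VO compare with existing VO baselines (ORB-SLAM2, DeepVO) on the KITTI sequences that have ground truth?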

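Key Findings

TSformer-VO achieves competitive performance on the KITTI odometry benchmark compared with geometry-based and end-to-end deep learning methods.
Among end-to-end approaches, the TSformer-VO variants outperform DeepVO on most metrics and sequences, indicating the benefit of Transformers for VO.
Divided space-time attention preserves accuracy while remaining computationally efficient compared with the joint attention variant.
Visualizations of the space-time attention show that the model ignores moving objects, focuses on static scene regions, and prefers blob-like regions over keypoints.
Inference time grows with clip length: TSformer-VO-1 ≈ 20.3 ms/clip, TSformer-VO-2 ≈ 28.8 ms/clip, and TSformer-VO-3 ≈ 37.9 ms/clip, which suggests that real-time use is feasible with optimization.
The model naturally handles the scale drift inherent to monocular VO, and end-to-end learning has an advantage in high-speed scenarios where classical feature-based methods struggle.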
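
The efficiency finding and the reported timings can be put into rough numbers. The sketch below counts query-key pairs per attention layer while ignoring the class token, and converts ms/clip into clips per second. The patch count of 480 is a hypothetical value chosen for illustration, and the function names are not taken from the paper.

```python
def joint_attention_pairs(n_frames, n_patches):
    """Joint space-time attention: every token attends to every token."""
    tokens = n_frames * n_patches
    return tokens * tokens


def divided_attention_pairs(n_frames, n_patches):
    """Divided attention: each token attends over time (n_frames tokens at
    the same patch location), then over space (n_patches tokens in its frame)."""
    tokens = n_frames * n_patches
    return tokens * (n_frames + n_patches)


def clips_per_second(ms_per_clip):
    """Convert a per-clip latency in milliseconds into clips per second."""
    return 1000.0 / ms_per_clip


# Hypothetical setting: 2-frame clips with 480 patches per frame.
joint = joint_attention_pairs(2, 480)
divided = divided_attention_pairs(2, 480)
print(f"joint={joint}, divided={divided}, ratio={joint / divided:.2f}")

# Latencies reported in the review for the three TSformer-VO variants.
for name, ms in [("TSformer-VO-1", 20.3), ("TSformer-VO-2", 28.8), ("TSformer-VO-3", 37.9)]:
    print(f"{name}: {clips_per_second(ms):.1f} clips/s")
```

Under these assumptions the gap between the two attention schemes widens as clips get longer, because the joint count grows quadratically in the number of frames while the divided count grows roughly linearly when frames are far fewer than patches.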

Figure 2: TSformer-VO pipeline. The input clips with $N_{f}$ frames are processed into $N$ patches. Each patch is embedded into tokens and is sent to the sequence of Transformer blocks. A special vector called class token (cls) gathers the information from all patches and passes through the final MLP

This review was generated by AI and checked by a human editor.