QUICK REVIEW

[Paper Review] KITTI-360: A Novel Dataset and Benchmarks for Urban Scene Understanding in 2D and 3D

Yiyi Liao, Jun Xie|arXiv (Cornell University)|Sep 28, 2021

Robotics and Sensor-Based Localization|Cited by 29

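One-line summary

tldr: KITTI-360 provides a georeferenced suburban driving dataset with dense 2D and 3D semantic and instance annotations, together with benchmarks for tasks such as novel view synthesis and semantic SLAM. It aims to foster collaboration across vision, graphics, and robotics.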

ABSTRACT

For the last few decades, several major subfields of artificial intelligence including computer vision, graphics, and robotics have progressed largely independently from each other. Recently, however, the community has realized that progress towards robust intelligent systems such as self-driving cars requires a concerted effort across the different fields. This motivated us to develop KITTI-360, successor of the popular KITTI dataset. KITTI-360 is a suburban driving dataset which comprises richer input modalities, comprehensive semantic instance annotations and accurate localization to facilitate research at the intersection of vision, graphics and robotics. For efficient annotation, we created a tool to label 3D scenes with bounding primitives and developed a model that transfers this information into the 2D image domain, resulting in over 150k images and 1B 3D points with coherent semantic instance annotations across 2D and 3D. Moreover, we established benchmarks and baselines for several tasks relevant to mobile perception, encompassing problems from computer vision, graphics, and robotics on the same dataset, e.g., semantic scene understanding, novel view synthesis and semantic SLAM. KITTI-360 will enable progress at the intersection of these research areas and thus contribute towards solving one of today's grand challenges: the development of fully autonomous self-driving systems.

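Motivation and objectives

Promote interdisciplinary progress at the intersection of vision, graphics, and robotics for autonomous driving.
Provide a georeferenced dataset that is richer than KITTI, with dense 2D/3D semantic and instance labels and multimodal sensing.
Develop an efficient 3D-to-2D label transfer that produces annotations consistent across views.
Establish benchmarks on the new dataset for semantic scene understanding, novel view synthesis, and semantic SLAM.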

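Proposed method

Introduces 3D annotation with bounding primitives labeled in 3D space, yielding consistent 2D per-pixel and 3D per-point labels.
Develops a WebGL-based annotation tool for labeling static and dynamic scene elements in 3D.
Transfers 3D labels to 2D with a non-local multi-field CRF that reasons jointly over 3D points and 2D pixels.
Incorporates learned priors: sparse 3D points are projected into the image to train a semantic segmentation network (PSPNet), and instance hypotheses are integrated.
Fuses stereo and laser scans across multiple frames to obtain dense 3D information, and generates virtual sky points so that the labeling is complete.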
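
The projection of sparse labeled 3D points into the image, mentioned in the method list above, can be illustrated with a small sketch. This is not the authors' implementation: it assumes an ideal pinhole camera, points already expressed in the camera frame (z pointing forward), and a simple nearest-point-wins rule per pixel. The function name `project_labels` and the parameters `fx`, `fy`, `cx`, `cy` are hypothetical names chosen for this illustration.

```python
import math
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]


def project_labels(
    points_cam: Sequence[Point],
    labels: Sequence[int],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    width: int,
    height: int,
) -> List[List[Optional[int]]]:
    """Project labeled 3D points into a sparse 2D label map.

    Assumptions (not taken from the paper): ideal pinhole camera, points
    given in the camera frame with z forward, no lens distortion.
    Pixels that receive no point stay None. If several points fall on the
    same pixel, the one with the smallest depth wins (a basic z-buffer).
    """
    label_map: List[List[Optional[int]]] = [[None] * width for _ in range(height)]
    depth_map = [[math.inf] * width for _ in range(height)]

    for (x, y, z), label in zip(points_cam, labels):
        if z <= 0.0:
            continue  # point is behind the camera
        u = math.floor(fx * x / z + cx)  # pixel column
        v = math.floor(fy * y / z + cy)  # pixel row
        if 0 <= u < width and 0 <= v < height and z < depth_map[v][u]:
            depth_map[v][u] = z
            label_map[v][u] = label
    return label_map
```

The resulting map is sparse: most pixels carry no label, which is why the review describes it as supervision for training a segmentation network rather than as a finished annotation. A real pipeline would also need the extrinsic transform from the laser frame to the camera frame and a more careful treatment of occlusion.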

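Experimental results

Research questions

RQ1 How can dense and consistent semantic and instance annotations in both 2D and 3D be obtained for outdoor urban/suburban scenes?
RQ2 Can CRF-based 3D-to-2D label transfer improve labeling consistency and accuracy beyond pure 2D or pure 3D approaches?
RQ3 What are effective benchmarks for evaluating semantic understanding, novel view synthesis, and semantic SLAM on a comprehensive, georeferenced urban dataset?
RQ4 Do 3D annotations enable temporally consistent instance labeling across video frames and 360° sensor data?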

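Key findings

The dataset contains over 300k images and 80k laser scans, with coherent semantic and instance annotations in 2D and 3D.
The WebGL-based 3D annotation tool supports labeling of static and dynamic elements, producing dense 2D/3D labels and instance IDs that stay consistent across frames.
3D-to-2D label transfer with a non-local multi-field CRF that includes learned unary/pairwise terms improves labeling over pure 2D methods and purely learning-based approaches.
Combining 3D annotations with their 2D projections enables new benchmarks for semantic scene understanding, novel view synthesis, and semantic SLAM.
The paper reports that annotation is time-efficient (about 3 hours per full batch, or roughly 0.75 minutes per image once the number of images is taken into account) and states that the online benchmarks, which are evaluated on held-out data, are challenging.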
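
The CRF with unary and pairwise terms mentioned in the findings above can be written schematically as an energy over pixel and 3D-point variables. The form below is a generic sketch, not the paper's exact model: the symbols $\mathcal{P}$ (pixels), $\mathcal{L}$ (3D points), the edge sets $\mathcal{E}$, and the potentials $\varphi$, $\psi$ are notation assumed for this illustration, and this review does not state the actual potentials.

```latex
E(\mathbf{y}) =
  \sum_{i \in \mathcal{P} \cup \mathcal{L}} \varphi_i(y_i)
  + \sum_{(i,j) \in \mathcal{E}_{\mathrm{2D}}} \psi^{\mathrm{2D}}_{ij}(y_i, y_j)
  + \sum_{(i,j) \in \mathcal{E}_{\mathrm{3D}}} \psi^{\mathrm{3D}}_{ij}(y_i, y_j)
  + \sum_{(i,j) \in \mathcal{E}_{\mathrm{2D\text{-}3D}}} \psi^{\mathrm{2D\text{-}3D}}_{ij}(y_i, y_j),
\qquad
\hat{\mathbf{y}} = \arg\min_{\mathbf{y}} E(\mathbf{y})
```

Here $y_i$ is the label of variable $i$. One reading of "multi-field" is that each $y_i$ carries both a semantic and an instance label, and "non-local" suggests that pairwise edges are not restricted to neighboring pixels. The cross terms between pixels and 3D points are what let evidence from the 3D annotation constrain the 2D labeling.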


This review was generated by AI and checked by a human editor.