QUICK REVIEW

[Paper Review] KITTI-360: A Novel Dataset and Benchmarks for Urban Scene Understanding in 2D and 3D

Yiyi Liao, Jun Xie|arXiv (Cornell University)|Sep 28, 2021

Robotics and Sensor-Based Localization | Cited by 29

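One-Sentence Summary

KITTI-360 is a georeferenced suburban driving dataset with dense 2D and 3D semantic and instance annotations, plus benchmarks for tasks such as novel view synthesis and semantic SLAM, bridging vision, graphics, and robotics.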

ABSTRACT

For the last few decades, several major subfields of artificial intelligence including computer vision, graphics, and robotics have progressed largely independently from each other. Recently, however, the community has realized that progress towards robust intelligent systems such as self-driving cars requires a concerted effort across the different fields. This motivated us to develop KITTI-360, successor of the popular KITTI dataset. KITTI-360 is a suburban driving dataset which comprises richer input modalities, comprehensive semantic instance annotations and accurate localization to facilitate research at the intersection of vision, graphics and robotics. For efficient annotation, we created a tool to label 3D scenes with bounding primitives and developed a model that transfers this information into the 2D image domain, resulting in over 150k images and 1B 3D points with coherent semantic instance annotations across 2D and 3D. Moreover, we established benchmarks and baselines for several tasks relevant to mobile perception, encompassing problems from computer vision, graphics, and robotics on the same dataset, e.g., semantic scene understanding, novel view synthesis and semantic SLAM. KITTI-360 will enable progress at the intersection of these research areas and thus contribute towards solving one of today's grand challenges: the development of fully autonomous self-driving systems.

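Research Motivation and Objectives

Drive interdisciplinary progress at the intersection of vision, graphics, and robotics for autonomous driving.
Provide a georeferenced dataset richer than KITTI, with dense 2D/3D semantic and instance labels and multimodal sensing.
Develop efficient 3D-to-2D label transfer that yields annotations consistent across views.
Establish benchmarks for semantic scene understanding, novel view synthesis, and semantic SLAM on the new dataset.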

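Proposed Method

Annotate scenes in 3D with bounding primitives so that 2D pixel-level labels and 3D point-level labels are consistent.
Develop a WebGL-based annotation tool for labeling static and dynamic scene elements in 3D.
Transfer 3D labels to 2D with a non-local multi-field CRF that performs joint inference over 3D points and 2D pixels.
Introduce learned priors by training a semantic segmentation network (PSPNet) on sparse 3D points projected into the image and by integrating instance hypotheses.
Fuse stereo and laser scans across multiple frames to obtain dense 3D information, and add virtual sky points so that the annotation is complete.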
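
The projection step mentioned above (labeled 3D points mapped into the image to give sparse 2D cues) can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes an ideal pinhole camera with hypothetical intrinsics `fx`, `fy`, `cx`, `cy`, points already expressed in camera coordinates, and a simple nearest-point rule for occlusion. The function name `project_labels` is made up for illustration. In the paper, such sparse cues are only one input to the CRF and the segmentation network, not the final labels.

```python
from typing import Dict, List, Tuple

# (x, y, z) in camera coordinates, with z pointing forward (assumption).
Point = Tuple[float, float, float]
Pixel = Tuple[int, int]


def project_labels(
    points: List[Point],
    labels: List[int],
    fx: float, fy: float, cx: float, cy: float,
    width: int, height: int,
) -> Dict[Pixel, int]:
    """Project labeled 3D points into an image with a pinhole model.

    Returns a sparse map {(u, v): label}. When several points land on the
    same pixel, the one closest to the camera wins (a basic z-buffer).
    """
    depth: Dict[Pixel, float] = {}
    sparse: Dict[Pixel, int] = {}
    for (x, y, z), label in zip(points, labels):
        if z <= 0.0:
            continue  # point lies behind the camera
        u = int(round(fx * x / z + cx))
        v = int(round(fy * y / z + cy))
        if not (0 <= u < width and 0 <= v < height):
            continue  # point projects outside the image
        if (u, v) not in depth or z < depth[(u, v)]:
            depth[(u, v)] = z
            sparse[(u, v)] = label
    return sparse


# Toy input: two points on the same ray, one off to the side,
# one behind the camera, and one far outside the field of view.
toy_points = [(0.0, 0.0, 10.0), (0.0, 0.0, 5.0), (1.0, 0.0, 10.0),
              (0.0, 0.0, -1.0), (100.0, 0.0, 1.0)]
toy_labels = [1, 2, 3, 4, 5]
sparse_labels = project_labels(toy_points, toy_labels,
                               fx=100.0, fy=100.0, cx=50.0, cy=40.0,
                               width=100, height=80)
print(sparse_labels)
```

Only a few pixels receive a label this way, which is why the method description pairs the projection with multi-frame fusion and CRF inference to reach dense annotations.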

Experimental Results

Research Questions

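RQ1: How can dense, coherent semantic and instance annotations be obtained across 2D and 3D in outdoor urban and suburban scenes?
RQ2: Does CRF-based 3D-to-2D label transfer outperform purely 2D or purely 3D approaches in consistency and accuracy?
RQ3: What are effective benchmarks for evaluating semantic understanding, novel view synthesis, and semantic SLAM on a comprehensive, georeferenced urban dataset?
RQ4: Can 3D annotation yield temporally consistent instance labels across video frames and 360° sensor data?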

Key Findings

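The dataset contains over 300k images and 80k laser scans; per the abstract, over 150k images and 1B 3D points carry coherent semantic and instance annotations across 2D and 3D.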
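A WebGL-based 3D annotation tool labels static and dynamic elements, yielding dense 2D/3D labels with instance IDs that stay consistent across frames.
3D-to-2D label transfer via a non-local multi-field CRF with learned unary and pairwise terms yields better annotations than purely 2D methods and purely learning-based methods.
Combining 3D annotations with their 2D projections enables new benchmarks, including semantic scene understanding, novel view synthesis, and semantic SLAM.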
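The paper reports efficient annotation (about 3 hours per full batch, or roughly 0.75 minutes when expressed per image), and the online benchmarks use held-out test data and are challenging.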


This review was generated by AI and checked by human editors.