QUICK REVIEW

[Paper Review] Simultaneous multi-view instance detection with learned geometric soft-constraints

Ahmed Nassar, Sébastien Lefèvre|arXiv (Cornell University)|Jul 25, 2019

Video Surveillance and Tracking Methods | References: 43 | Cited by: 28

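ONE-SENTENCE SUMMARY

This paper proposes an end-to-end deep learning method for simultaneous multi-view instance detection and re-identification in street-level panoramas, jointly learning geometric soft constraints and appearance features with noisy camera poses as weak supervision. The method markedly improves detection accuracy and geolocation performance, reaching a mean absolute error of 3.13 m on the Pasadena trees dataset and a re-identification mAP of 88% on the Mapillary dataset, outperforming single-view baselines.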

ABSTRACT

We propose to jointly learn multi-view geometry and warping between views of the same object instances for robust cross-view object detection. What makes multi-view object instance detection difficult are strong changes in viewpoint, lighting conditions, high similarity of neighbouring objects, and strong variability in scale. By turning object detection and instance re-identification in different views into a joint learning task, we are able to incorporate both image appearance and geometric soft constraints into a single, multi-view detection process that is learnable end-to-end. We validate our method on a new, large data set of street-level panoramas of urban objects and show superior performance compared to various baselines. Our contribution is threefold: a large-scale, publicly available data set for multi-view instance detection and re-identification; an annotation tool custom-tailored for multi-view instance detection; and a novel, holistic multi-view instance detection and re-identification method that jointly models geometry and appearance across views.

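MOTIVATION AND OBJECTIVES

To address the challenge of robust cross-view instance detection and re-identification in street-level panoramas with large viewpoint changes, lighting variation, and scale differences.
To jointly learn the multi-view geometry and the warping function between views of the same object instance, using noisy relative camera poses as weak supervision.
To build a large-scale, publicly available dataset and a custom annotation tool for multi-view instance detection and re-identification.
To improve object detection and geolocation accuracy by modelling the joint distribution of camera pose and object instance appearance across views.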

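PROPOSED METHOD

The method uses a multi-view detection framework that integrates a "Projection Net", which predicts the warping function between views from camera pose and object appearance.
A "Geo Regression Net" regresses the geographic coordinates of detected objects, supporting end-to-end training with geometric soft constraints.
Object detection, instance re-identification, and geolocation are optimised jointly through a unified loss function that combines a detection loss, a re-identification loss, and a regression loss.
The geometric soft constraints are learned implicitly, as the network attends to consistent pose-instance correspondences, which reduces ambiguity when matching similar objects.
For re-identification, the framework uses a Siamese-like structure that compares features across views through a learned similarity metric.
The system is trained end-to-end on a new dataset of geolocated street-level panoramas, with data augmentation used to simulate real-world distortions.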
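
As an illustration of how a detection loss, a re-identification loss, and a regression loss might be combined into one training objective, here is a minimal Python sketch. It is not the authors' implementation: this review does not say how the terms are weighted or which re-identification loss is used, so the weights `w_det`, `w_reid`, and `w_geo`, the function names, and the contrastive loss standing in for the re-identification term are all assumptions.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(emb_a, emb_b, same_instance, margin=1.0):
    """Contrastive loss on a pair of embeddings from two views.

    Assumed stand-in for the re-identification term: pairs showing the
    same instance are pulled together, pairs showing different instances
    are pushed at least `margin` apart.
    """
    d = euclidean_distance(emb_a, emb_b)
    if same_instance:
        return d ** 2
    return max(0.0, margin - d) ** 2


def joint_loss(det_loss, reid_loss, geo_loss,
               w_det=1.0, w_reid=1.0, w_geo=1.0):
    """Weighted sum of the three task losses (weights are hypothetical)."""
    return w_det * det_loss + w_reid * reid_loss + w_geo * geo_loss


if __name__ == "__main__":
    # Toy values only; real detection and regression losses would come
    # from the detector head and the coordinate regressor.
    reid = contrastive_loss([0.1, 0.9], [0.2, 0.8], same_instance=True)
    print(joint_loss(det_loss=0.7, reid_loss=reid, geo_loss=0.3))
```

With all weights left at 1.0 the objective is a plain sum of the three terms; how the terms are actually balanced is not covered here.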

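EXPERIMENTAL RESULTS

Research questions

RQ1: Does jointly learning geometry and appearance improve multi-view instance detection and re-identification in the challenging street-level panorama setting?
RQ2: Does using noisy relative camera poses as weak supervision improve detection and re-identification performance?
RQ3: To what extent does end-to-end learning of the warping function and geometric constraints reduce false positives and improve geolocation accuracy?
RQ4: Does the proposed method generalise to different data acquisition designs, such as forward-facing cameras with short baselines?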

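Key findings

On the Pasadena trees dataset, the method reaches 68.2% mAP for object detection and 73.1% mAP for instance re-identification, markedly outperforming single-view baselines.
On the Mapillary dataset, the method reaches 90.2% mAP for detection and 88.2% mAP for re-identification, indicating strong generalisation across data acquisition settings.
The geolocation mean absolute error (MAE) drops to 3.13 m on the Pasadena dataset and 4.36 m on the Mapillary dataset, compared with 77.41 m and 83.27 m respectively for the single-view projection approach.
Ablation experiments show that learning the joint distribution of camera pose and appearance substantially improves re-identification and helps distinguish instances that look alike.
The method copes with challenging conditions such as strong perspective changes, scale changes, and image artifacts from panorama stitching.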
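
The geolocation errors in this section are reported as a mean absolute error in metres. The sketch below shows one way such a figure could be computed from predicted and ground-truth positions, assuming latitude/longitude pairs in degrees and great-circle (haversine) distance. This review does not state how the paper measures distance, and the function names are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(a)))


def geolocation_mae(predicted, ground_truth):
    """Mean distance in metres between matched (lat, lon) pairs."""
    if not predicted or len(predicted) != len(ground_truth):
        raise ValueError("need two non-empty lists of equal length")
    errors = [haversine_m(*p, *g) for p, g in zip(predicted, ground_truth)]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    # Made-up coordinates for illustration, not data from the paper.
    preds = [(34.14780, -118.14450), (34.14800, -118.14500)]
    truth = [(34.14782, -118.14452), (34.14803, -118.14498)]
    print(f"MAE: {geolocation_mae(preds, truth):.2f} m")
```

Each predicted position must already be matched to its ground-truth instance before the error is averaged; the matching step itself is outside this sketch.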


This review was generated by AI and checked by human editors.