QUICK REVIEW

[Paper Review] 3D Object Tracking with Transformer

Yubo Cui, Zheng Fang|arXiv (Cornell University)|Oct 28, 2021

Video Surveillance and Tracking Methods | References: 24 | Cited by: 32

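One-Sentence Summary

LTTR introduces a transformer-based feature fusion framework for LiDAR-based 3D object tracking. By modeling intra- and inter-region relations and exchanging information across branches, it achieves state-of-the-art results on KITTI.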

ABSTRACT

Feature fusion and similarity computation are two core problems in 3D object tracking, especially for object tracking using sparse and disordered point clouds. Feature fusion could make similarity computing more efficient by including target object information. However, most existing LiDAR-based approaches directly use the extracted point cloud feature to compute similarity while ignoring the attention changes of object regions during tracking. In this paper, we propose a feature fusion network based on transformer architecture. Benefiting from the self-attention mechanism, the transformer encoder captures the inter- and intra- relations among different regions of the point cloud. By using cross-attention, the transformer decoder fuses features and includes more target cues into the current point cloud feature to compute the region attentions, which makes the similarity computing more efficient. Based on this feature fusion network, we propose an end-to-end point cloud object tracking framework, a simple yet effective method for 3D object tracking using point clouds. Comprehensive experimental results on the KITTI dataset show that our method achieves new state-of-the-art performance. Code is available at: https://github.com/3bobo/lttr.

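Research Motivation and Goals

Improve feature fusion for 3D tracking in sparse, unordered LiDAR point clouds.
Use self-attention to capture inter- and intra-region relations within a point cloud.
Fuse template and search features through cross-attention to strengthen target cues.
Develop an end-to-end tracking framework built on a simple, efficient transformer-based design.
Demonstrate state-of-the-art performance on KITTI and provide ablation studies that validate the design choices.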

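Proposed Method

Divide the point cloud into non-overlapping local regions and apply a transformer encoder to capture intra- and inter-region relations.
Use a transformer decoder to propagate template features into the search features through cross-attention, fusing them at the region level.
Compute region attention weights and recover dense features through a guided merging process so that the regression head can use them.
Adopt a center-based regression head that predicts a heatmap, offset, z position, and orientation to localize the 3D box.
Train end to end with a loss that combines a focal loss on the heatmap with L1 losses on the regression targets.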
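The cross-attention fusion step listed above can be sketched as follows. This is a minimal single-head NumPy illustration, not the authors' implementation: the function name `cross_attention`, the projection matrices `w_q`, `w_k`, `w_v`, the residual connection, and the region counts and channel size are all assumptions made for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(search, template, w_q, w_k, w_v):
    """Fuse template cues into search-region features (hypothetical helper).

    search:   (Ns, C) features, one row per search region
    template: (Nt, C) features, one row per template region
    Returns the fused search features (Ns, C) and the attention map (Ns, Nt).
    """
    q = search @ w_q        # queries come from the search branch
    k = template @ w_k      # keys come from the template branch
    v = template @ w_v      # values come from the template branch
    scores = q @ k.T / np.sqrt(q.shape[-1])   # scaled dot-product similarity
    weights = softmax(scores, axis=-1)        # each search region attends over template regions
    fused = search + weights @ v              # residual connection (assumed)
    return fused, weights


# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
C = 16
search = rng.standard_normal((64, C))
template = rng.standard_normal((32, C))
w_q, w_k, w_v = (rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(3))

fused, weights = cross_attention(search, template, w_q, w_k, w_v)
print(fused.shape, weights.shape)
```

Each row of `weights` sums to 1, so every search region receives a convex combination of template values. A multi-head variant would split the channel dimension into several groups and run the same computation per group.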

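Experimental Results

Research Questions

RQ1: Does transformer-based feature fusion improve region attention and similarity computation for LiDAR-based 3D tracking?
RQ2: Does cross-attention between template and search features improve tracking accuracy and robustness in sparse point clouds?
RQ3: Can the end-to-end LTTR framework achieve state-of-the-art performance on KITTI across multiple object classes?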

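Key Findings

LTTR achieves state-of-the-art results on KITTI; on the Car category in particular, Success is 65.0 and Precision is 77.1.
Both the encoder and the decoder components bring significant gains over a non-transformer baseline.
Increasing the number of transformer heads up to 8 improves performance; too many heads may reduce the accuracy of orientation prediction.
The framework remains feasible for real-time use and shows notable improvement when tracking small objects (pedestrians, cyclists).
Ablation results show that region-level interaction and cross-branch fusion are key to the accuracy gains.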
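To make the Success and Precision numbers above easier to interpret, the sketch below computes both metrics the way they are commonly defined for LiDAR single-object tracking: Success as the normalized area under the curve of per-frame 3D IoU over thresholds from 0 to 1, and Precision as the normalized area under the curve of center distance error over thresholds from 0 to 2 m. The review itself does not spell out these definitions, so the 2 m cap, the number of thresholds, and the helper names are assumptions.

```python
import numpy as np


def normalized_auc(thresholds, rates):
    """Trapezoidal area under the curve, rescaled to the range 0..100."""
    widths = np.diff(thresholds)
    heights = (rates[1:] + rates[:-1]) / 2.0
    return 100.0 * np.sum(widths * heights) / (thresholds[-1] - thresholds[0])


def success(ious, n_thresholds=21):
    """Fraction of frames with IoU >= t, integrated over t in [0, 1]."""
    ious = np.asarray(ious, dtype=float)
    thresholds = np.linspace(0.0, 1.0, n_thresholds)
    rates = np.array([(ious >= t).mean() for t in thresholds])
    return normalized_auc(thresholds, rates)


def precision(center_errors, max_dist=2.0, n_thresholds=21):
    """Fraction of frames with center error <= t, integrated over t in [0, max_dist] meters."""
    errors = np.asarray(center_errors, dtype=float)
    thresholds = np.linspace(0.0, max_dist, n_thresholds)
    rates = np.array([(errors <= t).mean() for t in thresholds])
    return normalized_auc(thresholds, rates)


# Hypothetical per-frame values for a short sequence (not data from the paper).
frame_ious = [0.82, 0.75, 0.40, 0.00, 0.66]
frame_center_errors = [0.10, 0.25, 0.90, 3.50, 0.30]
print(success(frame_ious), precision(frame_center_errors))
```

Under these definitions a tracker that matches the ground-truth box in every frame scores 100 on both metrics, and a frame whose center error exceeds the distance cap contributes nothing to Precision.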

This review was generated by AI and reviewed by human editors.