QUICK REVIEW

[Paper Review] RT-RMOT: A Dataset and Framework for RGB-Thermal Referring Multi-Object Tracking

Yanqiu Yu, Zhifan Jin | arXiv (Cornell University) | Feb 25, 2026

Multimodal Machine Learning Applications | Cited by 0

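One-Sentence Summary

This paper introduces RT-RMOT, an RGB-Thermal referring multi-object tracking task, builds the RefRT dataset for it, and presents RTrack, a multimodal framework that uses a reinforcement-learning-enhanced multimodal large language model (MLLM) to achieve state-of-the-art performance on RGB-T RMOT.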

ABSTRACT

Referring Multi-Object Tracking has attracted increasing attention due to its human-friendly interactive characteristics, yet it exhibits limitations in low-visibility conditions, such as nighttime, smoke, and other challenging scenarios. To overcome this limitation, we propose a new RGB-Thermal RMOT task, named RT-RMOT, which aims to fuse RGB appearance features with the illumination robustness of the thermal modality to enable all-day referring multi-object tracking. To promote research on RT-RMOT, we construct the first Referring Multi-Object Tracking dataset under RGB-Thermal modality, named RefRT. It contains 388 language descriptions, 1,250 tracked targets, and 166,147 Language-RGB-Thermal (L-RGB-T) triplets. Furthermore, we propose RTrack, a framework built upon a multimodal large language model (MLLM) that integrates RGB, thermal, and textual features. Since the initial framework still leaves room for improvement, we introduce a Group Sequence Policy Optimization (GSPO) strategy to further exploit the model's potential. To alleviate training instability during RL fine-tuning, we introduce a Clipped Advantage Scaling (CAS) strategy to suppress gradient explosion. In addition, we design Structured Output Reward and Comprehensive Detection Reward to balance exploration and exploitation, thereby improving the completeness and accuracy of target perception. Extensive experiments on the RefRT dataset demonstrate the effectiveness of the proposed RTrack framework.

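Research Motivation and Objectives

Advance robust referring multi-object tracking under low-visibility conditions such as nighttime and smoke by fusing RGB, thermal, and language cues.
Create RefRT, the first RGB-T RMOT dataset with pixel-level RGB-thermal alignment and language annotations.
Develop RTrack, a multimodal learning framework that uses a multimodal large language model for joint RGB-thermal-language perception and tracking.
Introduce optimization strategies (GSPO, CAS) and reward designs that stabilize reinforcement-learning fine-tuning for RT-RMOT and balance exploration against exploitation.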

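Proposed Method

Proposes RTrack, a three-module framework: a large-model perception module that feeds aligned RGB and thermal inputs plus a language description to a multimodal large language model for cross-modal grounding; a trajectory prediction module that uses a Kalman filter as a motion prior; and an identity association module that maintains identities through IoU-based Hungarian matching.
Strengthens RL fine-tuning with Group Sequence Policy Optimization (GSPO) to optimize sequence-level outputs, Clipped Advantage Scaling (CAS) to suppress gradient explosion, and a rule-based reward system (Structured Output Reward and Comprehensive Detection Reward) that balances output structure, length, and detection quality.
Builds RefRT on top of LasHeR and VTUAV; GPT-assisted attribute generation followed by manual verification yields 388 language descriptions, 1,250 targets, 72 scenes, and 166,147 RGB-thermal-language triplets.
Evaluates on RefRT with RMOT-style metrics (HOTA, DetA, AssA, DetRe, DetPr, AssRe, AssPr, LocA) and shows state-of-the-art performance for RTrack with RGB-T input.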
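
The identity association step described above can be sketched as follows. This is a minimal illustration, not RTrack's implementation: the box format (x1, y1, x2, y2), the function names `iou` and `match`, and the 0.3 threshold are assumptions, and an exhaustive search over permutations stands in for the Hungarian algorithm (it returns an optimal one-to-one assignment, but only scales to a handful of boxes).

```python
from itertools import permutations


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match(tracks, detections, iou_threshold=0.3):
    """Return (track_idx, det_idx) pairs that maximize total IoU, one-to-one.

    Exhaustive search is used instead of the Hungarian algorithm (assumption
    made to keep the sketch dependency-free). Pairs whose IoU falls below
    iou_threshold (a hypothetical value) are dropped after the assignment.
    """
    if not tracks or not detections:
        return []
    scores = [[iou(t, d) for d in detections] for t in tracks]
    n, m = len(tracks), len(detections)
    if n <= m:
        # Every track gets a distinct detection.
        candidates = ([(i, p[i]) for i in range(n)]
                      for p in permutations(range(m), n))
    else:
        # Every detection gets a distinct track.
        candidates = ([(p[j], j) for j in range(m)]
                      for p in permutations(range(n), m))
    best_pairs, best_total = [], -1.0
    for pairs in candidates:
        total = sum(scores[i][j] for i, j in pairs)
        if total > best_total:
            best_pairs, best_total = pairs, total
    return [(i, j) for i, j in best_pairs if scores[i][j] >= iou_threshold]


# Two predicted track boxes and three detections in the current frame.
tracks = [(0, 0, 10, 10), (20, 20, 30, 30)]
detections = [(21, 21, 31, 31), (1, 1, 11, 11), (100, 100, 110, 110)]
print(match(tracks, detections))  # [(0, 1), (1, 0)]
```

In a tracker of this kind, the `tracks` boxes would typically be the Kalman-filter predictions for the current frame, and detections left unmatched would start new identities.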

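Experimental Results

Research Questions

RQ1: Can fusing RGB-Thermal data with language guidance enable robust all-day RMOT in low-visibility scenes?
RQ2: How does MLLM-based perception perform on RT-RMOT when combined with a Kalman-filter-assisted trajectory model and IoU-based identity association?
RQ3: Do the RL fine-tuning strategies (GSPO and CAS) and the structured rewards improve cross-modal tracking performance and stability?
RQ4: In the RT-RMOT setting, how do RGB and RGB-T inputs each affect RMOT performance?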

Key Findings

Modality	Method	Venue	HOTA	DetA	AssA	DetRe	DetPr	AssRe	AssPr	LocA
RGB	TransRMOT	CVPR 2023	8.69	2.57	29.96	3.01	14.46	30.73	85.49	79.63
RGB	TempRMOT	arXiv 2024	8.19	1.86	36.23	2.04	16.68	39.28	75.39	77.48
RGB	CRTracker	AAAI 2025	9.30	2.37	37.01	3.81	5.83	40.10	67.48	73.25
RGB	YOLOX+ByteTrack+iKUN	CVPR 2024	2.32	0.29	19.86	0.29	12.71	21.18	61.45	69.70
RGB	Qwen2.5-VL-3B	arXiv 2025	2.09	0.93	5.28	0.97	17.14	5.40	87.46	76.69
RGB-T	DeformCAT+SORT+iKUN	IEEE TMM	2.03	0.41	11.25	0.77	0.87	12.07	47.65	62.61
RGB-T	Unismot+iKUN	PR 2025	1.95	0.29	14.34	0.31	3.98	15.41	65.48	70.86
RGB-T	PFTrack+iKUN	PR 2025	8.55	1.66	45.92	2.40	5.05	49.15	73.96	76.31
RGB-T	MCTrack+iKUN	TCSVT 2025	4.71	1.22	18.91	1.51	5.73	19.83	71.17	68.95
RGB-T	Qwen2.5-VL-3B (baseline)	arXiv 2025	4.98	2.59	10.19	3.05	14.29	10.65	83.40	75.52
RGB-T	RTrack	Ours	15.53	12.39	20.79	20.15	22.78	22.02	81.99	75.53
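
As a reading aid for the table above: at a single localization threshold, HOTA is the geometric mean of DetA and AssA, and the reported score averages over thresholds. The square root of the product of the reported DetA and AssA therefore only approximates the reported HOTA. The short sketch below checks this on three RGB-T rows and computes RTrack's HOTA gain over the Qwen2.5-VL-3B baseline; the values are copied from the table, and the helper name `geometric_mean_estimate` is hypothetical.

```python
import math

# (HOTA, DetA, AssA) copied from the RGB-T rows of the table above.
rows = {
    "PFTrack+iKUN": (8.55, 1.66, 45.92),
    "Qwen2.5-VL-3B (baseline)": (4.98, 2.59, 10.19),
    "RTrack": (15.53, 12.39, 20.79),
}


def geometric_mean_estimate(det_a, ass_a):
    """Approximate HOTA from threshold-averaged DetA and AssA."""
    return math.sqrt(det_a * ass_a)


for name, (hota, det_a, ass_a) in rows.items():
    estimate = geometric_mean_estimate(det_a, ass_a)
    print(f"{name}: reported HOTA={hota:.2f}, sqrt(DetA*AssA)={estimate:.2f}")

# Absolute HOTA gain of RTrack over the RGB-T baseline, in points.
gain = rows["RTrack"][0] - rows["Qwen2.5-VL-3B (baseline)"][0]
print(f"HOTA gain over baseline: {gain:.2f}")
```

The estimate lands within about half a point of the reported HOTA for each of these rows, which is a quick way to spot transcription errors in a table of this kind.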

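RTrack achieves state-of-the-art performance on RefRT, with clear gains over both RGB and RGB-T baselines on several metrics (e.g., HOTA, DetA, DetRe).
With RGB input, RL-fine-tuned RTrack raises HOTA by 10.4 percentage points over the untrained version; with RGB-T input, several metrics improve by more than 10 percentage points (in the table above, HOTA rises from 4.98 to 15.53 and DetRe from 3.05 to 20.15 relative to the Qwen2.5-VL-3B baseline).
RGB-T input outperforms the RGB-only baseline (for Qwen2.5-VL-3B, HOTA 4.98 vs. 2.09 in the table above), supporting the value of thermal contour information for all-day RMOT.
Ablations show that Qwen2.5-VL-3B is a strong baseline among the backbones tested for multimodal fusion; RTrack with RGB-T input and RL fine-tuning consistently outperforms the RGB baselines.
GSPO combined with CAS, together with the structured and comprehensive rewards, contributes substantially to stability, output quality, and multi-object detection accuracy.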
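
To make the RL fine-tuning terms concrete, here is a minimal sketch of a GSPO-style objective for one group of sampled responses: rewards are normalized within the group, and each response gets a length-normalized, sequence-level importance ratio that is clipped. The source does not give the CAS formula, so the advantage clamp below is purely an assumption for illustration, as are the parameter names and default values (`ratio_eps`, `adv_clip`).

```python
import math
from statistics import mean, pstdev


def clip(x, lo, hi):
    return max(lo, min(hi, x))


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled responses."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def sequence_ratio(new_logprobs, old_logprobs):
    """Sequence-level importance ratio, normalized by response length."""
    diffs = [n - o for n, o in zip(new_logprobs, old_logprobs)]
    return math.exp(sum(diffs) / len(diffs))


def gspo_objective(new_lps, old_lps, rewards, ratio_eps=0.2, adv_clip=2.0):
    """Mean clipped surrogate over a group (to be maximized).

    new_lps / old_lps: per-response lists of token log-probabilities under
    the current and the sampling policy. The clamp on the advantages is a
    hypothetical stand-in for CAS, not the paper's formula.
    """
    advantages = [clip(a, -adv_clip, adv_clip)
                  for a in group_advantages(rewards)]
    terms = []
    for new_lp, old_lp, adv in zip(new_lps, old_lps, advantages):
        s = sequence_ratio(new_lp, old_lp)
        clipped = clip(s, 1.0 - ratio_eps, 1.0 + ratio_eps)
        terms.append(min(s * adv, clipped * adv))
    return sum(terms) / len(terms)


# Two sampled responses; the first became more likely and earned the reward.
old = [[-1.5, -1.5], [-1.0, -2.0]]
new = [[-0.5, -0.5], [-1.0, -2.0]]
print(gspo_objective(new, old, rewards=[1.0, 0.0]))
```

In this example the first response's ratio (about 2.72) is clipped to 1.2, so a single large policy shift cannot dominate the update; bounding the advantages as well limits how much any one sample can scale the gradient.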

This review was generated by AI and reviewed by a human editor.