[Paper Explained] Think with Grounding: Curriculum Reinforced Reasoning with Video Grounding for Long Video Understanding
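Introduces Video-TwG, a think-with-grounding framework that performs on-demand grounding during multi-turn reasoning to zoom into query-relevant video clips. It is trained with a two-stage curriculum and TwG-GRPO rewards, and achieves strong results on long video understanding (LVU) without heavy supervision.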
Long video understanding is challenging due to rich and complicated multimodal clues in long temporal range. Current methods adopt reasoning to improve the model's ability to analyze complex video clues in long videos via text-form reasoning. However, the existing literature suffers from the fact that the text-only reasoning under fixed video context may exacerbate hallucinations since detailed crucial clues are often ignored under limited video context length due to the temporal redundancy of long videos. To address this gap, we propose Video-TwG, a curriculum reinforced framework that employs a novel Think-with-Grounding paradigm, enabling video LLMs to actively decide when to perform on-demand grounding during interleaved text-video reasoning, selectively zooming into question-relevant clips only when necessary. Video-TwG can be trained end-to-end in a straightforward manner, without relying on complex auxiliary modules or heavily annotated reasoning traces. In detail, we design a Two-stage Reinforced Curriculum Strategy, where the model first learns think-with-grounding behavior on a small short-video GQA dataset with grounding labels, and then scales to diverse general QA data with videos of diverse domains to encourage generalization. Further, to handle complex think-with-grounding reasoning for various kinds of data, we propose TwG-GRPO algorithm which features the fine-grained grounding reward, self-confirmed pseudo reward and accuracy-gated mechanism. Finally, we propose to construct a new TwG-51K dataset that facilitates training. Experiments on Video-MME, LongVideoBench, and MLVU show that Video-TwG consistently outperforms strong LVU baselines. Further ablation validates the necessity of our Two-stage Reinforced Curriculum Strategy and shows our TwG-GRPO better leverages diverse unlabeled data to improve grounding quality and reduce redundant groundings without sacrificing QA performance.
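Motivation and Objectives
- Advance robust long video understanding by addressing the shortcomings of text-only reasoning under a fixed video context.
- Propose a think-with-grounding paradigm in which grounding actions selectively zoom into relevant video clips during reasoning.
- Develop a two-stage reinforced curriculum strategy that takes the model from short grounded data to diverse long-video QA scenarios.
- Introduce TwG-GRPO, a grounding-aware reinforcement learning algorithm with a fine-grained grounding reward, a pseudo reward, and an accuracy-gated mechanism.
- Create the TwG-51K dataset, which combines grounding-annotated and unlabeled video QA data to support training.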
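Proposed Method
- Define a multi-turn think-with-grounding process in which the model outputs, turn by turn, reasoning steps, grounding actions (start/end frames), and an answer.
- Implement a two-stage curriculum: stage 1 trains on short-video GQA data with grounding labels; stage 2 scales to unlabeled, diverse video QA data.
- Propose TwG-GRPO, a GRPO-based RL algorithm with trajectory-level rewards, including a fine-grained grounding reward, a self-confirmed pseudo reward, and an accuracy-gated mechanism that balances grounding against QA accuracy.
- Use a multi-granularity video representation: initial reasoning takes coarse-grained input, grounded segments use fine-grained clips, and grounded frames are mapped back to the initial video frames.
- Construct the TwG-51K dataset (50,744 multiple-choice samples, 8,195 of them with grounding annotations) to support training and generalization.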
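The multi-granularity representation in the list above can be illustrated with a minimal sketch. It assumes that the coarse input is a uniform sample of the original frames and that a grounding action names a start and end position within that coarse sample; both assumptions, along with the function names `sample_frame_indices` and `zoom_into_clip`, are hypothetical and not taken from the paper.

```python
from typing import List


def sample_frame_indices(total_frames: int, num_samples: int) -> List[int]:
    """Uniformly sample frame indices from [0, total_frames) (coarse view)."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    num_samples = min(num_samples, total_frames)
    step = total_frames / num_samples
    # Take the middle frame of each of the num_samples equal-width bins.
    return [int(step * i + step / 2) for i in range(num_samples)]


def zoom_into_clip(coarse_indices: List[int], start_pos: int, end_pos: int,
                   num_fine: int) -> List[int]:
    """Map a grounding action back to original frames and resample densely.

    start_pos / end_pos are positions in the coarse sample (inclusive).
    This indexing convention is an assumption made for illustration.
    """
    start_frame = coarse_indices[start_pos]
    end_frame = coarse_indices[end_pos]
    span = end_frame - start_frame + 1
    n = min(num_fine, span)
    step = span / n
    # Fine-grained indices stay inside [start_frame, end_frame].
    return [start_frame + int(step * i) for i in range(n)]


# Coarse pass over a 100-frame video, then zoom into coarse positions 1..2.
coarse = sample_frame_indices(100, 4)
fine = zoom_into_clip(coarse, 1, 2, num_fine=8)
```

In this sketch the fine-grained clip is expressed in original frame indices, so the zoomed-in frames remain aligned with the timeline the coarse pass used.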
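Experimental Results
Research Questions
- RQ1: How does dynamic, on-demand grounding improve long video understanding compared with text-only reasoning under a fixed video context?
- RQ2: Does the two-stage reinforced curriculum improve the learning stability and generalization of think-with-grounding in LVU?
- RQ3: Can TwG-GRPO effectively use both grounding-annotated data and unlabeled QA data to improve grounding quality and QA accuracy?
- RQ4: How do the multi-granularity video representation and grounding actions affect long-video QA performance?
- RQ5: How does Video-TwG compare with strong baselines on benchmarks such as Video-MME, LongVideoBench, and MLVU?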
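Key Findings
- Video-TwG consistently outperforms strong baselines on Video-MME, LongVideoBench, and MLVU.
- With LR input, Video-TwG(LR) improves overall accuracy on Video-MME by 7.0 points and gains 5.8–7.1 points on the other metrics; with HR input, the gains are 2.5–5.0 points depending on the benchmark.
- Ablation studies validate the necessity of the Two-stage Reinforced Curriculum Strategy and show that TwG-GRPO learns grounding from unlabeled data while preserving QA performance.
- The grounding rewards (a soft reward and an IoU-based hard reward), together with the self-confirmed pseudo reward, effectively guide grounding actions and reduce unnecessary grounding without hurting the answers.
- Compared with Qwen2.5-VL-7B, Video-TwG achieves significant gains on long-video tasks, especially in the low-resource setting, which suggests that the training paradigm is the main source of improvement.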
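The reward ingredients named in the findings above can be sketched as follows. This is only an illustration under stated assumptions: the soft reward is taken to be the temporal IoU itself, the hard reward a 0.5 IoU threshold, the two are mixed with equal weights, and the accuracy gate is modeled as "grounding reward counts only when the answer is correct". None of these choices comes from the paper, and the self-confirmed pseudo reward for unlabeled data is not modeled. The group-relative normalization at the end is the standard GRPO step of standardizing rewards within a group of rollouts.

```python
from statistics import mean, pstdev
from typing import List, Optional, Tuple

Segment = Tuple[float, float]  # (start, end) in frames or seconds


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection over union of two temporal segments."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def grounding_reward(pred: Segment, gt: Segment,
                     hard_threshold: float = 0.5) -> float:
    """Mix a soft (raw IoU) and a hard (thresholded IoU) reward.

    The threshold and the 50/50 mix are assumed values for illustration.
    """
    iou = temporal_iou(pred, gt)
    hard = 1.0 if iou >= hard_threshold else 0.0
    return 0.5 * iou + 0.5 * hard


def trajectory_reward(answer_correct: bool, pred: Optional[Segment],
                      gt: Optional[Segment], w_ground: float = 0.5) -> float:
    """Trajectory-level reward with an assumed accuracy gate."""
    acc = 1.0 if answer_correct else 0.0
    gated = answer_correct and pred is not None and gt is not None
    ground = grounding_reward(pred, gt) if gated else 0.0
    return acc + w_ground * ground


def group_relative_advantages(rewards: List[float],
                              eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: standardize rewards within one rollout group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three rollouts for one question: perfect, partly overlapping, wrong answer.
rewards = [
    trajectory_reward(True, (0.0, 10.0), (0.0, 10.0)),
    trajectory_reward(True, (5.0, 15.0), (0.0, 10.0)),
    trajectory_reward(False, (0.0, 10.0), (0.0, 10.0)),
]
advantages = group_relative_advantages(rewards)
```

Gating the grounding term on answer correctness is one simple way to keep a rollout from earning reward for a well-localized clip that still leads to a wrong answer; the paper's actual mechanism may differ.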
This explainer was generated by AI and reviewed by human editors.