QUICK REVIEW

[Paper Digest] StreamReady: Learning What to Answer and When in Long Streaming Videos

Shehreen Azad, Vibhav Vineet|arXiv (Cornell University)|Mar 9, 2026

Multimodal Machine Learning Applications | Cited by 0

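One-Sentence Summary

StreamReady introduces the Answer Readiness Score (ARS) to jointly optimize correctness and timing in streaming video question answering, and proposes a readiness-based framework that combines memory with a lightweight readiness mechanism so that the model answers only once the evidence is sufficient. It also introduces ProReady-QA, a proactive multi-turn question-answering benchmark for long streaming videos.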

ABSTRACT

Streaming video understanding often involves time-sensitive scenarios where models need to answer exactly when the supporting visual evidence appears: answering before the evidence reflects speculation, answering after it has passed reduces real-time utility. To capture this behavior, we introduce a readiness-aware formulation of streaming video understanding with the Answer Readiness Score (ARS), a timing-aware objective with asymmetric early and late penalties. When combined with correctness, ARS defines an effective accuracy that measures not just whether a model is right, but whether it answers at the appropriate moment. Building on this formulation, we introduce StreamReady, a framework to unify temporal reasoning with on-time answering through a lightweight readiness mechanism that decides if sufficient evidence has been observed before responding. To evaluate this capability, we further introduce ProReady-QA, a benchmark with annotated answer evidence windows and proactive multi-turn questions across local and global contexts. StreamReady achieves superior performance on ProReady-QA, and consistently outperforms prior methods across eight additional streaming and offline long-video benchmarks, demonstrating robust and broadly generalizable video understanding capability.

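Motivation and Objectives

Formalize readiness-aware streaming understanding, focusing on when to answer and not only on what to answer.
Define the Answer Readiness Score (ARS), which applies asymmetric penalties to answers that come too early or too late.
Develop a lightweight readiness mechanism that triggers an answer only after sufficient evidence has appeared.
Build StreamReady, which combines temporal reasoning with a readiness signal, implemented with memory-augmented question answering.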
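
To make the asymmetric-penalty idea concrete, here is a minimal sketch of a timing score and of an "effective accuracy" that combines it with correctness. This digest does not give the paper's actual ARS formula, so the exponential decay shape, the parameter names (early_rate, late_rate), the choice to penalize early answers more steeply than late ones, and the product with correctness are all illustrative assumptions, not the authors' definition.

```python
import math

def readiness_score(t_answer, win_start, win_end, early_rate=1.0, late_rate=0.25):
    """Hypothetical timing score in (0, 1].

    Returns 1.0 when the answer time falls inside the annotated evidence
    window [win_start, win_end]; otherwise decays exponentially with the
    distance to the window. The two rates differ, which makes the penalty
    asymmetric (assumed here: early answers are penalized more steeply).
    """
    if t_answer < win_start:
        return math.exp(-early_rate * (win_start - t_answer))
    if t_answer > win_end:
        return math.exp(-late_rate * (t_answer - win_end))
    return 1.0

def effective_accuracy(records, **rate_kwargs):
    """Average of correctness weighted by the timing score.

    records: iterable of (is_correct, t_answer, win_start, win_end).
    A wrong answer contributes 0 regardless of timing (an assumption).
    """
    records = list(records)
    if not records:
        return 0.0
    total = sum(
        float(is_correct) * readiness_score(t, start, end, **rate_kwargs)
        for is_correct, t, start, end in records
    )
    return total / len(records)
```

Under these assumed rates, an answer given one second before the window scores lower than one given one second after it, and a correct but badly timed answer counts for less than a correct, on-time one.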

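Proposed Method

Introduce ARS, an asymmetric, timing-aware evaluation metric that combines early and late penalties with accuracy.
Propose StreamReady, which uses a hierarchical Visual Memory Tree and a Contextual Memory Bank to store and retrieve visual and semantic history at multiple granularities.
Use a dual-branch Q-Former to perform query-aware short-term and long-term reasoning over memory slots.
Combine a learnable <RDY> token with a Readiness Head to gate answer generation and enforce temporal correctness.
Develop ProReady-QA, which provides proactive multi-turn questions with annotated evidence windows for evaluating readiness in streaming tasks.
Learn the readiness signal through pseudo-supervision derived from memory representations, without requiring ground-truth evidence timestamps for training.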
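
The gating behavior described above can be sketched as a simple streaming loop. This is only a toy illustration: the real system gates generation with a learnable <RDY> token and a Readiness Head over Q-Former outputs, whereas here both the readiness estimate and the answerer are plain callables. The names answer_when_ready, readiness_fn, answer_fn, and the threshold value are hypothetical and do not come from the paper.

```python
from typing import Any, Callable, Iterable, List, Optional, Tuple

def answer_when_ready(
    frames: Iterable[Any],
    readiness_fn: Callable[[List[Any]], float],
    answer_fn: Callable[[List[Any]], str],
    threshold: float = 0.5,
) -> Optional[Tuple[int, str]]:
    """Consume frames one at a time and answer at the first ready step.

    A flat list stands in for the memory (the paper uses a hierarchical
    memory instead). After each new frame, readiness_fn scores the memory;
    the answer is produced only once that score reaches the threshold.
    Returns (frame_index, answer), or None if the stream ends first.
    """
    memory: List[Any] = []
    for t, frame in enumerate(frames):
        memory.append(frame)
        if readiness_fn(memory) >= threshold:
            return t, answer_fn(memory)
    return None

# Toy usage: frames are labels, and the evidence is the label "cat".
stream = ["sky", "road", "cat", "dog"]
result = answer_when_ready(
    stream,
    readiness_fn=lambda mem: 1.0 if "cat" in mem else 0.0,
    answer_fn=lambda mem: "a cat appears",
)
```

In the toy usage, the loop stays silent on the first two frames and answers at index 2, the first step at which the evidence is in memory.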

Figure 2 : Framework Overview. StreamReady encodes streaming videos into a visual memory tree and reasons through short and long-term branches. A learnable <RDY> token, guided by a readiness head, gates the reasoning output until sufficient evidence is observed. Once ready, the long-term representat

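Experimental Results

Research Questions

RQ1: How can we formally evaluate both the correctness and the timing of answers in streaming video question answering?
RQ2: Can a lightweight readiness mechanism reliably determine when enough visual evidence has accumulated to answer?
RQ3: Can memory-augmented reasoning improve both accuracy and responsiveness in proactive streaming scenarios?
RQ4: Does readiness-aware streaming generalize well to long videos and to different streaming benchmarks?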

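Key Findings

On ProReady-QA, StreamReady achieves higher accuracy and ARS than the baselines, indicating better temporal alignment and correctness.
The readiness mechanism reduces mistimed responses and shortens the temporal gap between evidence and answer.
StreamReady consistently outperforms prior methods on multiple streaming benchmarks, in both proactive and non-proactive settings.
The memory hierarchy and query-aware reasoning enable robust understanding over long time spans and better evidence retrieval.
StreamReady also performs well on offline long-video benchmarks, indicating generalization beyond readiness evaluation.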

Figure 3 : Examples of each task in ProReady-QA. Here, the question and answer frames are color-coded.

This digest was generated by AI and reviewed by human editors.