QUICK REVIEW

[Paper Review] Em-Garde: A Propose-Match Framework for Proactive Streaming Video Understanding

Yikai Zheng, Xin Ding|arXiv (Cornell University)|Mar 19, 2026

Multimodal Machine Learning Applications | Cited by 0

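One-Sentence Summary

Em-Garde decouples semantic parsing from streaming perception: at query time it parses the user query into visual proposals, and during streaming it runs a lightweight per-frame matcher, enabling efficient, real-time proactive responses for streaming video understanding.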

ABSTRACT

Recent advances in Streaming Video Understanding have enabled a new interaction paradigm where models respond proactively to user queries. Current proactive VideoLLMs rely on per-frame triggering decision making, which suffers from an efficiency-accuracy dilemma. We propose Em-Garde, a novel framework that decouples semantic understanding from streaming perception. At query time, the Instruction-Guided Proposal Parser transforms user queries into structured, perceptually grounded visual proposals; during streaming, a Lightweight Proposal Matching Module performs efficient embedding-based matching to trigger responses. Experiments on StreamingBench and OVO-Bench demonstrate consistent improvements over prior models in proactive response accuracy and efficiency, validating an effective solution for proactive video understanding under strict computational constraints.

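Motivation and Objectives

Resolve the efficiency-accuracy dilemma in proactive streaming video understanding under a tight compute budget.
Decouple semantic reasoning from per-frame perception so that triggering decisions can be made in real time.
Convert user queries into perceptually grounded visual proposals at query time to guide lightweight streaming perception.
Curate a dataset and demonstrate improvements on standard benchmarks for proactive response and online understanding.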

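Proposed Method

The Instruction-Guided Proposal Parser (IGPP) uses a large multimodal language model to convert natural-language instructions into structured visual proposals.
The Parse2Prop-1K dataset is used for supervised fine-tuning and reinforcement learning of the IGPP to optimize triggering accuracy.
The Lightweight Proposal Matching Module (LPMM) runs in the streaming loop, embeds short video segments and proposals into a lightweight multimodal space, and computes their cosine similarity.
Triggering decisions are derived from the temporal evolution of the similarity scores using a simple threshold rule.
Visual encoding caching speeds up streaming by reusing frame encodings, reaching 10–15 fps on long videos on an A100 GPU.
The LPMM requires no fine-tuning of the embedding model; it uses an off-the-shelf embedding model (Ops-MM-V1).
IGPP training includes supervised learning on Parse2Prop-1K with human- or GPT-5-generated proposals; the reinforcement learning reward favors triggers that fire correctly near the event and carries an adjustable false-positive penalty λ.
Evaluation follows established proactive streaming benchmarks, measuring both triggering accuracy and downstream responses.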
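
To make the LPMM streaming loop more concrete, here is a minimal sketch of embedding-based matching with a cached sliding window and a threshold trigger. It is an illustration under stated assumptions, not the authors' implementation: the class `ProposalMatcher`, the `encode_frame` callback, mean pooling over the window, and the "fire when the score crosses θ from below" rule are all hypothetical choices, and the real system would obtain embeddings from a model such as Ops-MM-V1 rather than from the toy vectors used here.

```python
import math
from collections import deque
from typing import Callable, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class ProposalMatcher:
    """Hypothetical streaming matcher (names and rules are assumptions).

    Each frame is encoded exactly once and kept in a bounded cache, so
    overlapping sliding windows reuse earlier encodings.
    """

    def __init__(
        self,
        proposals: List[Sequence[float]],
        encode_frame: Callable[[Sequence[float]], Sequence[float]],
        window: int = 4,
        theta: float = 0.8,
    ) -> None:
        self.proposals = proposals
        self.encode_frame = encode_frame
        self.theta = theta
        self.cache = deque(maxlen=window)  # most recent frame embeddings
        self.prev_score = float("-inf")

    def step(self, frame: Sequence[float]) -> bool:
        """Consume one frame; return True if a response should be triggered."""
        self.cache.append(list(self.encode_frame(frame)))
        dim = len(self.cache[0])
        # Assumption: the segment embedding is the mean of cached frame embeddings.
        segment = [
            sum(emb[i] for emb in self.cache) / len(self.cache) for i in range(dim)
        ]
        score = max(cosine(segment, p) for p in self.proposals)
        # Assumption: trigger only when the score rises through theta,
        # so a sustained match fires once instead of on every frame.
        fired = self.prev_score < self.theta <= score
        self.prev_score = score
        return fired


if __name__ == "__main__":
    matcher = ProposalMatcher(
        [[1.0, 0.0]], encode_frame=lambda f: f, window=2, theta=0.9
    )
    stream = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
    print([matcher.step(f) for f in stream])
```

In this toy stream the matcher stays silent while the window still mixes unrelated and matching frames, and fires once when the window is fully aligned with the proposal. Because the per-frame work is one encoding plus a few dot products, the heavy semantic reasoning stays outside the loop.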

Figure 1: Demonstration of our model vs. existing Streaming VideoLLMs on the Proactive Streaming Understanding task. While existing models solve a complicated response/silence decision-making problem at every timestep, we turn the problem into a simple perception problem with query-time semantic p

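Experimental Results

Research Questions

RQ1: Under a tightly constrained compute budget, can decoupling semantic parsing from perception improve the efficiency of proactive streaming video understanding without sacrificing triggering accuracy?
RQ2: Can an instruction-guided proposer generate perceptual visual cues that a lightweight perception module can match reliably?
RQ3: How does reinforcement-learning-based proposal optimization affect trigger timing and false triggers?
RQ4: How does the Em-Garde framework perform in accuracy and speed on standard proactive-response and online video understanding benchmarks?
RQ5: What trade-offs do different trigger-threshold settings present across task challenges?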

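Key Findings

Outperforms existing real-time proactive streaming models in proactive response accuracy on StreamingBench and OVO-Bench (more than 3% higher accuracy on StreamingBench and 10% higher F1 on OVO-Bench).
Achieves state-of-the-art processing speed of 10–15 fps on arbitrarily long videos on an A100 GPU.
Maintains online video understanding performance on real-time perception tasks that is competitive with SOTA streaming multimodal large language models (StreamingBench and OVO-Bench).
RL training improves proposal quality, aligning the perceptual cues with the perception module and improving trigger timing.
Explicit threshold control (θ) provides a tunable trade-off between recall and precision across tasks.
The two-stage design (IGPP + LPMM) effectively decouples heavy semantic reasoning from fast perception, supporting scalable proactive streaming over long horizons.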
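
The recall/precision trade-off controlled by θ, and the idea of a reward with a false-positive penalty λ, can be illustrated with a small sketch. Everything here is an assumption made for illustration: the function names, the greedy matching of triggers to events within a `tolerance` window, and the reward form `hits - lam * false_positives` are hypothetical and are not taken from the paper, whose exact reward and evaluation protocol are not given in this review. The similarity scores are made-up toy values, not experimental data.

```python
from typing import List, Sequence, Tuple


def triggers_from_scores(scores: Sequence[float], theta: float) -> List[int]:
    """Frame indices where the score rises through theta (assumed trigger rule)."""
    fired = []
    prev = float("-inf")
    for t, s in enumerate(scores):
        if prev < theta <= s:
            fired.append(t)
        prev = s
    return fired


def count_hits(fired: Sequence[int], events: Sequence[int], tolerance: int) -> int:
    """Greedily match each trigger to one unmatched event within the tolerance."""
    unmatched = list(events)
    hits = 0
    for t in fired:
        match = next((e for e in unmatched if abs(t - e) <= tolerance), None)
        if match is not None:
            unmatched.remove(match)
            hits += 1
    return hits


def precision_recall(
    fired: Sequence[int], events: Sequence[int], tolerance: int
) -> Tuple[float, float]:
    """Precision and recall of triggers; empty denominators count as 1.0."""
    hits = count_hits(fired, events, tolerance)
    precision = hits / len(fired) if fired else 1.0
    recall = hits / len(events) if events else 1.0
    return precision, recall


def trigger_reward(
    fired: Sequence[int], events: Sequence[int], tolerance: int, lam: float
) -> float:
    """Hypothetical reward: +1 per near-event trigger, -lam per false positive."""
    hits = count_hits(fired, events, tolerance)
    return hits - lam * (len(fired) - hits)


if __name__ == "__main__":
    scores = [0.1, 0.6, 0.2, 0.9, 0.3, 0.65, 0.2]  # toy similarity trace
    events = [1, 3]                                # toy ground-truth event frames
    for theta in (0.5, 0.8):
        fired = triggers_from_scores(scores, theta)
        p, r = precision_recall(fired, events, tolerance=1)
        reward = trigger_reward(fired, events, tolerance=1, lam=0.5)
        print(f"theta={theta}: fired={fired} precision={p:.2f} "
              f"recall={r:.2f} reward={reward:.2f}")
```

On this toy trace, the lower threshold catches both events but also fires once spuriously, while the higher threshold fires only on the strongest match and misses one event. A larger λ would shift the preferred operating point toward the higher threshold.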

Figure 2: Overview of the Em-Garde Framework: IGPP (Orange) receives the Instruction $I$ and a low-fps video context before query time, and parses the instruction into perceptually-grounded visual cues. LPMM (Blue) runs in the streaming loop, matching the current sliding-window video segment to the

This review was generated by AI and reviewed by human editors.