QUICK REVIEW

[Paper Review] StreamSense: Streaming Social Task Detection with Selective Vision-Language Model Routing

Han Wang, Deyi Ji | arXiv (Cornell University) | Jan 30, 2026

Hate Speech and Cyberbullying Detection | Cited by 0

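One-Sentence Summary

StreamSense couples a lightweight streaming encoder with selective routing to a Vision-Language Model (VLM) expert: it escalates hard cases to the VLM and defers decisions when context is insufficient, which yields faster, lower-latency social task detection.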

ABSTRACT

Live streaming platforms require real-time monitoring and reaction to social signals, utilizing partial and asynchronous evidence from video, text, and audio. We propose StreamSense, a streaming detector that couples a lightweight streaming encoder with selective routing to a Vision-Language Model (VLM) expert. StreamSense handles most timestamps with the lightweight streaming encoder, escalates hard/ambiguous cases to the VLM, and defers decisions when context is insufficient. The encoder is trained using (i) a cross-modal contrastive term to align visual/audio cues with textual signals, and (ii) an IoU-weighted loss that down-weights poorly overlapping target segments, mitigating label interference across segment boundaries. We evaluate StreamSense on multiple social streaming detection tasks (e.g., sentiment classification and hate content moderation), and the results show that StreamSense achieves higher accuracy than VLM-only streaming while only occasionally invoking the VLM, thereby reducing average latency and compute. Our results indicate that selective escalation and deferral are effective primitives for understanding streaming social tasks. Code is publicly available on GitHub.
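
The abstract names a cross-modal contrastive term but this review does not give its exact form. As a minimal sketch, assuming a symmetric InfoNCE objective over paired embeddings (a common choice, not confirmed as the authors' formulation), the alignment idea looks like this. The function names, the temperature value, and the toy embeddings are all hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(av_embs, text_embs, temperature=0.1):
    """Symmetric InfoNCE over a batch of paired embeddings.

    av_embs[i] (visual/audio) and text_embs[i] (text) are treated as a
    positive pair; every other combination in the batch is a negative.
    ASSUMPTION: the paper's contrastive term may differ from this form.
    """
    av = [normalize(v) for v in av_embs]
    tx = [normalize(v) for v in text_embs]
    n = len(av)
    # Cosine similarities scaled by the temperature.
    logits = [[dot(av[i], tx[j]) / temperature for j in range(n)]
              for i in range(n)]

    def mean_cross_entropy(rows):
        # Cross-entropy where the correct class of row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    transposed = [list(col) for col in zip(*logits)]
    # Average the av-to-text and text-to-av directions.
    return 0.5 * (mean_cross_entropy(logits) + mean_cross_entropy(transposed))


# Toy batch: correctly paired embeddings versus deliberately swapped ones.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the loss is lower when each visual/audio embedding sits next to its own text embedding than when the pairs are swapped, which is the behavior a contrastive alignment term is meant to reward.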

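Motivation and Objectives

Enable real-time monitoring of social signals across video, text, and audio in live streams.
Develop a streaming detector that handles most timestamps with a lightweight encoder and routes only hard cases to the VLM.
Mitigate label interference across segments with an IoU-weighted loss, and align cross-modal cues with textual signals.
Evaluate StreamSense on tasks such as sentiment classification and hate content moderation.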
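
To make the IoU-weighting objective concrete, the following is a minimal sketch, assuming that each training sample is a streaming window paired with an annotated target segment and that the per-sample cross-entropy is scaled by their temporal IoU. The function names and the normalization by total weight are assumptions made for illustration; the review does not state the authors' exact formula.

```python
import math


def temporal_iou(window, segment):
    """Intersection-over-union of two (start, end) time intervals."""
    (w_start, w_end), (s_start, s_end) = window, segment
    inter = max(0.0, min(w_end, s_end) - max(w_start, s_start))
    union = (w_end - w_start) + (s_end - s_start) - inter
    return inter / union if union > 0 else 0.0


def iou_weighted_ce(probs, labels, windows, segments):
    """Cross-entropy averaged with per-sample temporal-IoU weights.

    Windows that barely overlap their labeled segment contribute little,
    so labels near segment boundaries interfere less with training.
    ASSUMPTION: dividing by the total weight is an illustrative choice.
    """
    weights = [temporal_iou(w, s) for w, s in zip(windows, segments)]
    losses = [-math.log(max(p[y], 1e-12)) for p, y in zip(probs, labels)]
    total_weight = sum(weights)
    if total_weight == 0:
        return 0.0
    return sum(w * l for w, l in zip(weights, losses)) / total_weight


# Two samples: the first window matches its segment exactly, the second
# does not overlap its segment at all and therefore gets zero weight.
loss = iou_weighted_ce(
    probs=[[0.9, 0.1], [0.2, 0.8]],
    labels=[0, 0],
    windows=[(0.0, 4.0), (0.0, 1.0)],
    segments=[(0.0, 4.0), (2.0, 3.0)],
)
```

Here the second sample is a confident mistake, yet it does not affect the loss because its window lies entirely outside the labeled segment.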

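Proposed Method

Use a lightweight streaming encoder for most timestamps in a live stream.
Introduce selective routing to a Vision-Language Model (VLM) expert for hard or ambiguous cases.
Add a cross-modal contrastive loss that aligns visual/audio cues with textual signals.
Apply an IoU-weighted loss that down-weights target segments with poor overlap, reducing label interference.
Allow decisions to be deferred when context is insufficient.
Compare against a VLM-only streaming baseline to measure accuracy, latency, and compute.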
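
The three-way behavior described above (answer with the encoder, escalate to the VLM, or defer) can be sketched as a small routing function. This is an illustration only: the review does not say how StreamSense decides when to escalate or defer, so the confidence threshold, the minimum-context rule, and every name below are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Decision:
    action: str            # "encoder", "vlm", or "defer"
    label: Optional[int]   # None when the decision is deferred


def route(
    probs: Sequence[float],
    context_len: int,
    vlm_expert: Callable[[], int],
    min_context: int = 4,      # hypothetical: fewer observed steps means defer
    confident: float = 0.8,    # hypothetical: encoder confidence threshold
) -> Decision:
    """Pick one of three actions for the current timestamp."""
    # 1. Too little context so far: wait for more evidence.
    if context_len < min_context:
        return Decision("defer", None)
    # 2. The lightweight encoder is confident: use its prediction directly.
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] >= confident:
        return Decision("encoder", best)
    # 3. Ambiguous case: pay for the expensive VLM call.
    return Decision("vlm", vlm_expert())


def fake_vlm() -> int:
    """Stand-in for a real VLM call; always predicts class 1."""
    return 1


easy = route([0.95, 0.05], context_len=10, vlm_expert=fake_vlm)
hard = route([0.55, 0.45], context_len=10, vlm_expert=fake_vlm)
early = route([0.95, 0.05], context_len=2, vlm_expert=fake_vlm)
```

Passing the VLM as a callable means the expensive model runs only on the escalation branch, which is the property that keeps average cost low when most timestamps are easy.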

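Experimental Results

Research Questions

RQ1: Does selective routing to a VLM improve accuracy in streaming social task detection compared with VLM-only streaming?
RQ2: How does the IoU-weighted loss affect label interference at segment boundaries?
RQ3: Does deferring decisions under low context improve overall latency and resource usage?
RQ4: What is the accuracy-latency trade-off when only hard cases are escalated to the VLM?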

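Key Findings

StreamSense achieves higher accuracy than VLM-only streaming on the evaluated social streaming detection tasks.
Because the VLM is invoked only at hard or ambiguous moments, average latency and compute are reduced.
Cross-modal contrastive alignment ties visual/audio cues to textual signals, which supports detection.
The IoU-weighted loss mitigates label interference at segment boundaries, improving robustness.
Selective escalation and deferral are effective primitives for understanding streaming social tasks.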
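
The latency argument can be made concrete with a back-of-the-envelope model. The sketch below assumes the encoder runs on every timestamp and the VLM cost is added only on escalated ones. The function name and all numbers are illustrative placeholders, not measurements from the paper.

```python
def expected_latency(encoder_ms: float, vlm_ms: float, escalation_rate: float) -> float:
    """Average per-timestamp latency under selective escalation.

    ASSUMPTION: the encoder always runs, and a VLM call is added with
    probability `escalation_rate`.
    """
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be in [0, 1]")
    return encoder_ms + escalation_rate * vlm_ms


# Illustrative placeholder values only.
selective = expected_latency(encoder_ms=10.0, vlm_ms=500.0, escalation_rate=0.1)
vlm_only = 500.0  # every timestamp pays the full VLM cost
```

Under this simple model, selective routing beats a VLM-only pipeline on average latency whenever `encoder_ms` is smaller than `(1 - escalation_rate) * vlm_ms`, so the benefit grows as escalations become rarer.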


This review was generated by AI and reviewed by a human editor.