QUICK REVIEW

[Paper Review] RIVER: A Real-Time Interaction Benchmark for Video LLMs

Yansong Shi, Qingsong Zhao|arXiv (Cornell University)|Mar 4, 2026
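Multimodal Machine Learning Applications | Cited by 0
One-Sentence Summary

The paper introduces RIVER Bench, a benchmark for evaluating real-time interaction in video language models. It focuses on retrospective memory, live perception, and proactive response, and provides a memory-augmented framework and a training dataset to improve online understanding.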

ABSTRACT

The rapid advancement of multimodal large language models has demonstrated impressive capabilities, yet nearly all operate in an offline paradigm, hindering real-time interactivity. Addressing this gap, we introduce the Real-tIme Video intERaction Bench (RIVER Bench), designed for evaluating online video comprehension. RIVER Bench introduces a novel framework comprising Retrospective Memory, Live-Perception, and Proactive Anticipation tasks, closely mimicking interactive dialogues rather than responding to entire videos at once. We conducted detailed annotations using videos from diverse sources and varying lengths, and precisely defined the real-time interactive format. Evaluations across various model categories reveal that while offline models perform well in single question-answering tasks, they struggle with real-time processing. Addressing the limitations of existing models in online video interaction, especially their deficiencies in long-term memory and future perception, we proposed a general improvement method that enables models to interact with users more flexibly in real time. We believe this work will significantly advance the development of real-time interactive video understanding models and inspire future research in this emerging field. Datasets and code are publicly available at https://github.com/OpenGVLab/RIVER.

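Research Motivation and Objectives

  • Define interactive online video understanding and formalize metrics for memory, perception, and anticipation.
  • Establish a precise, time-aligned evaluation framework built on retrospective memory, live perception, and proactive response.
  • Curate and annotate a dataset from diverse video sources for accurate online interaction testing.
  • Propose a memory-augmented architecture and an interactive training dataset to improve online reasoning.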

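Proposed Method

  • Introduce RIVER Bench with three task types (retrospective memory, live perception, proactive response) to evaluate temporal awareness in online video understanding.
  • Construct a window-based video-text-to-text formulation with streaming input and timestamped cues, questions, and answers.
  • Curate data from Vript-RR, LVBench, LongVideoBench, Ego4D, and QVHighlights, with precise temporal annotations.
  • Introduce a long-short-term memory module that stores and compresses historical visual information for long-term retention.
  • Fine-tune models on a dedicated interactive training dataset to improve forward-looking interaction.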
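To make the idea of storing and compressing historical visual information concrete, here is a minimal sketch of a two-tier frame memory. It is not the paper's implementation: the class name `LongShortTermMemory`, the capacities, and the mean-pooling merge rule are all assumptions chosen for illustration, and frame features are plain lists of floats rather than real encoder outputs.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List

Vector = List[float]


def mean_pool(vectors: List[Vector]) -> Vector:
    """Average a non-empty list of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


@dataclass
class LongShortTermMemory:
    """Hypothetical two-tier memory: recent frames stay intact, older ones are merged."""

    short_capacity: int = 4  # assumed number of recent frames kept uncompressed
    long_capacity: int = 3   # assumed number of compressed slots for older history
    short_term: Deque[Vector] = field(default_factory=deque)
    long_term: List[Vector] = field(default_factory=list)

    def add_frame(self, feature: Vector) -> None:
        """Insert one frame feature and keep both tiers within their capacity."""
        self.short_term.append(feature)
        if len(self.short_term) > self.short_capacity:
            # The oldest recent frame moves into long-term storage.
            self.long_term.append(self.short_term.popleft())
        if len(self.long_term) > self.long_capacity:
            # Compress by merging the two oldest slots (an assumed, unweighted rule).
            self.long_term[:2] = [mean_pool(self.long_term[:2])]

    def context(self) -> List[Vector]:
        """Return compressed history followed by recent frames, oldest first."""
        return self.long_term + list(self.short_term)


memory = LongShortTermMemory()
for t in range(10):
    memory.add_frame([float(t)])  # toy one-dimensional "features"
print(len(memory.context()))      # total slots never exceed short + long capacity
```

The total context size stays bounded no matter how long the stream runs, which is the property a streaming model needs. A real system would likely merge by feature similarity or learned compression rather than by age alone.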
Figure 1: Illustration of different online interaction tasks. The question (Query), reference events (Cue), and answers timings are represented by , and , respectively. Based on the frequency and timing of reference events, questions, and answers, we further categorize online interaction tasks into
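One way to read the relationship between cue, query, and answer timing is sketched below. The mapping from timing to task type is an assumption inferred from the three task names in the abstract, not a definition taken from the paper, and the names `Interaction`, `task_type`, and `visible_window` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Interaction:
    cue_start: float   # seconds, when the reference event begins
    cue_end: float     # seconds, when the reference event ends
    query_time: float  # seconds, when the user asks the question


def task_type(item: Interaction) -> str:
    """Classify an interaction by where the cue falls relative to the query (assumed rule)."""
    if item.cue_end < item.query_time:
        return "retrospective_memory"  # the event is already over when the question arrives
    if item.cue_start > item.query_time:
        return "proactive_response"    # the model must wait for a future event
    return "live_perception"           # the event is happening at question time


def visible_window(timestamps: List[float], now: float, window: float) -> List[float]:
    """Return frame timestamps in (now - window, now]; a streaming model sees nothing later."""
    return [t for t in timestamps if now - window < t <= now]


print(task_type(Interaction(cue_start=10, cue_end=20, query_time=60)))
print(visible_window([0, 10, 20, 30, 40], now=30, window=20))
```

The window function captures the key constraint of the online setting: at any moment the model may only use frames up to the current time, unlike offline evaluation where the whole video is available at once.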

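Experimental Results

Research Questions

  • RQ1: How do current multimodal language models perform on retrospective memory, live perception, and proactive response in streaming video?
  • RQ2: How do memory mechanisms affect long-term retrieval and response timeliness in online video understanding?
  • RQ3: Can an online training paradigm improve a model's proactive, forward-looking interaction in video streams?

Key Findings

  • Offline models excel at single question-answering tasks but fall short in strict real-time streaming settings.
  • Memory augmentation and online inference methods improve performance on retrospective memory and proactive response tasks.
  • A dedicated interactive training dataset significantly improves online interaction, especially proactive response.
  • GPT-4o shows strong overall performance on the benchmark across live perception, retrospective memory, and proactive response tasks.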


This review was generated by AI and reviewed by a human editor.