QUICK REVIEW

[Paper Review] RIVER: A Real-Time Interaction Benchmark for Video LLMs

Yansong Shi, Qingsong Zhao|arXiv (Cornell University)|Mar 4, 2026

Multimodal Machine Learning Applications | Citations: 0

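One-Sentence Summary

This paper proposes RIVER Bench, a benchmark for evaluating real-time interaction that focuses on retro-memory, live-perception, and pro-active response, and provides a memory-augmented framework and a training dataset that improve online understanding.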

ABSTRACT

The rapid advancement of multimodal large language models has demonstrated impressive capabilities, yet nearly all operate in an offline paradigm, hindering real-time interactivity. Addressing this gap, we introduce the Real-tIme Video intERaction Bench (RIVER Bench), designed for evaluating online video comprehension. RIVER Bench introduces a novel framework comprising Retrospective Memory, Live-Perception, and Proactive Anticipation tasks, closely mimicking interactive dialogues rather than responding to entire videos at once. We conducted detailed annotations using videos from diverse sources and varying lengths, and precisely defined the real-time interactive format. Evaluations across various model categories reveal that while offline models perform well in single question-answering tasks, they struggle with real-time processing. Addressing the limitations of existing models in online video interaction, especially their deficiencies in long-term memory and future perception, we proposed a general improvement method that enables models to interact with users more flexibly in real time. We believe this work will significantly advance the development of real-time interactive video understanding models and inspire future research in this emerging field. Datasets and code are publicly available at https://github.com/OpenGVLab/RIVER.

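Motivation and Objectives

Define interactive online video understanding and formalize metrics for memory, perception, and anticipation.
Establish a precise, temporally grounded evaluation framework built on retrospection, live perception, and proactive response.
Curate and annotate a dataset from diverse video sources for testing online interaction.
Propose a memory-augmented architecture and an interactive training dataset that improve online inference.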

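Proposed Method

Propose RIVER Bench, which has three task types (retro-memory, live-perception, pro-response) for evaluating temporal awareness in online video understanding.
Build a window-based video-text formulation that uses streaming input with timestamped cues, questions, and answers.
Curate data with precise temporal annotations from Vript-RR, LVBench, LongVideoBench, Ego4D, and QVHighlights.
Introduce a long-short term memory module that stores and compresses past visual information to enable long-term retention.
Fine-tune the model on a specialized interactive training dataset to strengthen its forward-looking interaction capability.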
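The timestamped formulation of cues, questions, and answers can be illustrated with a small sketch. The review does not give the benchmark's actual schema, so the `StreamSample` fields and the `categorize` rule below are hypothetical: the sketch assumes a task type can be inferred by comparing the query time with the cue interval (cue already over means retro-memory, cue on screen means live-perception, cue still to come means pro-response).

```python
from dataclasses import dataclass
from enum import Enum


class TaskType(Enum):
    RETRO_MEMORY = "retro-memory"
    LIVE_PERCEPTION = "live-perception"
    PRO_RESPONSE = "pro-response"


@dataclass(frozen=True)
class StreamSample:
    """One timestamped interaction over a video stream (hypothetical schema)."""
    cue_start: float   # second at which the reference event begins
    cue_end: float     # second at which the reference event ends
    query_time: float  # second at which the user asks the question
    question: str
    answer: str

    def __post_init__(self) -> None:
        if self.cue_start > self.cue_end:
            raise ValueError("cue_start must not be later than cue_end")


def categorize(sample: StreamSample) -> TaskType:
    """Assumed rule: compare the query time with the cue interval."""
    if sample.query_time > sample.cue_end:
        # The evidence is already in the past, so the model must remember it.
        return TaskType.RETRO_MEMORY
    if sample.query_time < sample.cue_start:
        # The evidence has not appeared yet, so the model must wait and respond later.
        return TaskType.PRO_RESPONSE
    # The evidence is visible at the moment the question is asked.
    return TaskType.LIVE_PERCEPTION
```

For example, `categorize(StreamSample(10.0, 15.0, 60.0, "What did she pick up?", "A cup"))` falls in the retro-memory branch, because the question arrives 45 seconds after the cue ends. The benchmark itself may use additional criteria, such as how often cues and answers occur.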

Figure 1: Illustration of different online interaction tasks. The timings of the question (Query), reference events (Cue), and answers are each marked with a distinct symbol. Based on the frequency and timing of reference events, questions, and answers, we further categorize online interaction tasks into

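Experimental Results

Research Questions

RQ1: How well can current MLLMs handle retrospective memory, real-time perception, and proactive response in streaming video?
RQ2: How do memory mechanisms affect long-term retrieval and response timeliness in online video understanding?
RQ3: Can an online learning paradigm improve model performance on proactive, future-oriented interaction in video streams?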

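Key Findings

Offline models excel at single QA tasks but struggle in strict real-time streaming settings.
Memory-augmented and online-inference approaches improve performance on retro-memory and pro-response tasks.
The specialized interactive training dataset substantially improves online interaction capability, especially proactive response.
GPT-4o achieves strong overall performance on the benchmark across the live-perception, retro-memory, and pro-response tasks.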
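To make the idea of memory augmentation more concrete, here is a minimal sketch of a long-short term memory buffer that stores recent frame features and compresses older ones. It is not the paper's module: the class name `LongShortTermMemory`, the capacities, and the compression rule (averaging adjacent long-term entries) are assumptions chosen only for illustration.

```python
from collections import deque
from typing import Deque, List, Sequence


class LongShortTermMemory:
    """Toy frame-feature memory: recent frames kept verbatim, older ones merged."""

    def __init__(self, short_capacity: int, long_capacity: int) -> None:
        if short_capacity < 1 or long_capacity < 2:
            raise ValueError("short_capacity must be >= 1 and long_capacity >= 2")
        self.short_capacity = short_capacity
        self.long_capacity = long_capacity
        self.short: Deque[List[float]] = deque()  # most recent frames, uncompressed
        self.long: List[List[float]] = []         # older frames, compressed

    def add(self, feature: Sequence[float]) -> None:
        """Append one frame feature; spill the oldest short-term entry if full."""
        self.short.append(list(feature))
        if len(self.short) > self.short_capacity:
            self.long.append(self.short.popleft())
            if len(self.long) > self.long_capacity:
                self._compress()

    def _compress(self) -> None:
        """Roughly halve long-term memory by averaging adjacent pairs (assumed rule)."""
        merged: List[List[float]] = []
        for i in range(0, len(self.long) - 1, 2):
            a, b = self.long[i], self.long[i + 1]
            merged.append([(x + y) / 2 for x, y in zip(a, b)])
        if len(self.long) % 2 == 1:
            merged.append(self.long[-1])  # odd entry out is kept unchanged
        self.long = merged

    def context(self) -> List[List[float]]:
        """Return features in temporal order: compressed past, then recent frames."""
        return self.long + list(self.short)
```

With `short_capacity=2` and `long_capacity=2`, adding the one-dimensional features 0 through 4 leaves `[[0.5], [2.0], [3.0], [4.0]]` in `context()`: the two oldest frames have been merged while the newest two stay intact. Repeated averaging weights old frames unevenly, so a real system would likely use a learned or similarity-based compression instead.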

This review was generated by AI and checked by a human editor.