QUICK REVIEW

[Paper Review] RIVER: A Real-Time Interaction Benchmark for Video LLMs

Yansong Shi, Qingsong Zhao | arXiv (Cornell University) | 2026. 03. 04.

Multimodal Machine Learning Applications | Citations: 0

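One-Line Summary

This paper introduces RIVER Bench, a benchmark for evaluating real-time interaction that focuses on Retro-Memory, Live-Perception, and Pro-Response, and presents a memory-augmented framework and a training dataset that improve online understanding.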

ABSTRACT

The rapid advancement of multimodal large language models has demonstrated impressive capabilities, yet nearly all operate in an offline paradigm, hindering real-time interactivity. Addressing this gap, we introduce the Real-tIme Video intERaction Bench (RIVER Bench), designed for evaluating online video comprehension. RIVER Bench introduces a novel framework comprising Retrospective Memory, Live-Perception, and Proactive Anticipation tasks, closely mimicking interactive dialogues rather than responding to entire videos at once. We conducted detailed annotations using videos from diverse sources and varying lengths, and precisely defined the real-time interactive format. Evaluations across various model categories reveal that while offline models perform well in single question-answering tasks, they struggle with real-time processing. Addressing the limitations of existing models in online video interaction, especially their deficiencies in long-term memory and future perception, we proposed a general improvement method that enables models to interact with users more flexibly in real time. We believe this work will significantly advance the development of real-time interactive video understanding models and inspire future research in this emerging field. Datasets and code are publicly available at https://github.com/OpenGVLab/RIVER.

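Motivation and Objectives

Define interactive online video understanding and formalize metrics for memory, perception, and anticipation.
Build an accurate, temporally grounded evaluation framework based on retrospection, live perception, and proactive response.
Curate and annotate a dataset from diverse video sources to make online interaction testing more accurate.
Propose a memory-augmented architecture and an interactive training dataset that improves online inference.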

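Proposed Method

Propose RIVER Bench with three task types (Retro-Memory, Live-Perception, Pro-Response) to evaluate temporal awareness in online video understanding.
Construct a window-based video-text-to-text format with streaming input and timestamped signals, questions, and answers.
Curate data from Vript-RR, LVBench, LongVideoBench, Ego4D, and QVHighlights with precise temporal annotations.
Introduce a long-short term memory module that stores and compresses past visual information for long-term retention.
Fine-tune models on a specialized interactive training dataset to improve forward-looking interaction capability.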
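
The long-short term memory idea above can be illustrated with a small sketch. This is not the paper's implementation: the class name `LongShortTermMemory`, the capacities, and the compression rule (averaging the two oldest long-term entries) are all assumptions chosen for illustration. It shows only the general pattern of keeping recent frame features verbatim while compressing older ones so that memory stays bounded on an unbounded stream.

```python
from collections import deque
from typing import Deque, List

Feature = List[float]  # stand-in for a per-frame visual feature vector


class LongShortTermMemory:
    """Hypothetical sketch: recent frames are kept as-is, older ones are merged."""

    def __init__(self, short_capacity: int = 4, long_capacity: int = 4) -> None:
        self.short: Deque[Feature] = deque()  # most recent frames, uncompressed
        self.long: List[Feature] = []         # older frames, progressively merged
        self.short_capacity = short_capacity
        self.long_capacity = long_capacity

    def add(self, feature: Feature) -> None:
        """Append one frame feature from the stream."""
        self.short.append(list(feature))
        if len(self.short) > self.short_capacity:
            # The oldest short-term frame is moved into long-term memory.
            self.long.append(self.short.popleft())
        if len(self.long) > self.long_capacity:
            self._compress()

    def _compress(self) -> None:
        # Assumed compression rule: replace the two oldest long-term
        # entries with their element-wise mean.
        first, second = self.long[0], self.long[1]
        merged = [(x + y) / 2.0 for x, y in zip(first, second)]
        self.long[:2] = [merged]

    def context(self) -> List[Feature]:
        """Features handed to the model: compressed past, then recent frames."""
        return self.long + list(self.short)
```

With `short_capacity=2` and `long_capacity=3`, streaming any number of frames leaves at most five entries in `context()`, with the oldest entries carrying a blended summary of everything that was merged into them.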

Figure 1: Illustration of different online interaction tasks, marking the timing of the question (Query), the reference events (Cue), and the answers. Online interaction tasks are further categorized by the frequency and timing of reference events, questions, and answers.
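
To make the roles of the three timestamps concrete, here is a minimal sketch of a timestamped interaction record. The field names, the `tolerance` parameter, and the direction labels are hypothetical and are not the benchmark's actual schema or task definitions. The mapping from relative timing to a backward-looking, live, or forward-looking interaction is an assumption inferred from the task names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InteractionSample:
    """Hypothetical record for one timestamped question-answer interaction."""
    cue_time: float     # seconds: when the reference event occurs in the video
    query_time: float   # seconds: when the user asks the question
    answer_time: float  # seconds: when the model gives its answer
    question: str
    answer: str


def temporal_direction(sample: InteractionSample, tolerance: float = 1.0) -> str:
    """Label a sample by where the cue lies relative to the query (assumed rule)."""
    gap = sample.query_time - sample.cue_time
    if gap > tolerance:
        return "retrospective"  # cue already passed: the model must remember it
    if gap < -tolerance:
        return "proactive"      # cue still to come: the model must wait for it
    return "live"               # cue and query roughly coincide


def is_answer_timely(sample: InteractionSample) -> bool:
    """An answer is grounded only once both the cue and the query have been seen."""
    return sample.answer_time >= max(sample.cue_time, sample.query_time)
```

For example, a sample with `cue_time=90.0` and `query_time=30.0` is labeled `"proactive"`, and an answer given at 50 seconds would not count as timely because the cue has not yet appeared in the stream.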

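Experimental Results

Research Questions

RQ1: How well do current MLLMs handle retrospective memory, live perception, and proactive response on streaming video?
RQ2: How do memory mechanisms affect long-term retrieval and response timeliness in online video understanding?
RQ3: Can an online training paradigm improve proactive, forward-looking interaction performance on video streams?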

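Key Findings

Offline models excel at single-QA tasks but struggle in strict real-time streaming settings.
Memory augmentation and online inference methods improve performance on the Retro-Memory and Pro-Response tasks.
The specialized interactive training dataset substantially improves online interaction ability and is particularly effective for proactive response.
GPT-4o achieves strong performance on the benchmark across the Live-Perception, Retro-Memory, and Pro-Response tasks.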

This review was generated by AI and reviewed by a human editor.