QUICK REVIEW

[Paper Review] EventMemAgent: Hierarchical Event-Centric Memory for Online Video Understanding with Adaptive Tool Use

Siwei Wen, Zhangcheng Wang|arXiv (Cornell University)|Feb 17, 2026

Multimodal Machine Learning Applications|Citations: 0

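One-Sentence Summary

EventMemAgent combines a dual-layer hierarchical memory with a multi-granular perception toolkit, trained with agentic RL, so that online video understanding under a fixed context window can be carried out actively and with access to prior information.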

ABSTRACT

Online video understanding requires models to perform continuous perception and long-range reasoning within potentially infinite visual streams. Its fundamental challenge lies in the conflict between the unbounded nature of streaming media input and the limited context window of Multimodal Large Language Models (MLLMs). Current methods primarily rely on passive processing, which often face a trade-off between maintaining long-range context and capturing the fine-grained details necessary for complex tasks. To address this, we introduce EventMemAgent, an active online video agent framework based on a hierarchical memory module. Our framework employs a dual-layer strategy for online videos: short-term memory detects event boundaries and utilizes event-granular reservoir sampling to process streaming video frames within a fixed-length buffer dynamically; long-term memory structuredly archives past observations on an event-by-event basis. Furthermore, we integrate a multi-granular perception toolkit for active, iterative evidence capture and employ Agentic Reinforcement Learning (Agentic RL) to end-to-end internalize reasoning and tool-use strategies into the agent's intrinsic capabilities. Experiments show that EventMemAgent achieves competitive results on online video benchmarks. The code will be released here: https://github.com/lingcco/EventMemAgent.

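Motivation and Objectives

Address the challenge of online video understanding, where the input is unbounded but the context window of MLLMs is fixed.
Propose a hierarchical memory system that separates short-term event-centric buffering from long-term event-centric archiving.
Develop a multi-granular perception toolkit that actively captures and uses task-relevant evidence.
Internalize reasoning and tool-use strategies through Agentic Reinforcement Learning, aiming for end-to-end optimization.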

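Proposed Method

Implement a hierarchical memory module whose short-term memory (STM) combines event-based buffering with reservoir sampling within each event.
Archive evicted events to long-term memory (LTM) as structured tuples containing a caption, a semantic embedding, and a change log.
Gather evidence with a multi-granular perception toolkit that includes memory retrieval (temporal and semantic search), OCR, and object detection.
Apply Agentic Reinforcement Learning (GRPO-based) to optimize planning and tool-use decisions within a ReAct-style reasoning framework.
Sample frames at 1 FPS and keep the STM capacity fixed to limit computation while preserving long-range context.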
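The STM and LTM steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the class and field names are hypothetical, the event-boundary rule (a drop in cosine similarity between consecutive frame features) is a stand-in because the review does not describe the actual detector, and the caption is a placeholder string where a captioning model would be called.

```python
import math
import random
from dataclasses import dataclass, field


@dataclass
class Event:
    """One event in short-term memory (hypothetical structure)."""
    start: int
    end: int
    frames: list = field(default_factory=list)  # sampled (t, feature) pairs
    seen: int = 0                               # frames observed so far

    def add(self, t, feat, k, rng):
        # Algorithm R: after n frames, each one is retained with probability k / n.
        self.seen += 1
        self.end = t
        if len(self.frames) < k:
            self.frames.append((t, feat))
        else:
            j = rng.randrange(self.seen)  # uniform integer in [0, seen)
            if j < k:
                self.frames[j] = (t, feat)


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class HierarchicalMemory:
    """STM holds at most max_events events of at most k frames each."""

    def __init__(self, k=4, max_events=3, boundary_sim=0.5, seed=0):
        self.k, self.max_events, self.boundary_sim = k, max_events, boundary_sim
        self.rng = random.Random(seed)
        self.stm = []      # list of Event, oldest first
        self.ltm = []      # archived event records
        self._prev = None  # feature of the previous frame

    def observe(self, t, feat):
        # Assumed boundary rule: similarity drop between consecutive frames.
        if self._prev is None or cosine(self._prev, feat) < self.boundary_sim:
            self.stm.append(Event(start=t, end=t))
            if len(self.stm) > self.max_events:
                self._archive(self.stm.pop(0))  # evict the oldest event
        self.stm[-1].add(t, feat, self.k, self.rng)
        self._prev = feat

    def _archive(self, ev):
        feats = [f for _, f in ev.frames]
        emb = [sum(col) / len(feats) for col in zip(*feats)]  # mean feature
        self.ltm.append({
            "span": (ev.start, ev.end),
            "caption": f"event {ev.start}-{ev.end}",  # placeholder for a captioner
            "embedding": emb,
            "change_log": [f"kept {len(feats)} of {ev.seen} frames"],
        })


def demo():
    mem = HierarchicalMemory(k=4, max_events=3)
    t = 0
    for seg in range(4):                  # four synthetic events
        feat = [1.0, 0.0] if seg % 2 == 0 else [0.0, 1.0]
        for _ in range(10):               # 10 frames each, i.e. 10 s at 1 FPS
            mem.observe(t, feat)
            t += 1
    return mem
```

Calling `demo()` streams 40 synthetic frames. The STM never holds more than `k * max_events` frames regardless of stream length, and the first event is moved to the LTM once a fourth event begins. Sampling inside each event, rather than over the whole buffer, is what keeps every retained frame attached to a single event.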

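Experimental Results

Research Questions

RQ1: How can a hierarchical memory structure improve online video understanding under a fixed context window?
RQ2: Can an active, agentic framework with multi-granular perception outperform passive memory methods on online video tasks?
RQ3: To what extent can end-to-end Agentic RL internalize reasoning and tool-use strategies for online video analysis?
RQ4: How do event-centric memory and reservoir sampling affect the preservation of semantic continuity in long streams?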

Key Findings

Category			Real-Time Visual Perception							Backward Tracing				Forward Active Responding				Overall
Model	Params	Frames	OCR	ACR	ATR	STU	FPD	OJR	Avg.	EPM	ASI	HLD	Avg.	REC	SSR	CRR	Avg.	Avg.
Ours	8B	≤ 32	75.84	69.72	73.28	55.62	67.33	67.93	68.29	59.60	70.95	43.55	58.03	33.67	72.02	62.08	55.92	60.75
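The sixteen scores in the row above can be checked against the header grouping with simple arithmetic. The sketch below assumes the reading in which six sub-task scores belong to Real-Time Visual Perception and three each to Backward Tracing and Forward Active Responding, with each group followed by its average and a final overall score; the variable names are illustrative only.

```python
# Sub-task scores as listed in the "Ours" row, grouped by category.
scores = {
    "Real-Time Visual Perception": [75.84, 69.72, 73.28, 55.62, 67.33, 67.93],
    "Backward Tracing": [59.60, 70.95, 43.55],
    "Forward Active Responding": [33.67, 72.02, 62.08],
}

# Category averages and overall score as reported in the same row.
reported = {
    "Real-Time Visual Perception": 68.29,
    "Backward Tracing": 58.03,
    "Forward Active Responding": 55.92,
}
reported_overall = 60.75


def mean(values):
    return sum(values) / len(values)


# Each category average is the mean of its sub-task scores.
category_avgs = {name: round(mean(vals), 2) for name, vals in scores.items()}

# The overall score matches the mean of the three category averages,
# not the mean of all twelve sub-task scores.
overall_from_categories = round(mean(list(reported.values())), 2)
overall_from_subtasks = round(mean([v for vals in scores.values() for v in vals]), 2)
```

Under this grouping the recomputed category averages agree with the reported ones, and the overall 60.75 is reproduced only by averaging the three category averages. Each category therefore carries equal weight regardless of how many sub-tasks it contains.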

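EventMemAgent achieves competitive results on online video benchmarks even with few input frames (e.g., 32 frames on OVO-Bench).
Compared with fixed-length memory, the hierarchical memory reduces semantic fragmentation and improves stability and accuracy.
The multi-granular perception tools (OCR, object detection, memory retrieval) are essential for capturing fine details and improving reasoning.
Agentic RL training internalizes tool-use strategies, making tool calls during inference more flexible and effective.
On OVO-Bench, EventMemAgent outperforms open-source models and approaches or is competitive with proprietary models in Real-Time Visual Perception, Backward Tracing, and Forward Active Responding.
On StreamingBench, it performs strongly across diverse real-time understanding tasks with a limited number of input frames.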
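The RL finding above rests on GRPO-style training, whose central step is to score each sampled rollout relative to the other rollouts drawn for the same question. The sketch below shows only that group-relative normalization. The reward values and the reward rule in the comment are hypothetical, since this review does not describe the paper's reward design, and the use of the population standard deviation is one common choice rather than a confirmed detail.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations vary
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four rollouts answering the same question:
# 1.0 for a correct answer, plus 0.1 if the tool-call format was valid.
rewards = [1.1, 0.1, 0.0, 1.1]
advantages = group_relative_advantages(rewards)
```

Rollouts that beat their group's mean receive positive advantages and the rest negative ones, so no separate value network is needed. If every rollout in a group earns the same reward, all advantages are zero and that group contributes no learning signal.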

This review was generated by AI and checked by a human editor.