QUICK REVIEW

[Paper Review] ViKey: Enhancing Temporal Understanding in Videos via Visual Prompting

Yeonkyung Lee, Dayun Ju|arXiv (Cornell University)|Mar 24, 2026

Multimodal Machine Learning Applications|Cited by 0

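One-sentence summary

ViKey introduces sequential frame-index visual prompts and a Keyword-Frame Mapping (KFM) module to strengthen the temporal understanding of VideoLLMs. It requires no training, works plug-and-play, and yields better reasoning even with sparsely sampled frames.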

ABSTRACT

Recent advancements in Video Large Language Models (VideoLLMs) have enabled strong performance across diverse multimodal video tasks. To reduce the high computational cost of processing dense video frames, efficiency-oriented methods such as frame selection have been widely adopted. While effective at minimizing redundancy, these methods often cause notable performance drops on tasks requiring temporal reasoning. Unlike humans, who can infer event progression from sparse visual cues, VideoLLMs frequently misinterpret temporal relations when intermediate frames are omitted. To address this limitation, we explore visual prompting (VP) as a lightweight yet effective way to enhance temporal understanding in VideoLLMs. Our analysis reveals that simply annotating each frame with explicit ordinal information helps the model perceive temporal continuity. This visual cue also supports frame-level referencing and mitigates positional ambiguity within a sparsely sampled sequence. Building on these insights, we introduce ViKey, a training-free framework that combines VP with a lightweight Keyword-Frame Mapping (KFM) module. KFM leverages frame indices as dictionary-like keys to link textual cues to the most relevant frames, providing explicit temporal anchors during inference. Despite its simplicity, our approach substantially improves temporal reasoning and, on some datasets, preserves dense-frame baseline performance with as few as 20% of frames.

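Research motivation and objectives

Address the drop in temporal reasoning that VideoLLMs suffer when frame sampling density is reduced.
Investigate whether visual prompting (VP) can restore temporal continuity without retraining the model.
Propose a lightweight framework that combines visual prompts with a dictionary-style mapping over frame indices.
Evaluate the approach across diverse temporal reasoning benchmarks and VideoLLMs.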

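Proposed method

Overlay a sequential frame-index prompt (e.g., frame #01) on each input frame, leaving the model parameters unchanged.
Develop KFM (Keyword-Frame Mapping), which links salient query terms to the most relevant frames through a shared embedding space.
Rewrite the user query to include the mapped frame indices, providing explicit temporal anchors at inference time.
Analyze degraded positional embeddings, frame-level referencing, and attention patterns to understand the effect of VP.
Demonstrate training-free, plug-and-play applicability across multiple VideoLLMs and video tasks.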
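
The labeling, mapping, and query-rewriting steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy 3-dimensional embeddings, and the rewritten-query format are all assumptions made for illustration. In practice the keyword and frame vectors would come from a shared text-image encoder, and the labels would be drawn onto the frames with an image library.

```python
import math
import re


def frame_labels(num_frames):
    """Return ordinal labels such as 'frame #01' for each sampled frame."""
    width = max(2, len(str(num_frames)))  # zero-pad to at least two digits
    return [f"frame #{i:0{width}d}" for i in range(1, num_frames + 1)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def map_keywords_to_frames(keyword_vecs, frame_vecs):
    """Map each keyword to the 0-based index of its most similar frame."""
    mapping = {}
    for keyword, vec in keyword_vecs.items():
        scores = [cosine(vec, frame_vec) for frame_vec in frame_vecs]
        mapping[keyword] = max(range(len(scores)), key=scores.__getitem__)
    return mapping


def rewrite_query(query, mapping, labels):
    """Append the mapped frame label after the first mention of each keyword."""
    for keyword, idx in mapping.items():
        pattern = rf"\b{re.escape(keyword)}\b"
        query = re.sub(
            pattern,
            lambda m: f"{m.group(0)} ({labels[idx]})",
            query,
            count=1,
        )
    return query


# Toy data (hypothetical): three frames and two query keywords.
frame_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
keyword_vecs = {"door": [0.9, 0.1, 0.0], "cup": [0.0, 0.2, 0.8]}

labels = frame_labels(len(frame_vecs))
mapping = map_keywords_to_frames(keyword_vecs, frame_vecs)
rewritten = rewrite_query(
    "Did the person open the door before picking up the cup?", mapping, labels
)
print(rewritten)
# Did the person open the door (frame #01) before picking up the cup (frame #03)?
```

Because the frame labels act as dictionary-like keys, the rewritten query points the model at specific frames without touching its weights, which is consistent with the training-free design described above.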

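Experimental results

Research questions

RQ1: Can visual prompts restore frame-order awareness when temporal positional encodings are degraded?
RQ2: Do frame-number prompts enable dictionary-style frame referencing and dereferencing in VideoLLMs?
RQ3: How do visual prompts affect cross-modal attention and temporal grounding in VideoLLMs?
RQ4: Does combining VP with KFM improve temporal reasoning on sparse-frame inputs without retraining?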

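Key findings

Visual prompts consistently improve temporal understanding under degraded positional information, with gains of 2.9 to 9.9 points on the validation set.
VP enables frame referencing and dereferencing, with large gains as the number of frames increases (reaching perfect accuracy for some placements).
Bottom-left and bottom-right prompt placements achieve high accuracy on both the referencing and dereferencing tasks, revealing a positional bias.
VP increases attention to image tokens across layers and strengthens spatiotemporal integration in the middle-to-late layers.
Combining VP with KFM gives the best results, outperforming the baselines on TempCompass, MVBench, VideoMME, and LongVideoBench and remaining strong with only 20% of the frames.
On some datasets, ViKey with sparse frames matches or exceeds the dense-frame baseline, indicating robustness to input reduction.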
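
To make the 20% frame budget concrete, here is a small sketch of how a sparse subset could be chosen. Uniform sampling is an assumption used only for illustration; the review does not state which frame-selection strategy the paper pairs with ViKey, and the function name is hypothetical.

```python
def uniform_sample_indices(total_frames, keep_ratio):
    """Pick evenly spaced frame indices, keeping roughly keep_ratio of them."""
    if total_frames <= 0:
        return []
    keep = max(1, round(total_frames * keep_ratio))
    if keep == 1:
        return [0]
    step = (total_frames - 1) / (keep - 1)  # spacing that spans first to last frame
    return [round(i * step) for i in range(keep)]


# Keeping 20% of a 20-frame clip leaves 4 frames.
print(uniform_sample_indices(20, 0.2))  # [0, 6, 13, 19]
```

With gaps this large between the retained frames, the intermediate motion is no longer visible, which is the setting where explicit ordinal cues are reported to help.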

This review was generated by AI and checked by a human editor.