QUICK REVIEW

[Paper Review] How does longer temporal context enhance multimodal narrative video processing in the brain?

Prachi Jindal, Anant Khandelwal|arXiv (Cornell University)|Feb 7, 2026

Action Observation and Synchronization | Citations: 0

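One-line summary

This study shows that longer temporal context (3 to 12 s clips) improves the brain alignment of multimodal video-audio LLMs during naturalistic movie watching, with ROI- and layer-dependent patterns, while unimodal video models gain almost nothing.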

ABSTRACT

Understanding how humans and artificial intelligence systems process complex narrative videos is a fundamental challenge at the intersection of neuroscience and machine learning. This study investigates how the temporal context length of video clips (3--12 s clips) and the narrative-task prompting shape brain-model alignment during naturalistic movie watching. Using fMRI recordings from participants viewing full-length movies, we examine how brain regions sensitive to narrative context dynamically represent information over varying timescales and how these neural patterns align with model-derived features. We find that increasing clip duration substantially improves brain alignment for multimodal large language models (MLLMs), whereas unimodal video models show little to no gain. Further, shorter temporal windows align with perceptual and early language regions, while longer windows preferentially align higher-order integrative regions, mirrored by a layer-to-cortex hierarchy in MLLMs. Finally, narrative-task prompts (multi-scene summary, narrative summary, character motivation, and event boundary detection) elicit task-specific, region-dependent brain alignment patterns and context-dependent shifts in clip-level tuning in higher-order regions. Together, our results position long-form narrative movies as a principled testbed for probing biologically relevant temporal integration and interpretable representations in long-context MLLMs.

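Motivation and objectives

Understand how humans and AI systems process long-form narrative videos, and what role temporal context plays in brain-model alignment.
Evaluate the brain predictivity of multimodal video-audio LLMs and unimodal video models across different clip lengths.
Investigate how narrative-task prompts shape ROI-specific brain alignment and its correspondence with model layers.
Identify which video clips and prompts most strongly drive voxel responses, in order to understand context-dependent representations.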

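Proposed method

Two pretrained video-audio MLLMs (Qwen-2.5-Omni and DATE) and two unimodal baselines (TimeSFormer, VideoMAE) are used to generate representations over sliding temporal windows (3, 6, 9, and 12 s) with a stride of 1.49 s.
Representations are extracted from every Transformer layer, and tokens are averaged per window and per task instruction.
Voxel-wise encoding models are built with bootstrap ridge regression to predict fMRI responses from the stimulus representations.
Cross-subject prediction accuracy is estimated and used to normalize brain alignment across subjects.
Four narrative tasks (Character Motivation, Event Boundary Detection, Multi-Scene Summary, Narrative Summary) are evaluated as prompts to obtain task-specific representations.
Layer-wise and ROI-specific alignment is analyzed to examine temporal gradients and the cortical hierarchy.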
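
The steps above (sliding windows, token pooling, voxel-wise ridge encoding, normalization) can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the authors' pipeline. Several things are assumptions: each window is taken to end at the TR it is paired with, a single fixed `alpha` stands in for the bootstrap ridge procedure (the bootstrap selection of the penalty is omitted), and normalization is taken to be a division by a per-voxel cross-subject accuracy. All function names are hypothetical.

```python
import numpy as np

TR = 1.49  # seconds; the window stride given in the method summary

def window_bounds(n_trs, window_s, stride_s=TR):
    """(start, end) times in seconds of the clip paired with each TR.

    Assumption: a window ends at its TR and is clipped at the movie start.
    """
    bounds = []
    for i in range(n_trs):
        end = (i + 1) * stride_s
        bounds.append((max(0.0, end - window_s), end))
    return bounds

def pool_tokens(token_feats):
    """Mean-pool one layer's token features (n_tokens, d) into a vector (d,)."""
    return token_feats.mean(axis=0)

def fit_ridge(X, Y, alpha):
    """Closed-form ridge, W = (X^T X + alpha I)^-1 X^T Y, one column per voxel."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)

def voxelwise_pearson(Y_true, Y_pred):
    """Pearson r between measured and predicted time courses, per voxel."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.linalg.norm(yt, axis=0) * np.linalg.norm(yp, axis=0)
    return (yt * yp).sum(axis=0) / np.where(denom == 0, 1.0, denom)

def normalized_alignment(r_model, r_ceiling):
    """Assumed normalization: model accuracy divided by cross-subject accuracy."""
    return r_model / r_ceiling

# Synthetic stand-ins: 200 TRs, 16-dim pooled features, 5 voxels.
rng = np.random.default_rng(0)
n_trs, d, n_vox = 200, 16, 5
X = np.stack([pool_tokens(rng.normal(size=(8, d))) for _ in range(n_trs)])
Y = X @ rng.normal(size=(d, n_vox)) + 0.1 * rng.normal(size=(n_trs, n_vox))

# Contiguous train/test split in time, then fit and score on held-out TRs.
W = fit_ridge(X[:150], Y[:150], alpha=1.0)
r = voxelwise_pearson(Y[150:], X[150:] @ W)

# Hypothetical per-voxel ceiling; in practice it would come from other subjects.
r_ceiling = np.full(n_vox, 0.9)
print(normalized_alignment(r, r_ceiling).mean())
```

In a real analysis the same fit would be repeated for every layer, window length, and prompt, which is what makes the layer-wise and ROI-wise comparisons possible.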

Figure 1: Leveraging temporal video context of different durations ( $X_{\text{windows}}$ ) with unimodal and multimodal models for brain encoding with a diverse set of instructions (prompts). We experiment with 4 narrative video understanding tasks: character motivation, event boundary detection, m

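Experimental results

Research questions

RQ1: During naturalistic movie watching, how does increasing temporal context length affect the brain predictivity of multimodal and unimodal video models?
RQ2: Which brain regions show gains or shifts in optimal context length, and how do these relate to MLLM layer representations?
RQ3: How do narrative-task prompts affect brain alignment, and do they separate into ROI-specific patterns?
RQ4: Which video clips most strongly drive voxel responses across contexts and tasks, and how do the patterns vary by ROI?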

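Key findings

Longer temporal context improves brain alignment for video-audio LLMs (a relative gain of about 26% for Qwen-2.5-Omni and about 19% for DATE, with no change for the unimodal baselines).
Long windows (12 s) align better with higher-order semantic regions (e.g., PCC, dmPFC), while intermediate windows (6 s) favor perceptual and early language regions (e.g., PTL).
Narrative-task prompts produce task-specific, ROI-dependent alignment patterns: Narrative and Multi-Scene Summaries engage higher-order regions, Character Motivation engages temporal language regions, and Event Boundary Detection is more localized.
Layer-wise analysis reveals a cortical language hierarchy: across all temporal contexts, deeper layers align with higher-order brain regions and early layers align with sensory regions.
Visual ROIs show stable clip preferences regardless of context, whereas higher-order regions (AG, PCC) shift with context and prompt.
The top-activating clips for voxel responses are stable in visual regions but shift in higher-order regions as context expands, so semantic sensitivity becomes context-dependent.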
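
Two of the quantities in the findings above can be made concrete with a short sketch. It assumes that the relative gain compares alignment at the shortest and longest windows, and it uses Jaccard overlap of top-k clips as one possible way to quantify how stable clip preferences are; the review does not state how the paper measures either. All numbers are hypothetical and are not values from the paper.

```python
def relative_gain(short_ctx, long_ctx):
    """Relative change in alignment from the shortest to the longest window."""
    return (long_ctx - short_ctx) / short_ctx

def top_k_clips(responses, k):
    """Indices of the k clips with the largest predicted response for a voxel."""
    order = sorted(range(len(responses)), key=lambda i: responses[i], reverse=True)
    return set(order[:k])

def jaccard(a, b):
    """Overlap of two clip sets: 1.0 means identical, 0.0 means disjoint."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Hypothetical alignment scores at 3 s and 12 s (illustration only).
gain = relative_gain(0.50, 0.63)  # close to 0.26, i.e. about a 26% relative gain

# Hypothetical predicted responses to five clips at two context lengths.
visual_3s, visual_12s = [0.9, 0.1, 0.8, 0.2, 0.7], [0.85, 0.15, 0.9, 0.1, 0.6]
higher_3s, higher_12s = [0.9, 0.1, 0.8, 0.2, 0.7], [0.1, 0.9, 0.2, 0.8, 0.7]

# Stable preferences give high overlap; shifting preferences give low overlap.
visual_overlap = jaccard(top_k_clips(visual_3s, 3), top_k_clips(visual_12s, 3))
higher_overlap = jaccard(top_k_clips(higher_3s, 3), top_k_clips(higher_12s, 3))
print(gain, visual_overlap, higher_overlap)
```

Under this measure, a visual ROI with stable clip preferences would score near 1 across window lengths, while a higher-order ROI whose preferred clips move with context would score much lower.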

Figure 2: Average normalized brain alignment as a function of temporal window length (3 to 12s) for MLLMs, and unimodal video baselines. MLLMs show increasing alignment with longer windows, while unimodal video models remain approximately constant. Error bars denote variability across subjects (mean

This review was generated by AI and checked by a human editor.