QUICK REVIEW

[Paper Review] Robust Egocentric Visual Attention Prediction Through Language-guided Scene Context-aware Learning

Sungjune Park, Hongda Mao|arXiv (Cornell University)|Jan 5, 2026

Visual Attention and Saliency Detection|Citations: 0

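One-sentence summary

This paper proposes a language-guided, scene context-aware framework with a context perceiver for egocentric visual attention prediction. It focuses on point-of-interest (PoI) regions, suppresses distractions, and achieves state-of-the-art results on Ego4D and AEA.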

ABSTRACT

As the demand for analyzing egocentric videos grows, egocentric visual attention prediction, anticipating where a camera wearer will attend, has garnered increasing attention. However, it remains challenging due to the inherent complexity and ambiguity of dynamic egocentric scenes. Motivated by evidence that scene contextual information plays a crucial role in modulating human attention, in this paper, we present a language-guided scene context-aware learning framework for robust egocentric visual attention prediction. We first design a context perceiver which is guided to summarize the egocentric video based on a language-based scene description, generating context-aware video representations. We then introduce two training objectives that: 1) encourage the framework to focus on the target point-of-interest regions and 2) suppress distractions from irrelevant regions which are less likely to attract first-person attention. Extensive experiments on Ego4D and Aria Everyday Activities (AEA) datasets demonstrate the effectiveness of our approach, achieving state-of-the-art performance and enhanced robustness across diverse, dynamic egocentric scenarios.

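Motivation and objectives

Use global scene context to make egocentric visual attention prediction robust.
Incorporate language-derived scene descriptions to guide context understanding.
Improve focus on the target PoI region and suppress attention to distractors.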

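Proposed method

Introduces a context perceiver consisting of a context summary extractor and a context summary guider driven by language-based scene descriptions.
Precomputes scene descriptions with VideoChat2 and embeds them with NV-Embed-v2 to guide context extraction.
Applies a context encoding loss that aligns the context tokens with the scene description.
Uses a negative region loss that contrasts the target PoI with nearby pseudo-negatives.
Uses a region suppression loss that encourages high activation at the PoI and suppresses activation elsewhere.
Evaluates on Ego4D and AEA with an MViT-based encoder and a transformer-based decoder.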
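
The review names the losses but does not give their formulas, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that alignment between context tokens and the scene description can be illustrated with a cosine distance on mean-pooled tokens, and that PoI-versus-elsewhere suppression can be illustrated with a per-pixel binary cross-entropy. The function names, the pooling choice, and both loss forms are hypothetical stand-ins.

```python
import math


def cosine_alignment_sketch(context_tokens, text_embedding):
    """Hypothetical alignment loss: 1 - cosine similarity between the
    mean-pooled context tokens and a scene-description embedding.
    (Assumption: the paper's context encoding loss may differ.)"""
    dim = len(text_embedding)
    n = len(context_tokens)
    # Mean-pool the tokens into a single vector of length `dim`.
    pooled = [sum(tok[d] for tok in context_tokens) / n for d in range(dim)]
    dot = sum(p * t for p, t in zip(pooled, text_embedding))
    norm = math.sqrt(sum(p * p for p in pooled)) * math.sqrt(
        sum(t * t for t in text_embedding)
    )
    return 1.0 - dot / (norm + 1e-8)


def bce_suppression_sketch(heatmap, poi_mask):
    """Hypothetical suppression loss: mean binary cross-entropy that pushes
    activations toward 1 inside the PoI mask and toward 0 elsewhere.
    (Assumption: a stand-in for the region suppression loss.)"""
    eps = 1e-8
    total, count = 0.0, 0
    for row_h, row_m in zip(heatmap, poi_mask):
        for h, m in zip(row_h, row_m):
            total += -(m * math.log(h + eps) + (1 - m) * math.log(1 - h + eps))
            count += 1
    return total / count


# Toy inputs: two 2-d context tokens, one 2-d text embedding, a 2x2 heatmap.
tokens = [[1.0, 0.0], [0.8, 0.2]]
text = [1.0, 0.0]
mask = [[1, 0], [0, 0]]
focused = [[0.9, 0.1], [0.1, 0.1]]     # activation on the PoI
distracted = [[0.1, 0.9], [0.1, 0.1]]  # activation on a distractor

print(cosine_alignment_sketch(tokens, text))
print(bce_suppression_sketch(focused, mask))
print(bce_suppression_sketch(distracted, mask))
```

In this toy setup the heatmap that fires on the PoI gets a lower loss than the one that fires on a distractor, which is the behavior the method description attributes to the suppression objective.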

Figure 1: An example showing how contextual cues help predict the point-of-interest region. When humans observe the given scene (left), they can understand the scene context: a red bowl with an egg mixture and a whisk in hand. Therefore, humans easily infer that the red bowl will likely become t

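Experimental results

Research questions

RQ1: How can language-guided scene context improve egocentric visual attention prediction?
RQ2: Can the context perceiver effectively translate scene descriptions into context-aware video features?
RQ3: Do the negative region loss and the region suppression loss contribute to PoI localization and to reducing distraction?
RQ4: How does the proposed method perform on Ego4D and AEA, including unseen-data scenarios?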

Key findings

Method	Ego4D F1	Ego4D Recall	Ego4D Precision	AEA F1	AEA Recall	AEA Precision
GazeMLE (flow)	36.3	52.5	27.8	56.8	64.1	51.0
AttnTrans (flow)	37.0	55.0	27.9	57.4	65.5	51.0
CSTS (audio)	39.7	53.3	31.6	59.9	66.8	54.3
I3D-R50	36.9	52.1	28.6	57.4	63.6	52.2
DFG	37.2	53.2	28.6	57.4	63.6	52.3
MViT	37.2	54.1	28.3	57.5	62.4	53.3
DFG+	37.3	52.3	29.0	57.6	65.5	51.3
GLC	37.8	52.9	29.4	58.3	65.4	52.6
Ours	40.1	54.1	31.9	60.3	67.2	54.7
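
As a quick sanity check on the table, the F1 columns can be compared with the harmonic mean of the reported precision and recall. This is a small sketch under the assumption that the reported F1 is computed from the same aggregate precision and recall shown here; the review does not state the exact evaluation protocol, and the helper name `f1_from_pr` is hypothetical.

```python
def f1_from_pr(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above.
rows = {
    "Ours (Ego4D)": (31.9, 54.1, 40.1),
    "Ours (AEA)": (54.7, 67.2, 60.3),
    "CSTS (Ego4D)": (31.6, 53.3, 39.7),
    "CSTS (AEA)": (54.3, 66.8, 59.9),
}

for name, (p, r, reported) in rows.items():
    computed = f1_from_pr(p, r)
    print(f"{name}: computed {computed:.2f}, reported {reported}")
```

For these rows the computed values agree with the reported F1 to within rounding of the one-decimal inputs.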

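The method reaches 40.1 F1 on Ego4D and 60.3 F1 on AEA, with high recall and competitive precision. In the table above, its precision is the highest on both datasets, while its Ego4D recall (54.1) is below that of AttnTrans (55.0).
It outperforms the baselines, as well as methods that use auxiliary modalities at inference (e.g., audio or flow), in both the zero-shot and standard settings.
Ablations show that the negative region loss, the region suppression loss, and the context perceiver each bring a benefit; combined, they gain +2.7 F1 on Ego4D and +2.6 F1 on AEA over the baseline.
The context summary tokens align semantically with the scene descriptions, indicating that language guidance succeeds in capturing context.
Zero-shot evaluation (trained on Ego4D, tested on unseen AEA) yields 53.7 F1, indicating robust generalization.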

Figure 2: Examples of scene summary descriptions, which include location, action, and object information (e.g., living room, reaching for a remote control, and TV) related to the first person.

This review was generated by AI and checked by a human editor.