QUICK REVIEW

[Paper Review] Robust Egocentric Visual Attention Prediction Through Language-guided Scene Context-aware Learning

Sungjune Park, Hongda Mao | arXiv (Cornell University) | Jan 5, 2026
Visual Attention and Saliency Detection | Cited by 0
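ONE-SENTENCE SUMMARY

The paper introduces a language-guided, scene context-aware framework, equipped with a context perceiver, for predicting egocentric visual attention. By focusing on point-of-interest (PoI) regions and suppressing distractions, it achieves state-of-the-art results on Ego4D and AEA.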

ABSTRACT

As the demand for analyzing egocentric videos grows, egocentric visual attention prediction, anticipating where a camera wearer will attend, has garnered increasing attention. However, it remains challenging due to the inherent complexity and ambiguity of dynamic egocentric scenes. Motivated by evidence that scene contextual information plays a crucial role in modulating human attention, in this paper, we present a language-guided scene context-aware learning framework for robust egocentric visual attention prediction. We first design a context perceiver which is guided to summarize the egocentric video based on a language-based scene description, generating context-aware video representations. We then introduce two training objectives that: 1) encourage the framework to focus on the target point-of-interest regions and 2) suppress distractions from irrelevant regions which are less likely to attract first-person attention. Extensive experiments on Ego4D and Aria Everyday Activities (AEA) datasets demonstrate the effectiveness of our approach, achieving state-of-the-art performance and enhanced robustness across diverse, dynamic egocentric scenarios.

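MOTIVATION AND OBJECTIVES

  • Advance robust egocentric visual attention prediction by exploiting global scene context.
  • Integrate language-derived scene descriptions to guide context understanding.
  • Strengthen attention on target PoI regions while suppressing attention to distracting regions.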

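PROPOSED METHOD

  • Introduce a context perceiver consisting of a context summary extractor and a context summary guider that is guided by language-based scene descriptions.
  • Precompute scene descriptions with VideoChat2 and embed them with NV-Embed-v2 to guide context extraction.
  • Apply a context encoding loss to align the context tokens with the scene descriptions.
  • Use a negative region loss to contrast the target PoI with nearby pseudo-negative samples.
  • Use a region suppression loss to encourage high activation on the PoI and suppress other regions.
  • Evaluate on Ego4D and AEA with an MViT-based encoder and a transformer-based decoder.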
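The three training signals listed above can be made concrete with a small sketch. This review does not give the paper's loss formulas, so everything below is an illustrative assumption rather than the authors' formulation: the context encoding loss is approximated as one minus the cosine similarity between mean-pooled context tokens and the description embedding, the negative region loss as a hinge margin between the PoI score and pseudo-negative scores, and the region suppression loss as a penalty on low activation inside the PoI plus activation outside it. All function names, the `margin` parameter, and the input shapes are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def context_encoding_loss(context_tokens, description_embedding):
    """Assumed form: 1 - cos(mean-pooled context tokens, description embedding)."""
    n = len(context_tokens)
    pooled = [
        sum(token[i] for token in context_tokens) / n
        for i in range(len(description_embedding))
    ]
    return 1.0 - cosine_similarity(pooled, description_embedding)


def negative_region_loss(poi_score, negative_scores, margin=0.5):
    """Assumed form: hinge loss asking the PoI score to beat each
    pseudo-negative score by at least `margin` (hypothetical parameter)."""
    penalties = [max(0.0, margin - (poi_score - s)) for s in negative_scores]
    return sum(penalties) / len(penalties)


def region_suppression_loss(heatmap, poi_mask):
    """Assumed form: (1 - mean activation inside the PoI) + mean activation
    outside it. Both regions are assumed to be non-empty."""
    inside, outside = [], []
    for heat_row, mask_row in zip(heatmap, poi_mask):
        for value, is_poi in zip(heat_row, mask_row):
            (inside if is_poi else outside).append(value)
    mean_inside = sum(inside) / len(inside)
    mean_outside = sum(outside) / len(outside)
    return (1.0 - mean_inside) + mean_outside


# Toy 2x2 attention map whose only active cell is the PoI cell.
heatmap = [[1.0, 0.0], [0.0, 0.0]]
poi_mask = [[1, 0], [0, 0]]
print(region_suppression_loss(heatmap, poi_mask))
```

In this toy case the suppression term is zero because all activation sits inside the PoI. A real implementation would operate on batched tensors in a deep learning framework and would follow the definitions in the paper itself.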
Figure 1 : An example showing how contextual cues help predict the point-of-interest region. When humans observe the given scene ( left ), humans can understand the scene context–a red bowl with an egg mixture and a whisk in hand. Therefore, humans easily infer that the red bowl will likely become t

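EXPERIMENTAL RESULTS

Research Questions

  • RQ1: How does language-guided scene context improve egocentric visual attention prediction?
  • RQ2: Can the context perceiver effectively turn scene descriptions into context-aware video features?
  • RQ3: Do the negative region loss and region suppression loss improve PoI localization and reduce distraction?
  • RQ4: How does the proposed method perform on Ego4D and AEA, including on unseen data?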

Key Findings

Method            Ego4D F1  Ego4D Recall  Ego4D Precision  AEA F1  AEA Recall  AEA Precision
GazeMLE (flow)    36.3      52.5          27.8             56.8    64.1        51.0
AttnTrans (flow)  37.0      55.0          27.9             57.4    65.5        51.0
CSTS (audio)      39.7      53.3          31.6             59.9    66.8        54.3
I3D-R50           36.9      52.1          28.6             57.4    63.6        52.2
DFG               37.2      53.2          28.6             57.4    63.6        52.3
MViT              37.2      54.1          28.3             57.5    62.4        53.3
DFG+              37.3      52.3          29.0             57.6    65.5        51.3
GLC               37.8      52.9          29.4             58.3    65.4        52.6
Ours              40.1      54.1          31.9             60.3    67.2        54.7
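  • Achieves state-of-the-art F1 of 40.1 on Ego4D and 60.3 on AEA, with the highest precision on both datasets and the highest recall on AEA; on Ego4D, recall (54.1) is competitive but below AttnTrans (flow) at 55.0.
  • Outperforms baselines and methods that use auxiliary modalities (such as audio or optical flow) in both zero-shot and standard settings.
  • Ablation studies show that the negative region loss, the region suppression loss, and the context perceiver each bring gains; combining all three improves F1 over the baseline by +2.7 on Ego4D and +2.6 on AEA.
  • The context summary tokens align meaningfully with the scene descriptions, indicating that the language-guided context capture works as intended.
  • Zero-shot evaluation (trained on Ego4D, tested on the unseen AEA) yields an F1 of 53.7, demonstrating robust generalization.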
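As a quick consistency check on the reported numbers, the sketch below recomputes F1 from the precision and recall values of the "Ours" row. It assumes F1 is the harmonic mean of the reported precision and recall; the paper may aggregate F1 differently (for example, per frame), so other rows can differ in the last decimal. The function name `f1_score` is chosen here for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (returns 0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# (precision, recall, reported F1) taken from the "Ours" row of the results.
reported = {
    "Ours (Ego4D)": (31.9, 54.1, 40.1),
    "Ours (AEA)": (54.7, 67.2, 60.3),
}

for name, (precision, recall, reported_f1) in reported.items():
    recomputed = round(f1_score(precision, recall), 1)
    print(f"{name}: recomputed F1 = {recomputed}, reported F1 = {reported_f1}")
```

Under that assumption, the recomputed values agree with the reported F1 scores to one decimal place on both datasets.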
Figure 2 : The examples of scene summary descriptions, which include location, action, and object information (e.g., living room, reaching for a remote control, and TV) related with the first person.


This review was generated by AI and reviewed by human editors.