QUICK REVIEW

[Paper Review] Robust Egocentric Visual Attention Prediction Through Language-guided Scene Context-aware Learning

Sungjune Park, Hongda Mao | arXiv (Cornell University) | 2026. 01. 05.

Visual Attention and Saliency Detection | Citations: 0

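One-Line Summary

This paper introduces a context perceiver within a language-guided, scene context-aware framework that focuses on points of interest (PoIs) and suppresses distractions, achieving state-of-the-art egocentric attention prediction results on Ego4D and AEA.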

ABSTRACT

As the demand for analyzing egocentric videos grows, egocentric visual attention prediction, anticipating where a camera wearer will attend, has garnered increasing attention. However, it remains challenging due to the inherent complexity and ambiguity of dynamic egocentric scenes. Motivated by evidence that scene contextual information plays a crucial role in modulating human attention, in this paper, we present a language-guided scene context-aware learning framework for robust egocentric visual attention prediction. We first design a context perceiver which is guided to summarize the egocentric video based on a language-based scene description, generating context-aware video representations. We then introduce two training objectives that: 1) encourage the framework to focus on the target point-of-interest regions and 2) suppress distractions from irrelevant regions which are less likely to attract first-person attention. Extensive experiments on Ego4D and Aria Everyday Activities (AEA) datasets demonstrate the effectiveness of our approach, achieving state-of-the-art performance and enhanced robustness across diverse, dynamic egocentric scenarios.

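Motivation and Goals

Use global scene context to make egocentric visual attention prediction robust
Introduce language-derived scene descriptions to guide context understanding
Improve focus on target PoI regions and suppress attention to distractors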

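Proposed Method

Introduces a context perceiver, consisting of a context summary extractor and a context summary guider, that is guided by language-based scene descriptions
Precomputes scene descriptions with VideoChat2 and embeds them with NV-Embed-v2 to guide context extraction
Applies a context encoding loss that aligns the context tokens with the scene description
Introduces a negative region loss that contrasts the target PoI with nearby false-positive regions
Introduces a region suppression loss that encourages high activation on PoIs and suppresses other regions
Evaluates on Ego4D and AEA with an MViT-based encoder and a transformer-based decoder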
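
The review names the three training losses but does not give their formulas. The sketch below is therefore only one plausible reading, written in plain Python for readability; a real implementation would use a tensor library. Every function name, the cosine form of the alignment term, the hinge form of the contrast term, and the `margin` value are assumptions made for illustration, not the authors' definitions.

```python
import math

def cosine_alignment_loss(context_token, text_embedding):
    """Assumed form of a context encoding loss: 1 - cosine similarity.
    Both inputs are non-zero vectors of equal length."""
    dot = sum(c * t for c, t in zip(context_token, text_embedding))
    norm_c = math.sqrt(sum(c * c for c in context_token))
    norm_t = math.sqrt(sum(t * t for t in text_embedding))
    return 1.0 - dot / (norm_c * norm_t)

def masked_mean(heatmap, mask):
    """Mean heatmap activation over cells where mask is 1 (mask must be non-empty)."""
    values = [h for h_row, m_row in zip(heatmap, mask)
              for h, m in zip(h_row, m_row) if m]
    return sum(values) / len(values)

def negative_region_loss(heatmap, poi_mask, neg_mask, margin=0.5):
    """Assumed hinge contrast: the PoI mean activation should exceed the
    negative-region mean activation by at least `margin` (hypothetical value)."""
    gap = masked_mean(heatmap, poi_mask) - masked_mean(heatmap, neg_mask)
    return max(0.0, margin - gap)

def region_suppression_loss(heatmap, poi_mask):
    """Assumed suppression term: mean activation outside the PoI minus
    mean activation inside it, so lower is better."""
    background = [[1 - m for m in row] for row in poi_mask]
    return masked_mean(heatmap, background) - masked_mean(heatmap, poi_mask)

# Toy 3x3 attention heatmap with a PoI in the top-left corner.
heatmap = [[0.9, 0.8, 0.1],
           [0.7, 0.2, 0.1],
           [0.1, 0.1, 0.0]]
poi_mask = [[1, 1, 0],
            [1, 0, 0],
            [0, 0, 0]]
# Hypothetical "nearby distractor" cells bordering the PoI.
neg_mask = [[0, 0, 1],
            [0, 1, 0],
            [0, 0, 0]]

print("alignment:", cosine_alignment_loss([1.0, 0.0], [0.6, 0.8]))
print("negative region:", negative_region_loss(heatmap, poi_mask, neg_mask))
print("suppression:", region_suppression_loss(heatmap, poi_mask))
```

In this toy case the heatmap already separates the PoI from its neighbours by more than the margin, so the hinge term vanishes, while the suppression term is negative because activation is concentrated inside the PoI.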

Figure 1: An example showing how contextual cues help predict the point-of-interest region. When humans observe the given scene (left), humans can understand the scene context–a red bowl with an egg mixture and a whisk in hand. Therefore, humans easily infer that the red bowl will likely become t

Experimental Results

Research Questions

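RQ1: What benefit does language-guided scene context bring to egocentric visual attention prediction?
RQ2: Can the context perceiver effectively translate scene descriptions into context-aware video features?
RQ3: Do the negative region loss and the region suppression loss contribute to PoI localization and distraction reduction?
RQ4: How does the proposed method perform on Ego4D and AEA, including on unseen data?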

Main Results

Method	Ego4D F1	Ego4D Recall	Ego4D Precision	AEA F1	AEA Recall	AEA Precision
GazeMLE (flow)	36.3	52.5	27.8	56.8	64.1	51.0
AttnTrans (flow)	37.0	55.0	27.9	57.4	65.5	51.0
CSTS (audio)	39.7	53.3	31.6	59.9	66.8	54.3
I3D-R50	36.9	52.1	28.6	57.4	63.6	52.2
DFG	37.2	53.2	28.6	57.4	63.6	52.3
MViT	37.2	54.1	28.3	57.5	62.4	53.3
DFG+	37.3	52.3	29.0	57.6	65.5	51.3
GLC	37.8	52.9	29.4	58.3	65.4	52.6
Ours	40.1	54.1	31.9	60.3	67.2	54.7
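
The review does not state how F1 is computed. Assuming the standard harmonic mean of precision and recall, the F1 columns can be checked against the other two columns of the table above. The sketch below copies the table values and reports the largest gap between the reported and recomputed F1; small gaps are expected because precision and recall are themselves rounded to one decimal.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (assumed definition of F1)."""
    return 2 * precision * recall / (precision + recall)

# (method, Ego4D F1, Ego4D recall, Ego4D precision, AEA F1, AEA recall, AEA precision)
rows = [
    ("GazeMLE (flow)",   36.3, 52.5, 27.8, 56.8, 64.1, 51.0),
    ("AttnTrans (flow)", 37.0, 55.0, 27.9, 57.4, 65.5, 51.0),
    ("CSTS (audio)",     39.7, 53.3, 31.6, 59.9, 66.8, 54.3),
    ("I3D-R50",          36.9, 52.1, 28.6, 57.4, 63.6, 52.2),
    ("DFG",              37.2, 53.2, 28.6, 57.4, 63.6, 52.3),
    ("MViT",             37.2, 54.1, 28.3, 57.5, 62.4, 53.3),
    ("DFG+",             37.3, 52.3, 29.0, 57.6, 65.5, 51.3),
    ("GLC",              37.8, 52.9, 29.4, 58.3, 65.4, 52.6),
    ("Ours",             40.1, 54.1, 31.9, 60.3, 67.2, 54.7),
]

max_gap = 0.0
for method, f1_e, r_e, p_e, f1_a, r_a, p_a in rows:
    gap_e = abs(f1_score(p_e, r_e) - f1_e)  # Ego4D: reported vs recomputed
    gap_a = abs(f1_score(p_a, r_a) - f1_a)  # AEA: reported vs recomputed
    max_gap = max(max_gap, gap_e, gap_a)
    print(f"{method:18s} Ego4D gap {gap_e:.3f}  AEA gap {gap_a:.3f}")

print(f"largest gap: {max_gap:.3f}")
```

Under this assumption every row agrees with its reported F1 to within 0.1, which supports reading the three columns per dataset as precision, recall, and their harmonic mean.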

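Reaches state of the art with F1 40.1 on Ego4D and F1 60.3 on AEA, with high recall and competitive precision.
Outperforms both the baselines and the methods that rely on auxiliary modalities (e.g., audio or optical flow) at inference, in the zero-shot and the standard settings.
Ablations that remove or vary components show that the negative region loss, the region suppression loss, and the context perceiver each provide gains; combined, they yield +2.7 F1 on Ego4D and +2.6 F1 on AEA.
The context summary tokens are semantically consistent with the scene descriptions, suggesting that language-guided context capture works as intended.
Zero-shot evaluation (trained on Ego4D, tested on unseen AEA) reaches 53.7 F1, indicating robust generalization.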

Figure 2: The examples of scene summary descriptions, which include location, action, and object information (e.g., living room, reaching for a remote control, and TV) related with the first person.

This review was generated by AI and reviewed by a human editor.