QUICK REVIEW

[Paper Review] Do You See What I Am Pointing At? Gesture-Based Egocentric Video Question Answering

Yura Choi, Roy Miles|arXiv (Cornell University)|2026. 03. 13.

Multimodal Machine Learning Applications | Citations: 0

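One-Line Summary

Summary: The paper introduces EgoPointVQA, a gesture-based VQA dataset, and HINT. HINT encodes 3D hand keypoints as tokens and interleaves them with the visual and text inputs to improve gesture-based grounding in egocentric video. HINT achieves state-of-the-art performance on EgoPointVQA across a range of backbones.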

ABSTRACT

Understanding and answering questions based on a user's pointing gesture is essential for next-generation egocentric AI assistants. However, current Multimodal Large Language Models (MLLMs) struggle with such tasks due to the lack of gesture-rich data and their limited ability to infer fine-grained pointing intent from egocentric video. To address this, we introduce EgoPointVQA, a dataset and benchmark for gesture-grounded egocentric question answering, comprising 4000 synthetic and 400 real-world videos across multiple deictic reasoning tasks. Built upon it, we further propose Hand Intent Tokens (HINT), which encodes tokens derived from 3D hand keypoints using an off-the-shelf reconstruction model and interleaves them with the model input to provide explicit spatial and temporal context for interpreting pointing intent. We show that our model outperforms others in different backbones and model sizes. In particular, HINT-14B achieves 68.1% accuracy, on average over 6 tasks, surpassing the state-of-the-art, InternVL3-14B, by 6.6%. To further facilitate the open research, we will release the code, model, and dataset. Project page: https://yuuraa.github.io/papers/choi2026egovqa

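Research Motivation and Goals

Motivate the need to understand pointing gestures in order to resolve deictic references ("this" or "that") in egocentric VQA.
Build the EgoPointVQA dataset for deictic questions that require temporal and spatial grounding.
Propose HINT, which injects explicit hand-gesture context into MLLMs through 3D hand keypoint tokens.
Demonstrate across backbones that gesture-aware tokens improve grounding and overall VQA accuracy.
Provide open resources (dataset, code, model) to facilitate research on gesture-based VQA.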

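Proposed Method

Introduce EgoPointVQA: 4,000 synthetic and 400 real egocentric videos, with deictic questions across six task types.
Encode gesture information with HINT: a lightweight Keypoint Adapter converts the 21 hand keypoints of each frame into a frame-aligned Hand Intent Token H_t.
Interleave H_t with the visual tokens V_t and the text prompt, so that the MLLM can reason jointly over gesture, space, and time when it receives the sequence.
Extract the keypoints K_t with a 3D hand pose estimator (WiLoR), project them to H_t through a small neural adapter, and apply a confidence threshold c_t >= tau to decide whether the token is inserted.
Train on a mix of real and synthetic data with LoRA fine-tuning of the vision encoder and the LLM; evaluate on the real EgoPointVQA test set using 32 sampled frames per video.
Isolate the gains from HINT through ablations that compare SFT, hand-intent variants, data composition, and gesture-token configurations.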
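
The token construction described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the `Frame` class, the function names, the single linear layer standing in for the Keypoint Adapter, the placement of H_t directly after V_t, the prompt at the end of the sequence, and the default `tau=0.5` are all hypothetical. Only the 21 keypoints per frame and the `c_t >= tau` gating come from the review.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

NUM_KEYPOINTS = 21  # hand keypoints per frame, as stated in the review


@dataclass
class Frame:
    visual_tokens: List[str]                # placeholder for V_t
    keypoints: Optional[List[List[float]]]  # K_t as 21 x 3 values, None if no hand
    confidence: float                       # c_t reported by the hand-pose estimator


def keypoint_adapter(keypoints, weights, bias):
    """Hypothetical stand-in for the Keypoint Adapter: one linear layer K_t -> H_t."""
    flat = [coord for joint in keypoints for coord in joint]
    if len(flat) != NUM_KEYPOINTS * 3:
        raise ValueError("expected 21 keypoints with 3 coordinates each")
    return [sum(w * x for w, x in zip(row, flat)) + b
            for row, b in zip(weights, bias)]


def build_sequence(frames: Sequence[Frame], prompt_tokens: Sequence[str],
                   weights, bias, tau: float = 0.5) -> List[Tuple]:
    """Interleave V_t and H_t per frame, then append the text prompt.

    H_t is inserted only when a hand was detected and c_t >= tau.
    The ordering (V_t, then H_t, prompt last) is an assumption.
    """
    sequence = []
    for t, frame in enumerate(frames):
        sequence.append(("V", t, frame.visual_tokens))
        if frame.keypoints is not None and frame.confidence >= tau:
            h_t = keypoint_adapter(frame.keypoints, weights, bias)
            sequence.append(("H", t, h_t))
    sequence.append(("TEXT", None, list(prompt_tokens)))
    return sequence


# Toy example: three frames, a 4-dimensional hand token, constant weights.
hidden = 4
weights = [[0.01] * (NUM_KEYPOINTS * 3) for _ in range(hidden)]
bias = [0.0] * hidden
hand = [[1.0, 2.0, 3.0]] * NUM_KEYPOINTS
frames = [
    Frame(["v0"], hand, 0.9),  # confident detection: H_0 is inserted
    Frame(["v1"], hand, 0.2),  # below tau: no hand token
    Frame(["v2"], None, 0.0),  # no hand detected: no hand token
]
seq = build_sequence(frames, ["What", "is", "this?"], weights, bias, tau=0.5)
kinds = [kind for kind, _, _ in seq]
print(kinds)
```

In this toy run only the first frame contributes a hand token, which illustrates why gating on c_t keeps the number of added tokens small.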

Figure 2: Task taxonomy and examples from EgoPointVQA. EgoPointVQA includes six subsets of questions regarding the properties of a pointed object. Each example shows egocentric video frames and a question using deictic references. Tasks include reference (object identification), counting (number o

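Experimental Results

Research Questions

RQ1: How effectively can gesture-based cues resolve deictic references in egocentric VQA?
RQ2: Does introducing 3D hand keypoint tokens improve grounding accuracy on pointing questions across different backbones?
RQ3: How do synthetic and real data affect gesture-based VQA performance?
RQ4: How do different hand-intent representations and token thresholds affect task performance and latency?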

Main Results

Method	Size	LLM	Refer.	Temporal	Spatial	Count	Attr.	Feed.	Avg.
Random	-	-	20.0	20.0	27.0	20.0	20.0	50.0	26.2
GPT-5	-	-	75.6	53.6	62.3	50.0	56.1	77.8	62.6
GPT-4o	-	-	56.1	29.5	43.1	44.8	41.5	65.7	46.8
Qwen3-VL 32B	32B	Qwen3	63.7	67.9	65.8	66.7	63.4	77.2	67.5
InternVL2.5	38B	InternLM2.5	61.3	57.1	60.5	39.6	63.4	77.2	59.9
InternVL3	38B	InternLM3	70.2	67.9	65.8	45.8	65.9	78.9	65.8
LLaVA-OneVision	72B	Qwen2	61.3	44.6	60.5	41.7	51.2	72.3	55.3
VGLLM-QA	8B	Qwen2.5	57.7	35.7	53.5	39.6	36.6	70.2	48.9
HINT (InternVL3-14B)	14B	InternLM3	73.8	69.6	64.9	54.2	63.4	82.5	68.1
InternVL3-8B	8B	InternLM3	71.4	71.4	62.3	45.8	68.3	80.1	66.6
HINT (LLaVA-OneVision 7B)	7B	Qwen2	60.7	50.0	56.1	39.6	48.8	71.1	54.4
HINT (InternVL3-8B)	8B	InternLM3	75.0	66.1	64.9	61.0	79.8	63.7	63.7
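
The abstract reports accuracy "on average over 6 tasks", and the Avg. column is consistent with an unweighted mean of the six task columns. The short check below recomputes it for three rows copied from the table; the helper name `macro_average` is hypothetical, and the rounding to one decimal place is an assumption based on how the table is printed.

```python
from statistics import mean

TASKS = ["Refer.", "Temporal", "Spatial", "Count", "Attr.", "Feed."]

# Per-task accuracies copied from the table above, in TASKS order.
rows = {
    "GPT-5": [75.6, 53.6, 62.3, 50.0, 56.1, 77.8],
    "GPT-4o": [56.1, 29.5, 43.1, 44.8, 41.5, 65.7],
    "HINT (LLaVA-OneVision 7B)": [60.7, 50.0, 56.1, 39.6, 48.8, 71.1],
}


def macro_average(scores):
    """Unweighted mean over the six tasks, rounded as in the table."""
    if len(scores) != len(TASKS):
        raise ValueError("expected one score per task")
    return round(mean(scores), 1)


averages = {name: macro_average(scores) for name, scores in rows.items()}
for name, avg in averages.items():
    print(f"{name}: {avg}")
```

The recomputed values (62.6, 46.8, 54.4) match the Avg. entries of those rows.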

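EgoPointVQA is challenging for existing models: average accuracy across tasks stays below 70%.
HINT consistently improves performance across backbones, particularly Reference (grounding) accuracy, e.g., from 63.1% to 73.8% on InternVL3-14B.
Adding synthetic data to the real data tends to give the best overall results (e.g., 75.0% on Reference and 66.1% on Temporal in the mixed setting).
The learned 3D keypoint adapter (HINT) models hand intent better than visual prompts or coordinate inputs.
HINT adds a small inference cost (2.84 s vs. 2.58 s for the baseline on InternVL3-8B), and gesture tokens account for at most 1% of all tokens.
In the ablations, combining SFT with HINT yields the largest gain (e.g., 75.0% Reference accuracy).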
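
To put the reported latency figures in relative terms, the arithmetic below computes the overhead of HINT on InternVL3-8B from the two timings quoted above (2.84 s and 2.58 s). The function name `relative_overhead` is hypothetical; the timings are the only values taken from the review.

```python
def relative_overhead(with_hint_s: float, baseline_s: float) -> float:
    """Fractional increase in inference time relative to the baseline."""
    return (with_hint_s - baseline_s) / baseline_s


overhead = relative_overhead(2.84, 2.58)
print(f"absolute: {2.84 - 2.58:.2f} s, relative: {overhead:.1%}")
# prints "absolute: 0.26 s, relative: 10.1%"
```

An increase of about 0.26 s, roughly 10% of the baseline time, is what the review describes as a slight slowdown.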

Figure 3: Visualization of synthetic videos in EgoPointVQA. Our synthetic data covers diverse indoor scenes with various lighting conditions.


This review was generated by AI and reviewed by a human editor.