QUICK REVIEW

[Paper Review] Do You See What I Am Pointing At? Gesture-Based Egocentric Video Question Answering

Yura Choi, Roy Miles|arXiv (Cornell University)|Mar 13, 2026

Multimodal Machine Learning Applications | Citations: 0

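One-Sentence Summary

This work proposes EgoPointVQA, a gesture-based VQA dataset, and HINT, which encodes 3D hand keypoints as tokens and combines those embeddings with the visual and text inputs to improve gesture grounding in egocentric VQA. HINT achieves state-of-the-art results on EgoPointVQA across multiple backbones.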

ABSTRACT

Understanding and answering questions based on a user's pointing gesture is essential for next-generation egocentric AI assistants. However, current Multimodal Large Language Models (MLLMs) struggle with such tasks due to the lack of gesture-rich data and their limited ability to infer fine-grained pointing intent from egocentric video. To address this, we introduce EgoPointVQA, a dataset and benchmark for gesture-grounded egocentric question answering, comprising 4000 synthetic and 400 real-world videos across multiple deictic reasoning tasks. Built upon it, we further propose Hand Intent Tokens (HINT), which encodes tokens derived from 3D hand keypoints using an off-the-shelf reconstruction model and interleaves them with the model input to provide explicit spatial and temporal context for interpreting pointing intent. We show that our model outperforms others in different backbones and model sizes. In particular, HINT-14B achieves 68.1% accuracy, on average over 6 tasks, surpassing the state-of-the-art, InternVL3-14B, by 6.6%. To further facilitate the open research, we will release the code, model, and dataset. Project page: https://yuuraa.github.io/papers/choi2026egovqa

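Motivation and Objectives

Establish the need for pointing-gesture understanding in egocentric VQA, with the goal of resolving deictic references such as "this" and "that".
Build EgoPointVQA, a gesture-grounded dataset that targets temporal and spatial grounding through deictic questions.
Propose HINT, which injects explicit hand-gesture context into MLLMs via 3D hand keypoint tokens.
Show that gesture-aware tokens improve both grounding and overall VQA accuracy across backbones.
Release open resources (dataset, code, and model) to encourage further research on gesture-grounded VQA.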

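Proposed Method

EgoPointVQA: 4,000 synthetic videos plus 400 real-world egocentric videos, collected with deictic questions spanning six task types.
HINT: a lightweight Keypoint Adapter encodes gesture information by converting the 21 hand keypoints in each frame into a frame-aligned Hand Intent Token H_t.
H_t is interleaved with the visual tokens V_t and the standard text prompt and fed to the MLLM as one sequence, enabling joint reasoning over gesture, space, and time.
The keypoints K_t are extracted with a 3D hand pose estimator (WiLoR) and projected to H_t by a small neural adapter; a token is inserted only when the confidence satisfies c_t >= tau.
Training mixes real and synthetic data and applies LoRA fine-tuning to the vision encoder and the LLM. Evaluation uses 32-frame videos from the EgoPointVQA test set, which includes real data.
Ablations over SFT, hand-intent variants, data composition, and gesture-token settings isolate and confirm the gains attributable to HINT.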
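
The keypoint-to-token path described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the single linear layer, the 8-dimensional token width, the names `HandObservation`, `KeypointAdapter`, and `interleave`, the placement of H_t directly after V_t within a frame, and the default tau = 0.5 are all assumptions made for the example. In practice K_t and c_t would come from a hand pose estimator such as WiLoR; here they are passed in as plain lists.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

NUM_KEYPOINTS = 21  # hand keypoints per frame, as stated in the review
NUM_COORDS = 3      # 3D coordinates per keypoint
TOKEN_DIM = 8       # hypothetical token width, kept tiny for readability


@dataclass
class HandObservation:
    """Hypothetical container for one frame's hand estimate."""
    keypoints: List[List[float]]  # K_t, shape (21, 3)
    confidence: float             # c_t reported by the pose estimator


class KeypointAdapter:
    """Toy adapter: flatten K_t and apply one linear layer to get H_t."""

    def __init__(self, dim: int = TOKEN_DIM, seed: int = 0) -> None:
        rng = random.Random(seed)
        in_dim = NUM_KEYPOINTS * NUM_COORDS
        self.weight = [[rng.gauss(0.0, 0.02) for _ in range(in_dim)]
                       for _ in range(dim)]
        self.bias = [0.0] * dim

    def __call__(self, keypoints: Sequence[Sequence[float]]) -> List[float]:
        flat = [coord for point in keypoints for coord in point]
        if len(flat) != NUM_KEYPOINTS * NUM_COORDS:
            raise ValueError("expected 21 keypoints with 3 coordinates each")
        return [sum(w * x for w, x in zip(row, flat)) + b
                for row, b in zip(self.weight, self.bias)]


Token = Tuple[str, int, List[float]]  # (kind, frame index, embedding)


def interleave(visual_tokens: Sequence[Sequence[List[float]]],
               hands: Sequence[Optional[HandObservation]],
               adapter: KeypointAdapter,
               tau: float = 0.5) -> List[Token]:
    """Emit V_t for every frame, followed by H_t only when c_t >= tau."""
    sequence: List[Token] = []
    for t, (frame_tokens, hand) in enumerate(zip(visual_tokens, hands)):
        for tok in frame_tokens:
            sequence.append(("V", t, tok))
        # Assumed gating rule: skip frames with no hand or low confidence.
        if hand is not None and hand.confidence >= tau:
            sequence.append(("H", t, adapter(hand.keypoints)))
    return sequence


def demo_layout() -> List[str]:
    """Three frames: confident hand, no hand, low-confidence hand."""
    kp = [[0.1 * i, 0.0, 0.5] for i in range(NUM_KEYPOINTS)]
    visual = [[[0.0] * TOKEN_DIM, [1.0] * TOKEN_DIM] for _ in range(3)]
    hands = [HandObservation(kp, 0.9), None, HandObservation(kp, 0.2)]
    seq = interleave(visual, hands, KeypointAdapter())
    return [f"{kind}{t}" for kind, t, _ in seq]


if __name__ == "__main__":
    print(demo_layout())
```

In this toy run only the first frame receives a hand token, because the second frame has no detected hand and the third falls below the assumed threshold. The text prompt would be appended to the sequence in whatever way the backbone expects, which the sketch leaves out.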

Figure 2: Task taxonomy and examples from EgoPointVQA. EgoPointVQA includes six subsets of questions regarding the properties of a pointed object. Each example shows egocentric video frames and a question using deictic references. Tasks include reference (object identification), counting (number o

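Experimental Results

Research Questions

RQ1: How effectively can gesture-grounded cues resolve deictic references in egocentric VQA?
RQ2: Does incorporating 3D hand keypoint tokens improve grounding accuracy on pointing questions across diverse backbones?
RQ3: How does combining synthetic and real data affect gesture-grounded VQA?
RQ4: How do different hand-intent representations and token thresholds affect task performance and latency?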

Key Findings

Method	Size	LLM	Refer.	Temporal	Spatial	Count	Attr.	Feed.	Avg.
Random	-	-	20.0	20.0	27.0	20.0	20.0	50.0	26.2
GPT-5	-	-	75.6	53.6	62.3	50.0	56.1	77.8	62.6
GPT-4o	-	-	56.1	29.5	43.1	44.8	41.5	65.7	46.8
Qwen3-VL	32B	Qwen3	63.7	67.9	65.8	66.7	63.4	77.2	67.5
InternVL2.5	38B	InternLM2.5	61.3	57.1	60.5	39.6	63.4	77.2	59.9
InternVL3	38B	InternLM3	70.2	67.9	65.8	45.8	65.9	78.9	65.8
LLaVA-OneVision	72B	Qwen2	61.3	44.6	60.5	41.7	51.2	72.3	55.3
VGLLM-QA	8B	Qwen2.5	57.7	35.7	53.5	39.6	36.6	70.2	48.9
HINT (InternVL3-14B)	14B	InternLM3	73.8	69.6	64.9	54.2	63.4	82.5	68.1
InternVL3-8B	8B	InternLM3	71.4	71.4	62.3	45.8	68.3	80.1	66.6
HINT (LLaVA-OneVision 7B)	7B	Qwen2	60.7	50.0	56.1	39.6	48.8	71.1	54.4
HINT (InternVL3-8B)	8B	InternLM3	75.0	66.1	64.9	61.0	79.8	63.7	63.7
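
The Avg. column appears to be the unweighted mean of the six task accuracies. The short check below, a sketch under that assumption, recomputes it for three rows copied from the table; the function name `macro_average` and the `ROWS` structure are illustrative only.

```python
from typing import Dict, List, Tuple

TASKS = ("Refer.", "Temporal", "Spatial", "Count", "Attr.", "Feed.")

# (per-task accuracies in table order, reported Avg.) copied from the table
ROWS: Dict[str, Tuple[List[float], float]] = {
    "Random": ([20.0, 20.0, 27.0, 20.0, 20.0, 50.0], 26.2),
    "GPT-5": ([75.6, 53.6, 62.3, 50.0, 56.1, 77.8], 62.6),
    "GPT-4o": ([56.1, 29.5, 43.1, 44.8, 41.5, 65.7], 46.8),
}


def macro_average(scores: List[float]) -> float:
    """Unweighted mean over the six tasks (assumed definition of Avg.)."""
    if len(scores) != len(TASKS):
        raise ValueError("expected one score per task")
    return sum(scores) / len(scores)


if __name__ == "__main__":
    for name, (scores, reported) in ROWS.items():
        recomputed = round(macro_average(scores), 1)
        print(f"{name}: recomputed {recomputed}, reported {reported}")
```

For these three rows the recomputed mean matches the reported value to one decimal place, which supports reading Avg. as a simple per-task mean rather than a question-weighted one.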

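EgoPointVQA is difficult for existing models: average accuracy across tasks stays below 70%.
HINT improves performance consistently regardless of backbone, and raises Reference/grounding accuracy in particular (e.g., from 63.1% to 73.8% with InternVL3-14B).
Adding synthetic data to the real data gives the best overall results (e.g., 75.0% on reference and 66.1% on temporal in the combined setting).
The learned 3D keypoint adapter (HINT) outperforms visual prompting and direct coordinate input for modeling hand intent.
HINT adds only a small inference delay (2.84 s vs. a 2.58 s baseline for InternVL3-8B), and gesture tokens account for less than 1% of all tokens.
In the ablations, the combination of SFT and HINT yields the largest gains; for example, reference accuracy reaches 75.0%.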
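
To put the latency and token figures above in relative terms, the arithmetic can be sketched as below. The latency numbers are the ones reported for InternVL3-8B; the token-share helper is purely illustrative and assumes at most one gesture token per frame and a hypothetical 256 visual tokens per frame, neither of which is stated in the review.

```python
def relative_overhead(with_hint_s: float, baseline_s: float) -> float:
    """Fractional slowdown of the HINT model relative to the baseline."""
    if baseline_s <= 0:
        raise ValueError("baseline latency must be positive")
    return (with_hint_s - baseline_s) / baseline_s


def gesture_token_share(num_frames: int,
                        visual_tokens_per_frame: int,
                        text_tokens: int = 0) -> float:
    """Share of gesture tokens, assuming at most one H_t per frame."""
    gesture = num_frames
    total = num_frames * visual_tokens_per_frame + gesture + text_tokens
    return gesture / total


if __name__ == "__main__":
    # Reported latencies: 2.84 s with HINT vs. 2.58 s baseline.
    print(f"latency overhead: {relative_overhead(2.84, 2.58):.1%}")
    # 32 frames as in the evaluation; 256 visual tokens per frame is a
    # hypothetical value chosen only to illustrate the order of magnitude.
    print(f"gesture token share: {gesture_token_share(32, 256):.2%}")
```

The reported latencies work out to roughly a 10% slowdown. Under the one-token-per-frame assumption, the gesture share stays below 1% whenever each frame contributes about 100 or more visual tokens.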

Figure 3: Visualization of synthetic videos in EgoPointVQA. Our synthetic data covers diverse indoor scenes with various lighting conditions.


This review was generated by AI and checked by a human editor.