QUICK REVIEW

[Paper Review] Seeing Eye to Eye: Enabling Cognitive Alignment Through Shared First-Person Perspective in Human-AI Collaboration

Zhuyu Teng, Pei Chen | arXiv (Cornell University) | Mar 13, 2026

Human-Automation Interaction and Safety | Cited by 0

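One-Sentence Summary

The paper proposes Eye2Eye, a framework built on the first-person perspective that achieves human-AI cognitive alignment through joint attention, accumulated common ground, and reflective feedback. It is implemented in an AR prototype and evaluated in a user study.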

ABSTRACT

Despite advances in multimodal AI, current vision-based assistants often remain inefficient in collaborative tasks. We identify two key gulfs: a communication gulf, where users must translate rich parallel intentions into verbal commands due to the channel mismatch, and an understanding gulf, where AI struggles to interpret subtle embodied cues. To address these, we propose Eye2Eye, a framework that leverages first-person perspective as a channel for human-AI cognitive alignment. It integrates three components: (1) joint attention coordination for fluid focus alignment, (2) revisable memory to maintain evolving common ground, and (3) reflective feedback allowing users to clarify and refine AI's understanding. We implement this framework in an AR prototype and evaluate it through a user study and a post-hoc pipeline evaluation. Results show that Eye2Eye significantly reduces task completion time and interaction load while increasing trust, demonstrating its components work in concert to improve collaboration.

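Research Motivation and Goals

Identify the communication and understanding gulfs that hinder collaboration with wearable AI.
Propose Eye2Eye, which turns the first-person perspective into a shared perceptual channel for cognitive alignment.
Instantiate Eye2Eye in an AR prototype and validate its effectiveness through a user study and a pipeline evaluation.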

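Proposed Method

Define and operationalize Eye2Eye through three core components: joint attention coordination, accumulated common ground, and reflective, situated feedback.
Develop an AR prototype based on Apple Vision Pro for real-time multimodal sensing and feedback.
Implement an object-card memory module that continuously accumulates and revises the interaction history.
Build a two-stage attention pipeline: lightweight perception followed by semantic interpretation with a vision-language model.
Adopt a retrieval-augmented memory workflow that updates context with user corrections and new interactions.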
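
To make the object-card memory and its retrieval-augmented update loop more concrete, here is a minimal Python sketch. It is not the authors' implementation: the names ObjectCard, CardMemory, and build_context are hypothetical, the card schema is assumed, and plain word overlap stands in for whatever retrieval method the paper actually uses.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ObjectCard:
    """One remembered object and what has been learned about it (assumed schema)."""
    object_id: str
    label: str
    notes: list[str] = field(default_factory=list)
    revisions: list[tuple[str, str]] = field(default_factory=list)  # (old_label, new_label)


class CardMemory:
    """Accumulates object cards and lets the user revise them."""

    def __init__(self) -> None:
        self.cards: dict[str, ObjectCard] = {}

    def observe(self, object_id: str, label: str, note: str) -> ObjectCard:
        # Accumulate: create the card on first sight, then append each new interaction.
        card = self.cards.setdefault(object_id, ObjectCard(object_id, label))
        card.notes.append(note)
        return card

    def correct(self, object_id: str, new_label: str) -> None:
        # Revise: a user correction replaces the label but keeps a trail of old values.
        card = self.cards[object_id]
        card.revisions.append((card.label, new_label))
        card.label = new_label

    def retrieve(self, query: str, k: int = 2) -> list[ObjectCard]:
        # Retrieve: rank cards by word overlap with the query
        # (a stand-in for embedding search, chosen only to keep the sketch dependency-free).
        words = set(query.lower().split())

        def score(card: ObjectCard) -> int:
            text = " ".join([card.label, *card.notes]).lower()
            return len(words & set(text.split()))

        ranked = sorted(self.cards.values(), key=score, reverse=True)
        return [c for c in ranked if score(c) > 0][:k]


def build_context(memory: CardMemory, query: str) -> str:
    # Format the retrieved cards as text that could be prepended to a model prompt.
    lines = [f"- {c.label}: {'; '.join(c.notes)}" for c in memory.retrieve(query)]
    return "\n".join(lines)


if __name__ == "__main__":
    memory = CardMemory()
    memory.observe("obj-1", "screwdriver", "user picked it up from the bench")
    memory.observe("obj-2", "red mug", "user looked at it while asking about coffee")
    memory.correct("obj-1", "hex key")  # the user fixes a mislabeled object
    print(build_context(memory, "where is the hex key"))
```

Keeping the revision trail separate from the current label is one possible design choice: later turns see only the corrected label, while the history of corrections remains available for inspection.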

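Experimental Results

Research Questions

RQ1: Can a shared first-person perspective establish and maintain aligned attention between human and AI in real-time tasks?
RQ2: Compared with a baseline wearable assistant, does Eye2Eye lower coordination costs, reduce interaction friction, and increase trust?
RQ3: How do multimodal signals (gaze, gesture, speech) help form and update common ground?
RQ4: What role does persistent object-card memory play in maintaining cognitive alignment across multi-turn interactions?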

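Key Findings

Eye2Eye significantly reduces task completion time and interaction load in collaborative tasks.
The framework increases users' trust in the AI collaborator.
Multimodal expression contributes distinctively to establishing common ground.
The pipeline evaluation indicates potential synergy when all components are integrated in the system.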

This review was generated by AI and reviewed by human editors.