QUICK REVIEW

[Paper Review] A Multimodal Framework for Human-Multi-Agent Interaction

Shaid Hasan, Breenice Lee|arXiv (Cornell University)|Mar 24, 2026

Social Robot Interaction and HRI | Cited by 0

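One-Sentence Summary

This paper proposes a multimodal, LLM-driven framework in which each humanoid robot is an autonomous cognitive agent with perception, planning, and action modules, coordinated by a central mechanism to enable natural human-multi-agent interaction in shared spaces.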

ABSTRACT

Human-robot interaction is increasingly moving toward multi-robot, socially grounded environments. Existing systems struggle to integrate multimodal perception, embodied expression, and coordinated decision-making in a unified framework. This limits natural and scalable interaction in shared physical spaces. We address this gap by introducing a multimodal framework for human-multi-agent interaction in which each robot operates as an autonomous cognitive agent with integrated multimodal perception and Large Language Model (LLM)-driven planning grounded in embodiment. At the team level, a centralized coordination mechanism regulates turn-taking and agent participation to prevent overlapping speech and conflicting actions. Implemented on two humanoid robots, our framework enables coherent multi-agent interaction through interaction policies that combine speech, gesture, gaze, and locomotion. Representative interaction runs demonstrate coordinated multimodal reasoning across agents and grounded embodied responses. Future work will focus on larger-scale user studies and deeper exploration of socially grounded multi-agent interaction dynamics.

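Research Motivation and Objectives

Highlight the need for socially grounded, multi-robot HRI in shared environments.
Propose a framework in which each robot acts as an autonomous cognitive agent with multimodal perception and embodied action.
Demonstrate centralized coordination that manages turn-taking and participation among multiple agents.
Integrate vision-language perception, LLM-driven planning, and action execution in an embodied, modular loop.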

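Proposed Method

Each robot is a modular closed-loop agent comprising perception, planning, and action execution.
Perception takes multimodal input (speech and vision) and processes it with a vision-language model to produce structured observations.
Planning uses an LLM conditioned on the structured input to generate an ordered, parameterized action plan constrained by the robot's embodied capabilities.
Action execution runs a sequence of parameterized primitives (speech, gesture, gaze, locomotion, etc.) and returns status feedback.
A centralized coordinator evaluates each agent's likelihood of responding in order to regulate turn-taking and participation, ensuring non-overlapping speech and coordinated actions.
A demonstration on two humanoid robots shows multimodal grounding and coordinated behavior in interaction scenarios.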
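
The per-robot loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Agent` and `Action` classes, the `stub_perceive` and `stub_plan` functions (stand-ins for the vision-language model and the LLM planner), and the feedback format are all hypothetical names chosen here. Only the four primitive types (speech, gesture, gaze, locomotion) come from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Action:
    primitive: str   # e.g. "speech", "gesture", "gaze", "locomotion"
    params: Dict


@dataclass
class Agent:
    name: str
    capabilities: Set[str]                   # primitives this robot can execute
    perceive: Callable[[Dict], Dict]         # stand-in for the vision-language model
    plan: Callable[[Dict], List[Action]]     # stand-in for the LLM planner
    log: List[str] = field(default_factory=list)

    def step(self, raw_input: Dict) -> List[Dict]:
        """Run one perception -> planning -> action cycle and return status feedback."""
        observation = self.perceive(raw_input)
        proposed = self.plan(observation)
        # Assumed constraint check: drop primitives the robot's body cannot perform.
        feasible = [a for a in proposed if a.primitive in self.capabilities]
        return [self.execute(a) for a in feasible]

    def execute(self, action: Action) -> Dict:
        # A real robot would call its motion or speech stack here.
        self.log.append(action.primitive)
        return {"primitive": action.primitive, "status": "ok"}


def stub_perceive(raw: Dict) -> Dict:
    """Hypothetical structured observation built from speech and vision input."""
    return {"utterance": raw.get("speech", ""), "person_visible": bool(raw.get("image"))}


def stub_plan(obs: Dict) -> List[Action]:
    """Hypothetical ordered, parameterized plan; a real system would query an LLM."""
    actions = []
    if obs["person_visible"]:
        actions.append(Action("gaze", {"target": "speaker"}))
    actions.append(Action("speech", {"text": "Hello!"}))
    actions.append(Action("locomotion", {"distance_m": 0.5}))
    return actions


robot = Agent("robot_a", {"speech", "gaze", "gesture"}, stub_perceive, stub_plan)
feedback = robot.step({"speech": "hi", "image": b"\x00"})
```

In this sketch `robot_a` has no locomotion capability, so the planner's locomotion step is filtered out before execution. Whether the paper enforces capability limits by filtering, by prompt constraints, or both is not stated in this review.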

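Experimental Results

Research Questions

RQ1: How can multimodal perception be fused to produce a coherent interaction context for multi-agent HRI?
RQ2: Can LLM-driven planning generate executable, embodied action plans that respect each agent's capabilities?
RQ3: How does centralized coordination affect turn-taking, participation, and alignment in human-multi-agent interaction?
RQ4: What observable effects do embodied actions and latency have on perceived coordination and engagement?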

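Key Findings

The framework achieves coherent multi-agent interaction, with sequential, non-overlapping speech and embodied responses.
Each robot's perception-planning-action loop integrates speech, vision, and embodied behavior for situationally grounded reasoning.
Centralized coordination avoids conflicting actions and enforces structured turn-taking among agents.
The system demonstrates distributed reasoning across agents: each robot reasons from its own perceptual context to generate a tailored response.
Directed dialogue and shared interaction scenarios demonstrate the ability to connect language with embodied action.
The demonstrations highlight how perception quality and latency affect interaction dynamics and perceived coordination.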
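
One way to picture the structured turn-taking reported above is a coordinator that grants the floor to at most one agent per turn, based on each agent's response-likelihood score. The sketch below is an assumption-laden illustration: the function name `select_speaker`, the `threshold` parameter, and the tie-breaking rule are hypothetical and are not taken from the paper, which this review describes only as evaluating the response likelihood of all agents.

```python
from typing import Dict, Optional


def select_speaker(
    scores: Dict[str, float],
    threshold: float = 0.5,               # assumed cutoff below which no agent responds
    last_speaker: Optional[str] = None,
) -> Optional[str]:
    """Return the single agent allowed to respond this turn, or None."""
    eligible = {name: s for name, s in scores.items() if s >= threshold}
    if not eligible:
        return None
    # Highest score wins. On ties, prefer an agent that did not just speak,
    # then fall back to name order so the result is deterministic.
    return min(eligible, key=lambda n: (-eligible[n], n == last_speaker, n))


winner = select_speaker({"robot_a": 0.8, "robot_b": 0.6})
tie_break = select_speaker({"robot_a": 0.7, "robot_b": 0.7}, last_speaker="robot_a")
nobody = select_speaker({"robot_a": 0.2, "robot_b": 0.1})
```

Because only one name is ever returned, overlapping speech is ruled out by construction in this sketch. A real coordinator would also need to handle latency and agents whose scores arrive late, which the findings above flag as affecting perceived coordination.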


This review was generated by AI and reviewed by human editors.