QUICK REVIEW

[Paper Review] Structure Matters: Evaluating Multi-Agents Orchestration in Generative Therapeutic Chatbots

Sina Elahimanesh, Mohammadali Mohammadkhani|arXiv (Cornell University)|Feb 28, 2026
Digital Mental Health Interventions | Cited by 0
One-Sentence Summary

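This study compares three LLM-based therapeutic chatbot architectures (a multi-agent finite state machine with shared long-term memory; a single agent with SAT knowledge; and an unguided GPT-4o) in a Farsi-language SAT therapy setting. It finds that the multi-agent finite state machine design is rated significantly more natural and human-like than the alternatives and scores higher on most other interaction-quality measures.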

ABSTRACT

While large language models (LLMs) excel at open-ended dialogue, effective psychotherapy requires structured progression and adherence to clinical protocols, making the design of psychotherapist chatbots challenging. We investigate how different LLM-based designs shape perceived therapeutic dialogue in a chatbot grounded in the Self-Attachment Technique (SAT), a novel self-administered psychotherapy rooted in attachment theory. We compare three architectural variants: (1) a multi-agent system utilizing finite state machine aligned with therapeutic stages and a shared long-term memory, (2) a single-agent using identical knowledge-base and the same prompts, and (3) an unguided LLM. In an eight-day randomized controlled trial (RCT) with N=66 Farsi-speaking participants, balanced across the three chatbots, the multi-agent system is perceived as significantly more natural and human-like than the other variants and achieves higher ratings across most other metrics. These findings demonstrate that for therapeutic AI, architectural orchestration is as critical as prompt engineering in fostering natural, engaging dialogue.

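Motivation and Objectives

  • Assess how the architectural design of LLM-driven therapeutic chatbots affects perceived therapeutic quality.
  • Compare three architectures under controlled conditions (a multi-agent FSM with memory; a single agent with SAT knowledge; an unguided LLM).
  • Examine effects on naturalness, trust, empathy, memory, satisfaction, and conversational focus.
  • Explore the mechanisms by which architectural structure shapes dialogue dynamics and engagement.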

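Proposed Method

  • A three-condition, between-subjects randomized controlled trial (N=66) assigns participants to Alpha (multi-agent FSM with memory), Beta (single agent with SAT content), or Gamma (unguided single agent).
  • All conditions use GPT-4o as the base model and share the same interface; per the abstract, Beta reuses Alpha's prompts. Prompts are written in English and the interface is designed in English, but the system is deployed for Farsi-speaking users.
  • Alpha uses a 12-state, SAT-aligned FSM with shared long-term memory and adaptive retrieval-augmented generation (RAG) to personalize exercises.
  • Beta uses the same SAT content and exercises but relies on a single prompt, with no explicit FSM constraints.
  • Gamma is a minimal LLM setup with no SAT knowledge or structured goals.
  • Longitudinal memory summaries are generated, and a calendar model tracks progress from Day 1 to Day 8.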
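The orchestration idea behind Alpha can be illustrated with a small sketch: one agent per therapeutic stage, a state machine that decides which agent answers, and a memory object shared by all agents. This is only an illustration under stated assumptions, not the authors' implementation. The three state names, the stub agents, the `SharedMemory` class, and `build_demo_fsm` are all hypothetical; the study's FSM has 12 states and calls GPT-4o with RAG, neither of which is reproduced here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class SharedMemory:
    """Long-term notes visible to every state agent (hypothetical structure)."""
    notes: List[str] = field(default_factory=list)

    def remember(self, note: str) -> None:
        self.notes.append(note)

    def recall(self, k: int = 3) -> List[str]:
        # Most recent k notes; a real system might retrieve by relevance instead.
        return self.notes[-k:]


# An agent maps (user message, recalled notes) to (reply, stage finished?).
Agent = Callable[[str, List[str]], Tuple[str, bool]]


def greeting_agent(msg: str, notes: List[str]) -> Tuple[str, bool]:
    return ("Welcome back." if notes else "Welcome.", True)


def exercise_agent(msg: str, notes: List[str]) -> Tuple[str, bool]:
    finished = "done" in msg.lower()
    return ("Great work." if finished else "Let's try the exercise.", finished)


def closing_agent(msg: str, notes: List[str]) -> Tuple[str, bool]:
    return ("See you tomorrow.", False)


class TherapyFSM:
    """Routes each user turn to the agent that owns the current state."""

    def __init__(self, agents: Dict[str, Agent], transitions: Dict[str, str], start: str):
        self.agents = agents
        self.transitions = transitions
        self.state = start
        self.memory = SharedMemory()

    def step(self, user_msg: str) -> str:
        reply, finished = self.agents[self.state](user_msg, self.memory.recall())
        self.memory.remember(f"[{self.state}] {user_msg}")
        # Advance only when the current agent reports its stage as finished.
        if finished and self.state in self.transitions:
            self.state = self.transitions[self.state]
        return reply


def build_demo_fsm() -> TherapyFSM:
    # Hypothetical three-stage flow standing in for the 12-state SAT design.
    agents = {"greeting": greeting_agent, "exercise": exercise_agent, "closing": closing_agent}
    transitions = {"greeting": "exercise", "exercise": "closing"}
    return TherapyFSM(agents, transitions, start="greeting")
```

Calling `build_demo_fsm().step(...)` repeatedly walks the conversation through the stages. The explicit transition table is what a single-prompt design such as Beta lacks: there, stage progression is left to the model rather than enforced by code.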
Figure 1. Overview of the user study comprising three phases: (1) recruitment and blinded RCT group assignment; (2) an eight-day study period during which participants interacted with one of three therapeutic chatbot versions, multi-agent FSM-based, single-agent with therapy knowledge, or unguided s

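Experimental Results

Research Questions

  • RQ1: Does architectural orchestration (a multi-agent FSM with memory) improve perceived naturalness compared with a single-agent SAT system and an unguided LLM?
  • RQ2: Which specific dialogue dynamics (turn counts, message length, agent-to-user message ratio) emerge under the different architectures?
  • RQ3: To what extent do architectural differences affect variables such as trust, empathy, memory coherence, and satisfaction in an SAT-informed chatbot?
  • RQ4: How do the three systems perform on therapeutic progression and memory maintenance over the eight-day trial?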
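The dialogue dynamics named in RQ2 can be computed directly from a conversation log. The sketch below assumes a hypothetical log format (a list of `(speaker, text)` pairs) and a hypothetical function name, `dialogue_metrics`. This summary does not spell out how the paper defines the agent-to-user ratio, so both a count-based and a length-based version are shown.

```python
from statistics import mean
from typing import Dict, List, Tuple


def dialogue_metrics(log: List[Tuple[str, str]]) -> Dict[str, float]:
    """Summarize a log of (speaker, text) pairs; speaker is "agent" or "user"."""
    agent_lengths = [len(text) for speaker, text in log if speaker == "agent"]
    user_lengths = [len(text) for speaker, text in log if speaker == "user"]
    agent_mean = mean(agent_lengths)
    user_mean = mean(user_lengths)
    return {
        "agent_messages": len(agent_lengths),
        "user_messages": len(user_lengths),
        "agent_mean_chars": agent_mean,
        "user_mean_chars": user_mean,
        # Two candidate definitions of an agent-to-user ratio (assumption).
        "count_ratio": len(agent_lengths) / len(user_lengths),
        "length_ratio": agent_mean / user_mean,
    }
```

The function expects at least one message from each speaker; an empty side would need explicit handling before the divisions.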

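Key Findings

  • Alpha is rated significantly more natural and human-like than Beta and Gamma (Alpha: mean 3.955, SD 0.950; Beta: mean 3.043, SD 0.825; Gamma: mean 3.211, SD 0.787).
  • The statistical test yields F=7.017, p_perm=0.0018, eta^2=0.187, indicating that architecture explains about 19% of the variance in ratings.
  • Alpha produces more but shorter messages (459 messages in total, about 230 characters) than Beta (336 messages, about 409 characters) and Gamma (206 messages, about 635 characters).
  • Participants also send shorter messages on average in Alpha (29.0 characters) than in Beta (38.9) or Gamma (42.8).
  • Alpha's dialogue dynamics show a lower agent-to-user message ratio (7.9:1) than Beta (10.5:1) and Gamma (13.4:1).
  • Table 1 shows that Alpha outperforms Beta and Gamma on most interaction metrics, especially naturalness; usability measures are similar across conditions.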
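For readers unfamiliar with the reported statistics, the sketch below shows how a one-way F statistic, eta squared (the share of total variance attributable to group membership), and a permutation p-value can be computed for three groups. It assumes a standard label-shuffling permutation test; the paper's exact procedure may differ. The function names are hypothetical, and any data passed in would be the reader's own, since participant-level ratings are not available here.

```python
import random
from statistics import mean
from typing import List, Sequence, Tuple


def f_and_eta_squared(groups: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """One-way ANOVA F statistic and eta squared for a list of groups."""
    pooled = [x for g in groups for x in g]
    grand = mean(pooled)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(pooled) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    eta_sq = ss_between / (ss_between + ss_within)
    return f_stat, eta_sq


def permutation_p(groups: Sequence[Sequence[float]], n_perm: int = 2000, seed: int = 0) -> float:
    """Share of label shuffles whose F is at least as large as the observed F."""
    rng = random.Random(seed)
    observed, _ = f_and_eta_squared(groups)
    pooled: List[float] = [x for g in groups for x in g]
    sizes = [len(g) for g in groups]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        regrouped, start = [], 0
        for n in sizes:
            regrouped.append(pooled[start:start + n])
            start += n
        f_perm, _ = f_and_eta_squared(regrouped)
        if f_perm >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)
```

An eta squared of 0.187 read this way means the between-group sum of squares is about 19% of the total sum of squares, which is the interpretation given in the findings. The sketch does not guard against zero within-group variance, which would make the F statistic undefined.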
Figure 2. Screenshot of the web-based user interface of the chatbot. After logging in, users are directed to the home screen where they can start interacting with the chatbot. (A) shows the list of user messages and corresponding chatbot responses. (B) is the input area for composing and sending mes


This review was generated by AI and reviewed by a human editor.