[Paper Summary] Beyond Simulations: What 20,000 Real Conversations Reveal About Mental Health AI Safety
The paper replicates four published safety benchmarks on both a general-purpose LLM and a purpose-built mental health AI, then conducts an ecological audit of over 20,000 real conversations. It finds that real-world safety outcomes are often better than test-set results and highlights the need for deployment-relevant safety assurance.
Large language models (LLMs) are increasingly used for mental health support, yet existing safety evaluations rely primarily on small, simulation-based test sets that have an unknown relationship to the linguistic distribution of real usage. In this study, we present replications of four published safety test sets targeting suicide risk assessment, harmful content generation, refusal robustness, and adversarial jailbreaks for a leading frontier generic AI model alongside an AI purpose-built for mental health support. We then propose and conduct an ecological audit on over 20,000 real-world user conversations with the purpose-built AI, which is designed with layered suicide and non-suicidal self-injury (NSSI) safeguards, to compare test-set performance to real-world performance. While the purpose-built AI was significantly less likely than general-purpose LLMs to produce enabling or harmful content across suicide/NSSI (0.4-11.27% vs 29.0-54.4%), eating disorder (8.4% vs 54.0%), and substance use (9.9% vs 45.0%) benchmark prompts, test-set failure rates for suicide/NSSI were far higher than in real-world deployment. Clinician review of flagged conversations from the ecological audit identified zero cases of suicide risk that failed to receive crisis resources. Across all 20,000 conversations, three mentions of NSSI risk (0.015%) did not trigger a crisis intervention; among sessions flagged by the LLM judge, this corresponds to an end-to-end system false negative rate of 0.38%, providing a lower bound on real-world safety failures. These findings support a shift toward continuous, deployment-relevant safety assurance for AI mental-health systems rather than limited-set benchmark certification.
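Research Motivation and Goals
- Evaluate how well existing safety test sets for suicide risk, harmful content, refusal robustness, and adversarial jailbreaks align with real-world usage when applied to mental health AI.
- Compare a general-purpose LLM with a purpose-built mental health support AI across multiple safety dimensions.
- Quantify, in real conversations, how often enabling or harmful content actually occurs and how often and how effectively crisis interventions are triggered.
- Identify the gap between benchmark failures and real-world safety outcomes to inform safety assurance practice.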
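Proposed Method
- Replicate four published safety test sets on a leading frontier general-purpose AI model and on a purpose-built mental health AI.
- Conduct an ecological audit of more than 20,000 real-world user conversations with the purpose-built mental health AI, which uses layered suicide and non-suicidal self-injury (NSSI) safeguards.
- Compare test-set failure rates with real-world deployment outcomes for suicide/NSSI, eating disorder, and substance use prompts.
- Have clinicians review flagged conversations to assess crisis intervention effectiveness and end-to-end safety.
- Report the end-to-end system false negative rate as a lower bound on real-world safety failures.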
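The last step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the `AuditedSession` record, its field names, and the toy data are all hypothetical, and the choice of judge-flagged sessions as the denominator is one reading of the abstract's phrase "among sessions flagged by the LLM judge".

```python
from dataclasses import dataclass


@dataclass
class AuditedSession:
    """One audited conversation. All field names are hypothetical."""
    flagged_by_judge: bool          # LLM judge marked the session as possibly risky
    clinician_confirmed_risk: bool  # clinician review confirmed suicide/NSSI risk
    crisis_resources_shown: bool    # the system delivered crisis resources


def end_to_end_false_negative_rate(sessions):
    """Share of judge-flagged sessions with confirmed risk but no crisis response.

    Assumption: the denominator is the set of judge-flagged sessions.
    """
    flagged = [s for s in sessions if s.flagged_by_judge]
    if not flagged:
        return 0.0
    missed = sum(
        1 for s in flagged
        if s.clinician_confirmed_risk and not s.crisis_resources_shown
    )
    return missed / len(flagged)


# Toy data, invented purely for illustration (not from the paper).
toy_sessions = [
    AuditedSession(True, True, True),     # risk confirmed, resources delivered
    AuditedSession(True, True, False),    # risk confirmed, resources missed
    AuditedSession(True, False, False),   # flagged, but no risk on review
    AuditedSession(True, True, True),
    AuditedSession(False, False, False),  # never flagged, so never reviewed
]
toy_rate = end_to_end_false_negative_rate(toy_sessions)
print(f"toy false negative rate: {toy_rate:.2%}")
```

Under this reading, a session that the judge never flags cannot be counted as a miss, which is one plausible reason the reported rate is described as a lower bound.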
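Experimental Results
Research Questions
- RQ1: When applied to mental health AI systems, do safety test sets overestimate or underestimate real-world risk?
- RQ2: How does a purpose-built mental health AI perform on safety benchmarks relative to a general-purpose LLM?
- RQ3: In conversations with a mental health AI, how often does enabling or harmful content occur in the real world, and how often are crisis resources successfully triggered?
- RQ4: What gaps in crisis resource deployment and safety does clinician review reveal in real-world use?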
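Key Findings
- On benchmark prompts, the purpose-built mental health AI was significantly less likely than general-purpose LLMs to produce enabling or harmful content: 0.4-11.27% vs 29.0-54.4% for suicide/NSSI, 8.4% vs 54.0% for eating disorders, and 9.9% vs 45.0% for substance use.
- Test-set failure rates for suicide/NSSI were far higher than those observed in real-world deployment.
- Clinician review of flagged conversations found zero cases of suicide risk that failed to receive crisis resources.
- Across the 20,000 conversations, three mentions of NSSI risk (0.015%) did not trigger a crisis intervention; among sessions flagged by the LLM judge, this corresponds to an end-to-end system false negative rate of 0.38%.
- The findings support shifting safety assurance toward continuous, deployment-relevant evaluation rather than relying on benchmark certification alone.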
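The arithmetic behind these figures can be checked with a short Python sketch. It uses only numbers quoted in the findings; the variable names and the `failure_ratio` helper are illustrative, and the ratios are a simple back-of-envelope comparison rather than a statistic reported by the paper.

```python
# Figures quoted in the findings above.
missed_nssi_mentions = 3
audited_conversations = 20_000  # the summary also describes this as "over 20,000"

# Per-conversation rate of NSSI mentions that did not trigger an intervention.
miss_rate_pct = missed_nssi_mentions / audited_conversations * 100

# Benchmark failure rates in percent: (purpose-built AI, general-purpose LLM).
benchmark_failure_pct = {
    "eating disorder": (8.4, 54.0),
    "substance use": (9.9, 45.0),
}


def failure_ratio(purpose_built_pct, general_pct):
    """How many times higher the general-purpose failure rate is."""
    return general_pct / purpose_built_pct


ratios = {
    topic: failure_ratio(*pair) for topic, pair in benchmark_failure_pct.items()
}

print(f"per-conversation miss rate: {miss_rate_pct:.3f}%")  # 0.015%
for topic, ratio in ratios.items():
    print(f"{topic}: general-purpose failure rate is about {ratio:.1f}x higher")
```

On these numbers the general-purpose model fails roughly 6.4 times as often on eating disorder prompts and roughly 4.5 times as often on substance use prompts. The suicide/NSSI comparison is reported as ranges, so it is left out of the ratio calculation.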
This summary was generated by AI and reviewed by a human editor.