QUICK REVIEW

[Paper Review] Empathy Applicability Modeling for General Health Queries

Shan Randhawa, Agha Ali Raza | arXiv (Cornell University) | Jan 14, 2026

Machine Learning in Healthcare | Cited by 0

One-Sentence Summary

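The paper proposes the Empathy Applicability Framework (EAF) to identify, before a response is written, when and which type of clinical empathy a general health query calls for, and shows through human and GPT-4o annotations that these judgments follow learnable patterns.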

ABSTRACT

LLMs are increasingly being integrated into clinical workflows, yet they often lack clinical empathy, an essential aspect of effective doctor-patient communication. Existing NLP frameworks focus on reactively labeling empathy in doctors' responses but offer limited support for anticipatory modeling of empathy needs, especially in general health queries. We introduce the Empathy Applicability Framework (EAF), a theory-driven approach that classifies patient queries in terms of the applicability of emotional reactions and interpretations, based on clinical, contextual, and linguistic cues. We release a benchmark of real patient queries, dual-annotated by Humans and GPT-4o. In the subset with human consensus, we also observe substantial human-GPT alignment. To validate EAF, we train classifiers on human-labeled and GPT-only annotations to predict empathy applicability, achieving strong performance and outperforming the heuristic and zero-shot LLM baselines. Error analysis highlights persistent challenges: implicit distress, clinical-severity ambiguity, and contextual hardship, underscoring the need for multi-annotator modeling, clinician-in-the-loop calibration, and culturally diverse annotation. EAF provides a framework for identifying empathy needs before response generation, establishes a benchmark for anticipatory empathy modeling, and enables supporting empathetic communication in asynchronous healthcare.

Motivation and Objectives

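Motivate anticipatory empathy modeling for general health queries, moving beyond post-hoc labeling of doctors' responses.
Propose the Empathy Applicability Framework (EAF), which labels each query as Applicable or Not Applicable along two dimensions: emotional reactions and interpretations.
Build and analyze a benchmark of real patient queries annotated by humans and GPT-4o to assess reliability and alignment.
Demonstrate classifiers that predict empathy applicability, and compare them against baselines that include zero-shot LLM inference.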

Proposed Method

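Ground EAF in the clinical empathy literature, distinguishing anticipatory (pre-response) analysis from post-hoc labeling.
Define two dimensions (emotional reactions and interpretations), each with Applicable/Not Applicable cues and subcategories.
Annotate 9,500 queries from HealthCareMagic and iCliniq; 1,300 are dual-annotated (humans and GPT-4o) and 8,000 are annotated by GPT only.
Fine-tune RoBERTa-based EA and IA classifiers on separate labeled datasets (the human-labeled set and the autonomous, GPT-labeled set).
Assess reliability through human-human agreement and human-GPT alignment, and compare against baselines (Random, Always Applicable, Always Not Applicable, o1 Zero-Shot).
Run ablation and disagreement analyses to understand how subjectivity, clinical-severity ambiguity, and contextual hardship affect empathy applicability.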
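The reliability step above relies on chance-corrected agreement, and the findings report it as Cohen's κ. As a minimal sketch of how such a score is computed for two annotators, the function below uses only the standard library. The label lists are invented toy data, and the function name `cohens_kappa` is an illustrative choice, not something taken from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a.keys() | freq_b.keys()) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (p_o - p_e) / (1 - p_e)


# Toy labels (assumption, for illustration only):
# "A" = Applicable, "N" = Not Applicable.
annotator_1 = ["A", "A", "N", "N", "A", "N", "A", "N", "A", "A"]
annotator_2 = ["A", "N", "N", "N", "A", "A", "A", "N", "A", "N"]

# Here p_o = 0.7 and p_e = 0.5, so kappa is approximately 0.4.
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

On these toy labels the annotators agree on 7 of 10 items, but half of that agreement is expected by chance, which is why κ comes out well below the raw agreement rate.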

[Figure, panel (a): Interpretation Applicability (IA) subcategory matches]

Experimental Results

Research Questions

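RQ1: Can EAF reliably predict when emotional or interpretive empathy is applicable in patient queries?
RQ2: Do humans and GPT-4o agree when identifying empathy applicability cues, and do their annotations yield learnable patterns?
RQ3: How do different training data (human-consensus labels vs. GPT-only labels) affect the predictive performance of empathy applicability classifiers?
RQ4: What are the main sources of disagreement between human and model judgments, and how can multi-annotator, clinician-in-the-loop approaches address them?
RQ5: What are the practical limitations and ethical considerations of deploying anticipatory empathy in clinical or general health settings?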

Key Findings

Training Set / Model	EA Acc	EA Macro-F1	EA Wtd-F1	IA Acc	IA Macro-F1	IA Wtd-F1
Random	0.47	0.47	0.47	0.44	0.43	0.44
Always Applicable	0.52	0.34	0.36	0.53	0.35	0.37
Always Not Applicable	0.48	0.32	0.31	0.47	0.32	0.30
o1 Zero-Shot	0.55	0.40	0.41	0.62	0.53	0.54
Human-supervised (trained and tested on human-consensus set): Logistic Regression	0.84	0.84	0.84	0.80	0.80	0.80
Human-supervised (trained and tested on human-consensus set): Linear SVM	0.83	0.83	0.83	0.77	0.77	0.77
Human-supervised (trained and tested on human-consensus set): Transformer (RoBERTa-base)	0.92	0.92	0.92	0.87	0.87	0.87
Autonomous-supervised (trained on GPT labels, tested on human-consensus test set): Transformer (RoBERTa-base)	0.85	0.85	0.85	0.78	0.77	0.77
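The two constant baselines in the table have accuracy near 0.5 but macro-F1 near 0.33, which can look inconsistent at first. A short worked calculation explains it: a constant predictor scores F1 = 0 on the class it never predicts. The sketch below assumes a binary task and assumes that each constant baseline's accuracy equals the test-set prevalence of the class it always predicts; the prevalences are read off the (rounded) accuracy column, so agreement with the table is only expected to two decimals. The function name `constant_baseline_scores` is illustrative.

```python
def constant_baseline_scores(prevalence):
    """Accuracy, macro-F1 and weighted-F1 of always predicting one class
    whose test-set prevalence is `prevalence` (binary task)."""
    # For the predicted class: precision = prevalence, recall = 1.
    f1_predicted = 2 * prevalence / (1 + prevalence)
    # The other class is never predicted, so its F1 is 0.
    f1_other = 0.0
    accuracy = prevalence
    macro_f1 = (f1_predicted + f1_other) / 2
    weighted_f1 = prevalence * f1_predicted + (1 - prevalence) * f1_other
    return accuracy, macro_f1, weighted_f1


# Prevalences taken from the accuracy column of the table (assumption).
rows = [
    ("EA Always Applicable", 0.52),
    ("IA Always Applicable", 0.53),
    ("EA Always Not Applicable", 0.48),
    ("IA Always Not Applicable", 0.47),
]
for name, p in rows:
    acc, macro, wtd = constant_baseline_scores(p)
    print(f"{name}: acc={acc:.2f} macro-F1={macro:.2f} wtd-F1={wtd:.2f}")
```

Under these assumptions the computed macro-F1 and weighted-F1 round to the values reported for both constant baselines, which suggests the test set is close to balanced on both dimensions.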

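Human-human agreement is moderate (Cohen's κ ≈ 0.46), while human-GPT alignment is substantial (κ > 0.6 on the applicability labels in the human-consensus subset).
The RoBERTa-based classifiers outperform all baselines. On the human-labeled set, Logistic Regression and Linear SVM reach macro-F1 between 0.77 and 0.84, while RoBERTa-base reaches 0.92 on EA and 0.87 on IA.
The model trained only on GPT labels reaches macro-F1 of about 0.85 (EA) and 0.77 (IA) on the human-consensus test set, indicating that GPT labels carry learnable patterns.
All baselines trail the Transformer models; McNemar's tests show that the transformers significantly outperform the simple and the classical baselines (p < 10^-4 and p ≤ 0.02, respectively).
Disagreement analysis surfaces three challenges: the subjectivity of implicit distress, clinical-severity ambiguity, and contextual hardship, pointing to the need for multi-annotator modeling and culturally diverse annotation.
The work provides a benchmark of 1,300 queries with reliable EAF labels and shows the potential for integrating anticipatory empathy modeling into clinician-in-the-loop workflows.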
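The significance claims above rest on McNemar's test, which compares two classifiers on the same test items using only the items where exactly one of them is correct. The summary does not say which variant the authors used, so the sketch below shows the exact (binomial) two-sided form as one common choice. The discordant counts are invented toy numbers, and `mcnemar_exact` is an illustrative name.

```python
from math import comb


def mcnemar_exact(only_a_correct, only_b_correct):
    """Two-sided exact McNemar p-value from the two discordant counts.

    only_a_correct: items classifier A gets right and B gets wrong.
    only_b_correct: items classifier B gets right and A gets wrong.
    """
    n = only_a_correct + only_b_correct
    if n == 0:
        return 1.0  # the classifiers never disagree in correctness
    k = min(only_a_correct, only_b_correct)
    # Under H0 each discordant item favors A or B with probability 0.5,
    # so the smaller count follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy counts (assumption): A alone is right on 10 items, B alone on 0.
p_value = mcnemar_exact(10, 0)
print(f"p = {p_value:.6f}")
```

Items that both classifiers get right, or both get wrong, do not enter the statistic at all, which is why the test suits paired comparisons on a shared test set.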

[Figure, panel (b): Emotional Applicability (EA) subcategory matches]


This review was generated by AI and checked by a human editor.