QUICK REVIEW

[Paper Review] Empathy Applicability Modeling for General Health Queries

Shan Randhawa, Agha Ali Raza | arXiv (Cornell University) | Jan 14, 2026

Machine Learning in Healthcare | Citations: 0

One-Line Summary

The paper introduces the Empathy Applicability Framework (EAF) to prospectively identify when and what type of clinical empathy is needed in general health queries, and demonstrates learnable patterns via human and GPT-4o annotations.

ABSTRACT

LLMs are increasingly being integrated into clinical workflows, yet they often lack clinical empathy, an essential aspect of effective doctor-patient communication. Existing NLP frameworks focus on reactively labeling empathy in doctors' responses but offer limited support for anticipatory modeling of empathy needs, especially in general health queries. We introduce the Empathy Applicability Framework (EAF), a theory-driven approach that classifies patient queries in terms of the applicability of emotional reactions and interpretations, based on clinical, contextual, and linguistic cues. We release a benchmark of real patient queries, dual-annotated by Humans and GPT-4o. In the subset with human consensus, we also observe substantial human-GPT alignment. To validate EAF, we train classifiers on human-labeled and GPT-only annotations to predict empathy applicability, achieving strong performance and outperforming the heuristic and zero-shot LLM baselines. Error analysis highlights persistent challenges: implicit distress, clinical-severity ambiguity, and contextual hardship, underscoring the need for multi-annotator modeling, clinician-in-the-loop calibration, and culturally diverse annotation. EAF provides a framework for identifying empathy needs before response generation, establishes a benchmark for anticipatory empathy modeling, and enables supporting empathetic communication in asynchronous healthcare.

Motivation and Objectives

Motivate the need for anticipatory empathy modeling in general health queries beyond post-hoc labeling.
Propose the Empathy Applicability Framework (EAF) to classify queries by emotional reactions and interpretations as Applicable/Not Applicable.
Create and analyze a benchmark of real queries annotated by humans and GPT-4o to evaluate annotation reliability and human-GPT alignment.
Demonstrate predictive models that classify empathy applicability and compare against baselines including zero-shot LLM reasoning.

Proposed Method

Develop EAF grounded in clinical empathy literature, distinguishing anticipatory (pre-response) analysis from post-hoc labeling.
Define two dimensions (Emotional Reactions, Interpretations) with Applicable/Not Applicable cues and subcategories.
Annotate 9,500 queries from HealthCareMagic and iCliniq; 1,300 queries dual-annotated by humans and GPT-4o; 8,000 GPT-only annotations.
Fine-tune RoBERTa-based classifiers for Emotional Applicability (EA) and Interpretation Applicability (IA) on separate labeled datasets (Human Set and Autonomous Set).
Evaluate reliability via human-human agreement and human-GPT alignment; compare against baselines (Random, Always Applicable, Always Not Applicable, o1 Zero-Shot).
Perform ablation and divergence analysis to understand subjectivity, clinical-severity ambiguity, and contextual hardship in empathy applicability.
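
The reliability step in the list above (human-human agreement, then human-GPT alignment on the human-consensus subset) is the kind of comparison usually summarized with Cohen's kappa. The following is a minimal sketch of that computation, not the authors' code: the function name, the "A"/"N" label encoding, and the ten toy labels are all hypothetical and chosen only to make the arithmetic easy to follow.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two label sequences over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    classes = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[c] * counts_b[c] for c in classes) / (n * n)
    if expected == 1.0:
        return 1.0  # both sequences use a single identical label
    return (observed - expected) / (1 - expected)

# Hypothetical toy labels: "A" = Applicable, "N" = Not Applicable.
human_1 = ["A", "A", "A", "A", "A", "N", "N", "N", "N", "N"]
human_2 = ["A", "A", "A", "A", "N", "N", "N", "N", "A", "A"]
gpt     = ["A", "A", "A", "N", "A", "N", "N", "N", "N", "A"]

human_human_kappa = cohens_kappa(human_1, human_2)

# Keep only items where both human annotators agree (the consensus subset),
# then compare the consensus label with the GPT label on that subset.
consensus_idx = [i for i, (a, b) in enumerate(zip(human_1, human_2)) if a == b]
human_gpt_kappa = cohens_kappa(
    [human_1[i] for i in consensus_idx],
    [gpt[i] for i in consensus_idx],
)

print(round(human_human_kappa, 2))  # 0.4
print(round(human_gpt_kappa, 2))    # 0.72
```

The toy values are illustrative only and are unrelated to the kappa figures reported for the benchmark.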

(a) Interpretation Applicability (IA) subcategory matches

Experimental Results

Research Questions

RQ1: Can EAF reliably predict when emotional or interpretive empathy is applicable to a patient query?
RQ2: Do humans and GPT-4o align in identifying empathy applicability cues, and can learnable patterns be extracted?
RQ3: How do different training data (human consensus vs GPT-only labels) affect predictive performance of empathy applicability classifiers?
RQ4: What are the main sources of divergence between human and model judgments, and how can multi-annotator, clinician-in-the-loop approaches address them?
RQ5: What are the practical limitations and ethical considerations for deploying anticipatory empathy in clinical or general-health settings?

Key Findings

Training Set / Model	EA Acc	EA Macro-F1	EA Wtd-F1	IA Acc	IA Macro-F1	IA Wtd-F1
Random	0.47	0.47	0.47	0.44	0.43	0.44
Always Applicable	0.52	0.34	0.36	0.53	0.35	0.37
Always Not Applicable	0.48	0.32	0.31	0.47	0.32	0.30
o1 Zero-Shot	0.55	0.40	0.41	0.62	0.53	0.54
Human-supervised models (trained and tested on human-consensus set) - Logistic Regression	0.84	0.84	0.84	0.80	0.80	0.80
Human-supervised models (trained and tested on human-consensus set) - Linear SVM	0.83	0.83	0.83	0.77	0.77	0.77
Human-supervised models (trained and tested on human-consensus set) - Transformer (RoBERTa-base)	0.92	0.92	0.92	0.87	0.87	0.87
Autonomous-supervised model (trained on GPT labels, tested on human-consensus test set) - Transformer (RoBERTa-base)	0.85	0.85	0.85	0.78	0.77	0.77
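
The two constant baselines in the table can be sanity-checked from their accuracy alone. A classifier that always predicts one class has precision equal to that class's prevalence, recall 1 on that class, and F1 of 0 on the class it never predicts. The sketch below works through that arithmetic; the function name is hypothetical and it uses only the standard binary F1 definitions, not anything specific to the paper.

```python
def constant_baseline_scores(prevalence: float) -> dict:
    """Scores for a binary classifier that always predicts one class.

    `prevalence` is the fraction of test items whose true label is the
    class being predicted (this also equals the baseline's accuracy).
    """
    precision = prevalence  # every prediction is the same class
    recall = 1.0            # every true member of that class is found
    f1_predicted = 2 * precision * recall / (precision + recall)
    f1_other = 0.0          # the other class is never predicted
    return {
        "accuracy": prevalence,
        "macro_f1": (f1_predicted + f1_other) / 2,
        "weighted_f1": prevalence * f1_predicted
                       + (1 - prevalence) * f1_other,
    }

# "Always Applicable" on EA has accuracy 0.52 in the table.
scores = constant_baseline_scores(0.52)
print(round(scores["macro_f1"], 2))     # 0.34
print(round(scores["weighted_f1"], 2))  # 0.36
```

Both values match the Always Applicable row for EA, and the same calculation with 0.48, 0.53, and 0.47 reproduces the remaining constant-baseline cells. This also shows why macro-F1 is a more informative column than accuracy for these baselines.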

Human-human agreement is moderate (Cohen's kappa ~0.46), while human-GPT alignment is substantial (kappa >0.6 on applicable labels within the human-consensus subset).
RoBERTa-based classifiers outperform all baselines. On the Human Set, Logistic Regression and Linear SVM reach 0.83-0.84 macro-F1 for EA and 0.77-0.80 for IA, while RoBERTa-base achieves 0.92 macro-F1 for EA and 0.87 for IA.
The RoBERTa-base model trained only on GPT labels achieves macro-F1 of 0.85 (EA) and 0.77 (IA) on the human-consensus test set, indicating that the patterns in GPT labels are learnable and transfer to human judgments.
McNemar tests show the transformer models are significantly better than both the trivial baselines and the classical baselines (p<10^-4 and p≤0.02).
Divergence analysis highlights challenges: subjectivity in implied distress, clinical-severity ambiguity, and contextual hardship, suggesting need for multi-annotator and culturally diverse annotation.
The work provides a benchmark of 1,300 queries with reliable EAF labels and demonstrates anticipatory empathy modeling with potential for clinician-in-the-loop integration.
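
The findings above cite McNemar tests for comparing classifiers evaluated on the same test items. The summary does not say which variant was used, so the following is only a sketch of the exact (binomial) form. The function name and the toy predictions are hypothetical. The test looks only at discordant items, those that exactly one of the two models classifies correctly.

```python
from math import comb

def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar test for two models on the same items.

    Returns (b, c, p_value): b counts items only model A gets right,
    c counts items only model B gets right.
    """
    if not (len(y_true) == len(pred_a) == len(pred_b)):
        raise ValueError("all inputs must have the same length")
    triples = list(zip(y_true, pred_a, pred_b))
    b = sum(pa == t and pb != t for t, pa, pb in triples)
    c = sum(pa != t and pb == t for t, pa, pb in triples)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant items, no evidence of a difference
    # Under the null hypothesis each discordant item favors A or B with
    # probability 0.5, so min(b, c) follows a Binomial(n, 0.5) tail.
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)

# Hypothetical toy predictions (not data from the paper).
y_true = [1] * 10
model_a = [1] * 10             # correct on all ten items
model_b = [0] * 8 + [1] * 2    # correct on the last two only
b, c, p = mcnemar_exact(y_true, model_a, model_b)
print(b, c, p)  # 8 0 0.0078125
```

Because the test is paired, it is suited to comparisons such as RoBERTa-base versus Logistic Regression on one shared human-consensus test set, where an unpaired comparison of accuracies would ignore how often the models err on the same queries.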

(b) Emotional Applicability (EA) subcategory matches

This review was generated by AI and checked by a human editor.