QUICK REVIEW

[Paper Review] Towards a Personal Health Large Language Model

Justin Cosentino, Anastasiya Belyaeva|arXiv (Cornell University)|Jun 10, 2024
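Chronic Disease Management Strategies | Cited by 12
One-Sentence Summary

PH-LLM is a Gemini-based model fine-tuned to reason over time-series personal health data from wearable devices. It generates personalized sleep and fitness insights, is benchmarked against expert performance, and predicts self-reported sleep outcomes.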

ABSTRACT

In health, most large language model (LLM) research has focused on clinical tasks. However, mobile and wearable devices, which are rarely integrated into such tasks, provide rich, longitudinal data for personal health monitoring. Here we present Personal Health Large Language Model (PH-LLM), fine-tuned from Gemini for understanding and reasoning over numerical time-series personal health data. We created and curated three datasets that test 1) production of personalized insights and recommendations from sleep patterns, physical activity, and physiological responses, 2) expert domain knowledge, and 3) prediction of self-reported sleep outcomes. For the first task we designed 857 case studies in collaboration with domain experts to assess real-world scenarios in sleep and fitness. Through comprehensive evaluation of domain-specific rubrics, we observed that Gemini Ultra 1.0 and PH-LLM are not statistically different from expert performance in fitness and, while experts remain superior for sleep, fine-tuning PH-LLM provided significant improvements in using relevant domain knowledge and personalizing information for sleep insights. We evaluated PH-LLM domain knowledge using multiple choice sleep medicine and fitness examinations. PH-LLM achieved 79% on sleep and 88% on fitness, exceeding average scores from a sample of human experts. Finally, we trained PH-LLM to predict self-reported sleep quality outcomes from textual and multimodal encoding representations of wearable data, and demonstrate that multimodal encoding is required to match performance of specialized discriminative models. Although further development and evaluation are necessary in the safety-critical personal health domain, these results demonstrate both the broad knowledge and capabilities of Gemini models and the benefit of contextualizing physiological data for personal health applications as done with PH-LLM.

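Motivation and Objectives

  • Motivate combining continuous wearable health data with LLMs to support personalized health coaching in sleep and fitness.
  • Develop and evaluate PH-LLM, a fine-tuned version of Gemini that interprets time-series sensor data and generates coaching recommendations.
  • Create new datasets (long-form case studies, professional examinations, and prediction of patient-reported outcomes, or PROs) to benchmark personal health question answering and coaching tasks.
  • Evaluate PH-LLM against domain experts and establish its ability to predict self-reported sleep outcomes from multimodal data.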

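Proposed Method

  • Fine-tune Gemini Ultra 1.0 on a curated dataset of sleep and fitness coaching case studies to create PH-LLM.
  • Build three benchmark datasets: long-form coaching case studies, multiple-choice questions in sleep medicine and fitness, and PRO prediction from wearable sensor data.
  • Evaluate PH-LLM's long-form responses with human expert ratings, and additionally with automatic evaluation (AutoEval) based on Gemini Pro 1.0 models fine-tuned with LoRA.
  • Train an MLP adapter that projects an encoded 20x2 wearable feature representation into PH-LLM's token space for PRO prediction, and compare it against text prompting and a logistic regression baseline.
  • Use expert rubrics to assess personalization, use of data, domain knowledge, safety, readability, and the overall quality of model responses.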
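The adapter step above can be illustrated with a small sketch. It is not the authors' implementation: the hidden width, the number of soft tokens, the embedding size, the ReLU activation, and the random weights are all assumptions chosen for illustration, and the 20x2 input is treated simply as 20 features with two values each. The helper names `init_adapter`, `linear`, and `adapt` are hypothetical.

```python
import random


def init_adapter(seed=0, in_dim=40, hidden_dim=16, n_tokens=4, embed_dim=8):
    """Create random weights for a two-layer MLP (all sizes are assumptions)."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

    return {
        "w1": matrix(in_dim, hidden_dim),
        "b1": [0.0] * hidden_dim,
        "w2": matrix(hidden_dim, n_tokens * embed_dim),
        "b2": [0.0] * (n_tokens * embed_dim),
        "n_tokens": n_tokens,
        "embed_dim": embed_dim,
    }


def linear(x, w, b):
    """Compute x @ w + b for a single input vector x."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) + b[j] for j in range(len(b))]


def adapt(params, features):
    """Map one 20x2 feature matrix to n_tokens vectors of length embed_dim."""
    x = [value for row in features for value in row]  # flatten 20x2 -> 40
    hidden = [max(h, 0.0) for h in linear(x, params["w1"], params["b1"])]  # ReLU
    out = linear(hidden, params["w2"], params["b2"])
    d = params["embed_dim"]
    return [out[k * d:(k + 1) * d] for k in range(params["n_tokens"])]


# Synthetic stand-in for one person's encoded wearable features (not real data).
rng = random.Random(1)
features = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(20)]
soft_tokens = adapt(init_adapter(), features)
print(len(soft_tokens), len(soft_tokens[0]))  # 4 8
```

In a setup like this, the resulting vectors would be placed alongside the text prompt's token embeddings, so the language model sees the sensor summary as a few extra "tokens" rather than as numbers written out in text.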
Figure 1: PH-LLM: A Personal Health Large Language Model. (A) We present PH-LLM, a version of Gemini fine-tuned for personal health and wellness. We evaluated PH-LLM on three aspects of personal health: generating personalized insights and recommendations for user goals in the domains of sleep and f

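Experimental Results

Research Questions

  • RQ1: Can PH-LLM generate personalized sleep and fitness insights and recommendations from longitudinal wearable data?
  • RQ2: How does PH-LLM compare with domain experts on long-form coaching case studies and on professional examinations in sleep medicine and fitness?
  • RQ3: Is multimodal encoding of wearable data both necessary and sufficient for predicting self-reported sleep outcomes?
  • RQ4: Does fine-tuning improve PH-LLM's use of domain knowledge and personalization relative to the baseline Gemini Ultra 1.0?
  • RQ5: How reliably does automatic evaluation (AutoEval) predict expert ratings of case study responses?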

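Key Findings

  • PH-LLM performs close to expert level in fitness coaching, and fine-tuning narrows its gap with experts in sleep coaching.
  • Multimodal encoding of sensor data is both necessary and sufficient to match discriminative models in predicting sleep disturbance and sleep impairment PROs.
  • PH-LLM scores 79% on the sleep MCQs (N=629) and 88% on the fitness MCQs (N=99), exceeding both average expert scores and continuing education benchmarks.
  • Compared with the base Gemini Ultra 1.0, fine-tuning improves PH-LLM's use of domain knowledge and its personalization.
  • The AutoEval framework can guide model selection, and its scores correlate with human expert ratings of the case studies.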
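One way to check the kind of agreement described in the last finding is a rank correlation between automatic and expert scores. The sketch below uses Spearman's rho as one common choice; the summary does not state which statistic the authors used, and the two score lists are made-up illustrative values, not data from the paper.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical ratings for eight case study responses (illustration only).
expert_scores = [5, 4, 4, 3, 2, 5, 1, 3]
auto_scores = [4.6, 4.1, 3.8, 3.2, 2.5, 4.9, 1.4, 2.9]
rho = spearman(expert_scores, auto_scores)
print(round(rho, 3))
```

A value near 1 would mean the automatic rater orders responses much as the experts do, which is the property that matters when AutoEval is used to pick between candidate models.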
Figure 2: Sleep case study example : wearable sensor data used as input and corresponding expert analysis and recommendations for improving sleep quality. The experts considered individual’s demographics and wearable sensor data for up to 29 days including daily metrics of (A) bedtimes and wake time


This review was generated by AI and reviewed by a human editor.