QUICK REVIEW

[Paper Review] Towards a Personal Health Large Language Model

Justin Cosentino, Anastasiya Belyaeva | arXiv (Cornell University) | 2024. 06. 10.

Chronic Disease Management Strategies | Citations: 12

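One-Line Summary

PH-LLM is a Gemini-based model fine-tuned to reason over time-series personal health data from wearables. It generates personalized sleep and fitness insights, is benchmarked against expert performance, and predicts self-reported sleep outcomes.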

ABSTRACT

In health, most large language model (LLM) research has focused on clinical tasks. However, mobile and wearable devices, which are rarely integrated into such tasks, provide rich, longitudinal data for personal health monitoring. Here we present Personal Health Large Language Model (PH-LLM), fine-tuned from Gemini for understanding and reasoning over numerical time-series personal health data. We created and curated three datasets that test 1) production of personalized insights and recommendations from sleep patterns, physical activity, and physiological responses, 2) expert domain knowledge, and 3) prediction of self-reported sleep outcomes. For the first task we designed 857 case studies in collaboration with domain experts to assess real-world scenarios in sleep and fitness. Through comprehensive evaluation of domain-specific rubrics, we observed that Gemini Ultra 1.0 and PH-LLM are not statistically different from expert performance in fitness and, while experts remain superior for sleep, fine-tuning PH-LLM provided significant improvements in using relevant domain knowledge and personalizing information for sleep insights. We evaluated PH-LLM domain knowledge using multiple choice sleep medicine and fitness examinations. PH-LLM achieved 79% on sleep and 88% on fitness, exceeding average scores from a sample of human experts. Finally, we trained PH-LLM to predict self-reported sleep quality outcomes from textual and multimodal encoding representations of wearable data, and demonstrate that multimodal encoding is required to match performance of specialized discriminative models. Although further development and evaluation are necessary in the safety-critical personal health domain, these results demonstrate both the broad knowledge and capabilities of Gemini models and the benefit of contextualizing physiological data for personal health applications as done with PH-LLM.

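Motivation and Objectives

Support personalized health coaching for sleep and fitness by integrating continuous wearable health data with LLMs.
Develop and evaluate PH-LLM, a Gemini model fine-tuned to interpret time-series sensor data and generate coaching recommendations.
Create new datasets (long-form case studies, professional exams, and PRO (patient-reported outcome) predictions) for benchmarking personal health QA and guidance tasks.
Assess the ability to predict patient-reported sleep outcomes from multimodal data, and compare PH-LLM's performance with that of domain experts.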

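Proposed Method

To build PH-LLM, Gemini Ultra 1.0 is fine-tuned on a curated dataset of sleep and fitness coaching case studies.
Three benchmark datasets are constructed: long-form coaching case studies, sleep medicine and fitness multiple-choice questions (MCQs), and PRO prediction from wearable sensor data.
PH-LLM is evaluated with human expert grading and with automatic evaluation (AutoEval) using LoRA-tuned Gemini Pro 1.0 models.
For PRO prediction, an MLP adapter is trained to project an encoded 20x2 wearable feature representation into the PH-LLM token space, and is compared against text-only prompting and a logistic regression baseline.
Expert rubrics are used to assess personalization, use of data, domain knowledge, safety, readability, and the overall quality of model responses.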
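
The adapter step can be pictured with a small NumPy sketch. This is an illustration under stated assumptions, not the authors' implementation: the review only says that a 20x2 wearable feature representation is projected into the PH-LLM token space by an MLP. The hidden width, the number of soft tokens, the embedding width, the ReLU activation, and the choice to prepend the projected tokens to the text embeddings are all hypothetical.

```python
import numpy as np


def init_adapter(rng, n_features=20, n_stats=2, hidden=64, n_tokens=4, d_model=32):
    """Create weights for a two-layer MLP adapter.

    Only the 20x2 input shape comes from the review. hidden, n_tokens and
    d_model are placeholder values chosen for illustration.
    """
    in_dim = n_features * n_stats
    out_dim = n_tokens * d_model
    return {
        "w1": rng.normal(0.0, 1.0 / np.sqrt(in_dim), (in_dim, hidden)),
        "b1": np.zeros(hidden),
        "w2": rng.normal(0.0, 1.0 / np.sqrt(hidden), (hidden, out_dim)),
        "b2": np.zeros(out_dim),
        "token_shape": (n_tokens, d_model),
    }


def adapter_forward(params, features):
    """Map a batch of (20, 2) feature grids to (n_tokens, d_model) soft tokens."""
    batch = features.shape[0]
    x = features.reshape(batch, -1)                        # flatten 20x2 -> 40
    h = np.maximum(x @ params["w1"] + params["b1"], 0.0)   # ReLU (assumed)
    out = h @ params["w2"] + params["b2"]
    return out.reshape(batch, *params["token_shape"])


def prepend_soft_tokens(soft_tokens, text_embeddings):
    """Concatenate soft tokens in front of the prompt's token embeddings.

    Prepending is an assumption; the review does not say where the projected
    tokens are placed in the sequence.
    """
    return np.concatenate([soft_tokens, text_embeddings], axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    params = init_adapter(rng)
    wearable = rng.normal(size=(3, 20, 2))    # synthetic stand-in features
    prompt = rng.normal(size=(3, 10, 32))     # synthetic prompt embeddings
    sequence = prepend_soft_tokens(adapter_forward(params, wearable), prompt)
    print(sequence.shape)
```

In a setup like this, only the adapter weights would need gradient updates while the language model consumes the combined sequence, which is one common reason to prefer a small projection over encoding every sensor value as text.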

Figure 1: PH-LLM: A Personal Health Large Language Model. (A) We present PH-LLM, a version of Gemini fine-tuned for personal health and wellness. We evaluated PH-LLM on three aspects of personal health: generating personalized insights and recommendations for user goals in the domains of sleep and f

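Experimental Results

Research Questions

RQ1: Can PH-LLM generate personalized sleep and fitness insights and recommendations from longitudinal wearable data?
RQ2: How does PH-LLM compare with domain experts on long-form coaching case studies and on professional examinations in sleep medicine and fitness?
RQ3: Is multimodal encoding of wearable data necessary, and is it sufficient, for predicting patient-reported sleep outcomes?
RQ4: Does fine-tuning PH-LLM improve domain knowledge and personalization relative to the base Gemini Ultra 1.0?
RQ5: How reliable is automatic evaluation (AutoEval) at predicting expert ratings of case study responses?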

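Key Results

PH-LLM approaches expert performance in fitness coaching, and fine-tuning substantially narrows the gap with experts in sleep coaching.
Multimodal encoding of sensor data is required to predict sleep disruption and sleep impairment PROs at a level that matches discriminative models.
PH-LLM achieves 79% on the sleep MCQs (N=629) and 88% on the fitness MCQs (N=99), exceeding average expert scores and continuing education benchmarks.
Fine-tuning improves PH-LLM's domain knowledge and personalization relative to the base Gemini Ultra 1.0.
The AutoEval framework correlates with human expert ratings of the case studies and can guide model selection.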
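
To make the AutoEval claim concrete, the sketch below shows one way agreement between automatic and expert ratings could be measured. The review does not say which statistic the authors used, so Spearman rank correlation is an assumption here, and the rating lists are toy values invented for illustration, not results from the paper.

```python
import numpy as np


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    i = 0
    while i < len(values):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def spearman(a, b):
    """Spearman correlation: Pearson correlation computed on average ranks."""
    ra, rb = average_ranks(a), average_ranks(b)
    ra = ra - ra.mean()
    rb = rb - rb.mean()
    denom = np.sqrt((ra ** 2).sum() * (rb ** 2).sum())
    if denom == 0.0:
        return float("nan")  # undefined when either list is constant
    return float((ra * rb).sum() / denom)


if __name__ == "__main__":
    # Hypothetical rubric scores for six case study responses.
    expert_ratings = [4, 5, 3, 2, 4, 1]
    auto_ratings = [4, 4, 3, 2, 5, 1]
    print(round(spearman(expert_ratings, auto_ratings), 3))
```

A rank-based statistic is a natural fit for ordinal rubric scores because it only asks whether the automatic rater orders responses the same way the experts do, which is what matters when the scores are used to choose between candidate models.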

Figure 2: Sleep case study example: wearable sensor data used as input and corresponding expert analysis and recommendations for improving sleep quality. The experts considered individual’s demographics and wearable sensor data for up to 29 days including daily metrics of (A) bedtimes and wake time

This review was generated by AI and reviewed by a human editor.