QUICK REVIEW

[Paper Review] A Study into patient similarity through representation learning from medical records

Hoda Memarzadeh, Nasser Ghadiri | arXiv (Cornell University) | 2021-04-29

Machine Learning in Healthcare | References: 54 | Citations: 7

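One-Line Summary

This paper proposes UTTree and UTTree-H, new patient representation models that integrate unstructured clinical notes with structured EMR data using a temporal tree structure enriched with UMLS-annotated entities. By applying relabeling strategies that capture past and current medical events, the models generate sequence-based embeddings that substantially improve patient similarity and mortality prediction, outperforming existing baselines on MSE, precision, and NDCG.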

ABSTRACT

Patient similarity assessment, which identifies patients similar to a given patient, can help improve medical care. The assessment can be performed using Electronic Medical Records (EMRs). Patient similarity measurement requires converting heterogeneous EMRs into comparable formats to calculate their distance. While versatile document representation learning methods have been developed in recent years, it is still unclear how complex EMR data should be processed to create the most useful patient representations. This study presents a new data representation method for EMRs that takes the information in clinical narratives into account. To address the limitations of previous approaches in handling complex parts of EMR data, an unsupervised method is proposed for building a patient representation, which integrates unstructured data with structured data extracted from patients' EMRs. In order to model the extracted data, we employed a tree structure that captures the temporal relations of multiple medical events from EMR. We processed clinical notes to extract symptoms, signs, and diseases using different tools such as medspaCy, MetaMap, and scispaCy and mapped entities to the Unified Medical Language System (UMLS). After creating a tree data structure, we utilized two novel relabeling methods for the non-leaf nodes of the tree to capture two temporal aspects of the extracted events. By traversing the tree, we generated a sequence that could create an embedding vector for each patient. The comprehensive evaluation of the proposed method for patient similarity and mortality prediction tasks demonstrated that our proposed model leads to lower mean squared error (MSE), higher precision, and normalized discounted cumulative gain (NDCG) relative to baselines.

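Research Motivation and Goals

Develop a unified patient representation model that effectively integrates unstructured clinical notes with structured EMR data.
Model the temporal relations between medical events, in particular past and current conditions, using a tree-based data structure.
Improve patient similarity assessment and mortality prediction by generating context-aware, low-dimensional embedding vectors from EMR sequences.
Demonstrate performance gains on key downstream metrics by comparing the proposed method against existing baselines on a real-world dataset.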

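Proposed Method

The paper builds a tree data structure from EMR events, with medical entities (e.g., diseases, symptoms) extracted by NLP tools (e.g., medspaCy, MetaMap, scispaCy) as nodes.
The extracted entities are mapped to the Unified Medical Language System (UMLS) to ensure semantic consistency and standardization.
Two new relabeling strategies are applied to the non-leaf nodes to encode the co-occurrence and temporal order of medical events.
Traversing the tree yields a sequence-based representation that preserves temporal dependencies, which is then used as input to a representation learning model such as PV-DM.
The enhanced UTTree-H variant explicitly incorporates past medical records by adjusting node labels according to whether a prior history is present.
Embedding quality is evaluated on the mortality prediction task by applying dimensionality reduction (PCA) and downstream classifiers (XGBoost, SVM, random forest).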
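
The tree construction, relabeling, and traversal steps can be illustrated with a small sketch. This review does not spell out the paper's exact relabeling rules, so the code below is only an assumption-laden stand-in: it labels each time-window node with a short hash of its children's labels, and an optional `with_history` flag (a hypothetical name) folds in the previous window's label to mimic the idea behind UTTree-H. The `Node` class, the function names, and the `CUI_*` concept strings are all illustrative placeholders, not the authors' implementation or real UMLS identifiers.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


def build_patient_tree(patient_id, windows):
    """Build a patient -> time window -> concept tree.

    `windows` is a time-ordered list; each item is the list of concept
    codes observed in that window (placeholder strings here).
    """
    root = Node(patient_id)
    for i, concepts in enumerate(windows):
        window = Node(f"window_{i}")
        # Sort so that the same set of concepts always gives the same signature.
        window.children = [Node(c) for c in sorted(set(concepts))]
        root.children.append(window)
    return root


def short_hash(text):
    # Assumption: any stable hash works as a compact label for a node.
    return hashlib.sha1(text.encode("utf-8")).hexdigest()[:8]


def relabel(root, with_history=False):
    """Relabel each non-leaf window node from its children's labels.

    With with_history=True the previous window's label is folded in, so
    identical events preceded by different histories get different labels.
    """
    previous = ""
    for window in root.children:
        signature = "|".join(child.label for child in window.children)
        if with_history:
            signature = previous + ">" + signature
        window.label = short_hash(signature)
        previous = window.label


def to_sequence(root):
    """Traverse windows in time order and emit one flat token sequence."""
    tokens = []
    for window in root.children:
        tokens.append(window.label)
        tokens.extend(child.label for child in window.children)
    return tokens


tree = build_patient_tree("patient_1", [["CUI_A", "CUI_B"], ["CUI_C"]])
relabel(tree, with_history=True)
print(to_sequence(tree))
```

Each patient's token sequence can then be treated as a document for a paragraph-vector model such as PV-DM, which is how the method described above obtains one embedding per patient.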

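Experimental Results

Research Questions

RQ1: Can unstructured and structured EMR data be effectively integrated into a single coherent patient representation?
RQ2: How does modeling the temporal relations between medical events affect patient similarity and predictive performance?
RQ3: Given that most models ignore past context, does integrating past medical history into the representation improve downstream task accuracy?
RQ4: How do the proposed relabeling strategies applied to tree nodes affect the quality of the generated embedding sequences?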

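Key Results

UTTree-H achieved a lower mean squared error (MSE) than all baseline models on the patient similarity task, and the result was statistically significant (p < 0.01).
The model showed higher precision and normalized discounted cumulative gain (NDCG) in patient similarity ranking, indicating improved retrieval quality.
When eight or more biomedical concepts were extracted from a patient record, UTTree-H consistently outperformed the other approaches in error reduction rate.
An XGBoost classifier trained on the proposed embeddings achieved the highest median accuracy on all datasets and showed a narrower interquartile range in the box plots, indicating strong stability and performance.
A Wilcoxon signed-rank test confirmed that the performance differences between UTTree and the baselines were statistically significant (p < 0.01) in all comparisons except one, which is marked with an asterisk (*).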
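
For readers unfamiliar with the metrics reported above, the following sketch shows one common way to compute MSE, precision at k, and NDCG at k for a ranked list of retrieved patients. The review does not state which NDCG variant or cutoff the paper uses, so the linear gain with a log2 discount and the function names here are assumptions chosen for illustration.

```python
import math


def mse(predicted, actual):
    """Mean squared error between predicted and reference similarity scores."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


def precision_at_k(relevance, k):
    """Fraction of the top-k retrieved patients that are relevant.

    `relevance` holds one graded relevance value per retrieved patient,
    in ranked order; any value above zero counts as relevant.
    """
    return sum(1 for r in relevance[:k] if r > 0) / k


def dcg_at_k(relevance, k):
    # Linear gain with a log2 position discount (rank 1 has discount log2(2) = 1).
    return sum(r / math.log2(i + 2) for i, r in enumerate(relevance[:k]))


def ndcg_at_k(relevance, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevance, reverse=True), k)
    return dcg_at_k(relevance, k) / ideal if ideal > 0 else 0.0


ranked_relevance = [1, 0, 1, 0]  # made-up relevance of four retrieved patients
print(precision_at_k(ranked_relevance, 2))
print(ndcg_at_k(ranked_relevance, 4))
```

A lower MSE means the predicted similarity scores are closer to the reference scores, while higher precision and NDCG mean that relevant patients appear nearer the top of the ranking.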


This review was generated by AI and reviewed by a human editor.