QUICK REVIEW

[Paper Review] Towards Automated ICD Coding Using Deep Learning

Haoran Shi, Pengtao Xie | arXiv (Cornell University) | November 11, 2017

Biomedical Text Mining and Ontologies | References: 26 | Citations: 127

One-line summary

This paper presents a hierarchical deep learning model with attention that automatically assigns ICD codes from diagnosis descriptions; on MIMIC-III data, the soft-attention model achieves an F1 of 0.532 and an AUC-ROC of 0.900.

ABSTRACT

International Classification of Diseases(ICD) is an authoritative health care classification system of different diseases and conditions for clinical and management purposes. Considering the complicated and dedicated process to assign correct codes to each patient admission based on overall diagnosis, we propose a hierarchical deep learning model with attention mechanism which can automatically assign ICD diagnostic codes given written diagnosis. We utilize character-aware neural language models to generate hidden representations of written diagnosis descriptions and ICD codes, and design an attention mechanism to address the mismatch between the numbers of descriptions and corresponding codes. Our experimental results show the strong potential of automated ICD coding from diagnosis descriptions. Our best model achieves 0.53 and 0.90 of F1 score and area under curve of receiver operating characteristic respectively. The result outperforms those achieved using character-unaware encoding method or without attention mechanism. It indicates that our proposed deep learning model can code automatically in a reasonable way and provide a framework for computer-auxiliary ICD coding.

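Motivation and objectives

Promote automated ICD coding to reduce coding errors and costs in health care.
Formulate ICD coding from diagnosis descriptions as a multi-label classification problem.
Develop a neural architecture that bridges the difference in writing style between diagnosis text and ICD code definitions.
Evaluate whether an attention mechanism improves the alignment between diagnosis descriptions and ICD codes.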

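Proposed method

Encode diagnosis descriptions with character-level and word-level LSTM networks to obtain hidden representations.
Encode ICD code definitions (long titles) with parallel character-level and word-level LSTMs to obtain code representations.
Compute an attention score between each ICD code and each diagnosis description as the cosine similarity of their hidden states.
Apply soft attention to aggregate the diagnosis descriptions into a code-specific vector, then project that vector to a probability through a sigmoid output layer.
Train with binary cross-entropy loss using the Adam optimizer, and tune the decision threshold to maximize F1 on validation data.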
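
The attention steps above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the softmax normalization of the cosine scores, the argmax reading of hard selection, and the names `soft_attention`, `hard_selection`, `code_probability`, `w`, and `b` are all assumptions made for the example, and the vectors stand in for the LSTM hidden states.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def softmax(scores):
    """Numerically stable softmax (assumed normalization of attention scores)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_attention(code_vec, desc_vecs):
    """Aggregate all description vectors into one code-specific vector."""
    weights = softmax([cosine(code_vec, d) for d in desc_vecs])
    dim = len(code_vec)
    context = [sum(w * d[i] for w, d in zip(weights, desc_vecs)) for i in range(dim)]
    return context, weights

def hard_selection(code_vec, desc_vecs):
    """Keep only the most similar description (assumed reading of hard selection)."""
    return list(max(desc_vecs, key=lambda d: cosine(code_vec, d)))

def code_probability(context, w, b):
    """Sigmoid output layer; w and b are hypothetical learned parameters."""
    z = sum(wi * xi for wi, xi in zip(w, context)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Toy example: two diagnosis descriptions, one ICD code, 2-dimensional states.
desc_vecs = [[1.0, 0.0], [0.0, 1.0]]
code_vec = [1.0, 0.0]
context, weights = soft_attention(code_vec, desc_vecs)
prob = code_probability(context, w=[0.5, -0.5], b=0.0)
```

In this toy case the first description matches the code exactly, so it receives the larger attention weight, while soft attention still lets the second description contribute to the code-specific vector. Hard selection would discard the second description entirely.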

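Experimental results

Research questions

RQ1: Can a hierarchical neural model with attention effectively map free-text diagnosis descriptions to multiple ICD codes?
RQ2: Does soft attention outperform hard selection in aligning diagnosis descriptions with ICD code definitions?
RQ3: How does the character-level encoder contribute to representations that are robust to medical terminology and typos?
RQ4: How does using ICD-9 code definitions affect coding performance on datasets such as MIMIC-III?

Key results

Model	F1	AUC-ROC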
Hard-selection Model	0.480	0.877
Soft-attention Model	0.532	0.900
Ablation: Random word embedding	0.508	0.882
Ablation: Pre-trained word embedding	0.528	0.895
Ablation: Average encoder	0.504	0.886
Ablation: No attention (linear classifier)	0.471	0.882

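The soft-attention model reaches an F1 of 0.532 and an AUC-ROC of 0.900, outperforming the hard-selection model.
The hard-selection model scores an F1 of 0.480 and an AUC-ROC of 0.877.
The ablation study indicates that both character-level encoding and attention are critical to performance; removing attention (linear classifier) lowers F1 to 0.471.
Replacing the character-level LSTM with a random or character-unaware encoder lowers both F1 and AUC-ROC (for example, random word embeddings give an F1 of 0.508 and an AUC-ROC of 0.882).
Pre-trained word embeddings help (F1 0.528) but do not surpass the character-level encoder in this setting.
Attention visualizations show that different ICD codes focus on different parts of the diagnosis descriptions.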
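
The F1 values reported here depend on a decision threshold, which the method tunes on validation data. The following is a rough sketch of such a threshold sweep, under assumptions not stated in this review: it uses micro-averaged F1 and a fixed grid of candidate thresholds, and the function names `micro_f1` and `tune_threshold` are hypothetical.

```python
def micro_f1(y_true, y_prob, threshold):
    """Micro-averaged F1 over all (admission, code) pairs at a given threshold."""
    tp = fp = fn = 0
    for true_row, prob_row in zip(y_true, y_prob):
        for t, p in zip(true_row, prob_row):
            pred = p >= threshold
            if pred and t:
                tp += 1
            elif pred and not t:
                fp += 1
            elif t:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def tune_threshold(y_true, y_prob, candidates):
    """Return the candidate threshold with the highest validation F1."""
    return max(candidates, key=lambda th: micro_f1(y_true, y_prob, th))

# Toy validation set: 2 admissions, 3 candidate ICD codes each (made-up values).
y_true = [[1, 0, 1], [0, 1, 0]]
y_prob = [[0.9, 0.3, 0.6], [0.2, 0.8, 0.1]]
best = tune_threshold(y_true, y_prob, [0.1, 0.3, 0.5, 0.7, 0.9])
```

The chosen threshold would then be held fixed when scoring the test set, so that the reported F1 is not tuned on the data it is measured on.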


This review was generated by AI and reviewed by a human editor.