QUICK REVIEW

[Paper Review] Towards Automated ICD Coding Using Deep Learning

Haoran Shi, Pengtao Xie | arXiv (Cornell University) | Nov 11, 2017

Biomedical Text Mining and Ontologies | References: 26 | Citations: 127

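One-Sentence Summary

This paper presents a hierarchical deep learning model with an attention mechanism that automatically assigns ICD codes from diagnosis descriptions. On MIMIC-III data, the soft-attention model achieves an F1 of 0.532 and an AUC-ROC of 0.900.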

ABSTRACT

International Classification of Diseases (ICD) is an authoritative health care classification system of different diseases and conditions for clinical and management purposes. Considering the complicated and dedicated process to assign correct codes to each patient admission based on overall diagnosis, we propose a hierarchical deep learning model with attention mechanism which can automatically assign ICD diagnostic codes given written diagnosis. We utilize character-aware neural language models to generate hidden representations of written diagnosis descriptions and ICD codes, and design an attention mechanism to address the mismatch between the numbers of descriptions and corresponding codes. Our experimental results show the strong potential of automated ICD coding from diagnosis descriptions. Our best model achieves 0.53 and 0.90 of F1 score and area under curve of receiver operating characteristic respectively. The result outperforms those achieved using character-unaware encoding method or without attention mechanism. It indicates that our proposed deep learning model can code automatically in a reasonable way and provide a framework for computer-auxiliary ICD coding.

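Motivation and Objectives

Motivates automated ICD coding as a way to reduce coding errors and costs in clinical practice.
Formulates ICD coding from diagnosis descriptions as a multi-label classification problem.
Develops a neural architecture that bridges the stylistic gap between diagnosis text and ICD code definitions.
Evaluates whether an attention mechanism improves the alignment between diagnosis descriptions and ICD codes.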
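
To make the multi-label formulation concrete, each admission can be represented by a multi-hot target vector over the code vocabulary. The sketch below is only an illustration under assumptions: the four-code vocabulary and the helper name `to_multi_hot` are hypothetical and are not taken from the paper.

```python
# Hypothetical code vocabulary for illustration; a real label set is far larger.
CODE_VOCAB = ["401.9", "428.0", "427.31", "250.00"]
CODE_INDEX = {code: i for i, code in enumerate(CODE_VOCAB)}


def to_multi_hot(assigned_codes, code_index=CODE_INDEX):
    """Return a 0/1 vector with a 1 at the position of every assigned code."""
    target = [0] * len(code_index)
    for code in assigned_codes:
        target[code_index[code]] = 1
    return target


# One admission may carry several codes at once, which is what makes the
# task multi-label rather than multi-class.
example_target = to_multi_hot(["428.0", "250.00"])
print(example_target)
```

With targets in this form, each code becomes an independent yes/no decision, which is consistent with the sigmoid output and binary cross-entropy loss described in the method.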

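Proposed Method

Encodes diagnosis descriptions with character-level and word-level LSTM networks to obtain hidden representations.
Encodes ICD code definitions (long titles) with parallel character-level and word-level LSTMs to obtain code representations.
Computes an attention score between each ICD code and each diagnosis description from the cosine similarity of their hidden states.
Applies soft attention to aggregate the diagnosis descriptions into a code-specific vector, then projects it to a probability through a sigmoid output layer.
Trains with a binary cross-entropy loss using the Adam optimizer, and tunes the decision threshold to obtain the best F1 on the validation data.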
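
The attention and output steps above can be sketched in a few lines of plain Python. This is a minimal sketch under assumptions, not the authors' implementation: the softmax normalization of the cosine scores, the function names, and the toy vectors and weights are all hypothetical, and the LSTM encoders are replaced by fixed vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def code_probability(code_vec, desc_vecs, weight, bias):
    """Soft-attend over description vectors for one code, then apply a sigmoid."""
    # Attention scores: cosine similarity between the code and each description.
    scores = [cosine(code_vec, d) for d in desc_vecs]
    # Assumption of this sketch: scores are normalized with a softmax.
    attn = softmax(scores)
    # Code-specific context: attention-weighted sum of the description vectors.
    dim = len(code_vec)
    context = [sum(a * d[i] for a, d in zip(attn, desc_vecs)) for i in range(dim)]
    # Linear projection followed by a sigmoid gives a probability in (0, 1).
    logit = sum(w * c for w, c in zip(weight, context)) + bias
    return attn, 1.0 / (1.0 + math.exp(-logit))


# Toy hidden states (hypothetical values, not learned representations).
descriptions = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
code = [1.0, 0.1]
attn, prob = code_probability(code, descriptions, weight=[0.5, -0.5], bias=0.0)
print(attn, prob)
```

Because the attention weights are computed per code, the number of diagnosis descriptions does not have to match the number of codes: every code receives its own weighted summary of however many descriptions the admission contains.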

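Experimental Results

Research Questions

RQ1: Can a hierarchical neural model with an attention mechanism effectively map free-text diagnosis descriptions to multiple ICD codes?
RQ2: Does soft attention outperform hard selection in aligning diagnosis descriptions with ICD code definitions?
RQ3: How do character-level encoders contribute to robust representations of medical terminology and misspellings?
RQ4: How does the use of ICD-9 code definitions affect coding performance on datasets such as MIMIC-III?

Key Findings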

Model	F1	AUC-ROC
Hard-selection Model	0.480	0.877
Soft-attention Model	0.532	0.900
Ablation: Random word embedding	0.508	0.882
Ablation: Pre-trained word embedding	0.528	0.895
Ablation: Average encoder	0.504	0.886
Ablation: No attention (linear classifier)	0.471	0.882

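Soft attention raises F1 to 0.532 and AUC-ROC to 0.900, outperforming the hard-selection model.
Hard selection yields an F1 of 0.480 and an AUC-ROC of 0.877.
The ablation study shows that both character-level encoding and attention are important for performance.
Replacing the character-level LSTM with a random or character-unaware encoder lowers both F1 and AUC-ROC.
Pre-trained word embeddings help, but in this setting they do not outperform the character-level encoder.
Attention visualizations show that different ICD codes attend to different diagnosis descriptions.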
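
The F1 values reported here depend on the decision threshold applied to the predicted probabilities, which the method tunes on validation data. The following is a rough sketch of such tuning under assumptions: micro-averaged F1, the candidate grid, the function names, and the toy validation data are all hypothetical, since this review does not state the averaging scheme or the search procedure.

```python
def micro_f1(y_true, y_prob, threshold):
    """Micro-averaged F1 over all (admission, code) pairs at a given threshold."""
    tp = fp = fn = 0
    for true_row, prob_row in zip(y_true, y_prob):
        for t, p in zip(true_row, prob_row):
            pred = 1 if p >= threshold else 0
            if pred and t:
                tp += 1
            elif pred and not t:
                fp += 1
            elif t and not pred:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def tune_threshold(y_true, y_prob, candidates):
    """Return the first candidate threshold with the highest validation F1."""
    return max(candidates, key=lambda th: micro_f1(y_true, y_prob, th))


# Toy validation set (hypothetical): 3 admissions, 3 codes.
val_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
val_prob = [[0.8, 0.3, 0.45], [0.2, 0.6, 0.35], [0.7, 0.42, 0.1]]
grid = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
best = tune_threshold(val_true, val_prob, grid)
print(best, micro_f1(val_true, val_prob, best))
```

The threshold is chosen on validation data only and then held fixed for the test set, so the reported test F1 is not inflated by the tuning.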

This review was generated by AI and checked by a human editor.