QUICK REVIEW

[Paper Review] MIMIC-Extract: A Data Extraction, Preprocessing, and Representation Pipeline for MIMIC-III

Shirly Wang, Matthew B. A. McDermott|arXiv (Cornell University)|Jul 19, 2019

Machine Learning in Healthcare|References: 20|Cited by: 42

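One-Sentence Summary

MIMIC-Extract is an open-source pipeline that extracts and preprocesses MIMIC-III EHR data and represents it as time-series features, interventions, and outcomes that can be used directly to evaluate and benchmark machine learning models. It emphasizes robustness, reproducibility, and extensibility, and targets ICU prediction tasks.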

ABSTRACT

Robust machine learning relies on access to data that can be used with standardized frameworks in important tasks and the ability to develop models whose performance can be reasonably reproduced. In machine learning for healthcare, the community faces reproducibility challenges due to a lack of publicly accessible data and a lack of standardized data processing frameworks. We present MIMIC-Extract, an open-source pipeline for transforming raw electronic health record (EHR) data for critical care patients contained in the publicly-available MIMIC-III database into dataframes that are directly usable in common machine learning pipelines. MIMIC-Extract addresses three primary challenges in making complex health records data accessible to the broader machine learning community. First, it provides standardized data processing functions, including unit conversion, outlier detection, and aggregating semantically equivalent features, thus accounting for duplication and reducing missingness. Second, it preserves the time series nature of clinical data and can be easily integrated into clinically actionable prediction tasks in machine learning for health. Finally, it is highly extensible so that other researchers with related questions can easily use the same pipeline. We demonstrate the utility of this pipeline by showcasing several benchmark tasks and baseline results.

Motivation and Objectives

Provide a robust, reproducible pipeline that converts raw MIMIC-III EHR data into a time-series format ready for machine learning

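Proposed Method

Cohort extraction focused on each patient's first adult ICU stay (age ≥15, stay duration of at least 12 hours and under 10 days)
Unit standardization and outlier handling based on clinically informed thresholds
Hourly aggregation of vital signs and laboratory tests into time-series features, with clinical grouping of equivalent measurements to reduce missingness
Extraction of hourly intervention signals (ventilation, vasopressors, fluid therapy) and static outcomes
Two feature representations: features keyed by raw ItemID and clinically aggregated features
Extensible design that uses keyword-driven configuration, resource files, and embedded SQL to support customization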
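The first three steps above (cohort filtering, threshold-based cleaning, and hourly aggregation) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the pipeline's actual code: the function names, the record layout, and the numeric thresholds are hypothetical, and the cleaning policy shown (drop values beyond an outer outlier bound, clamp values that are only slightly outside the valid range) is one plausible reading of "clinically informed thresholds".

```python
import math
from collections import defaultdict
from statistics import mean

# Hypothetical thresholds, for illustration only (not taken from MIMIC-Extract).
# Layout: (outlier_low, valid_low, valid_high, outlier_high)
THRESHOLDS = {"heart_rate": (0.0, 0.0, 350.0, 390.0)}


def in_cohort(age_years, stay_hours, is_first_icu_stay):
    """Cohort rule from the summary: first ICU stay, age >= 15, 12h <= stay < 10d."""
    return is_first_icu_stay and age_years >= 15 and 12 <= stay_hours < 10 * 24


def clean_value(name, value):
    """Assumed policy: drop extreme outliers, clamp values just outside the valid range."""
    outlier_low, valid_low, valid_high, outlier_high = THRESHOLDS[name]
    if value < outlier_low or value > outlier_high:
        return None  # treated as missing
    return min(max(value, valid_low), valid_high)


def hourly_means(events):
    """Aggregate (hours_since_admission, feature_name, value) events into hourly means."""
    buckets = defaultdict(list)
    for hours, name, value in events:
        cleaned = clean_value(name, value)
        if cleaned is not None:
            buckets[(math.floor(hours), name)].append(cleaned)
    # Hours with no surviving measurement are simply absent (i.e. missing).
    return {key: mean(values) for key, values in sorted(buckets.items())}


if __name__ == "__main__":
    events = [
        (0.2, "heart_rate", 80.0),
        (0.7, "heart_rate", 90.0),
        (1.5, "heart_rate", 1000.0),  # beyond the outlier bound, dropped
        (2.1, "heart_rate", 360.0),   # slightly above valid range, clamped
    ]
    print(in_cohort(age_years=63, stay_hours=40, is_first_icu_stay=True))
    print(hourly_means(events))
```

Keeping the thresholds in a separate table mirrors the resource-file idea in the last bullet: the cleaning logic stays fixed while the clinical bounds can be edited without touching code.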

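Experimental Results

Research Questions

RQ1: How can MIMIC-III data be converted into standardized, robust hourly time series suitable for prediction tasks?
RQ2: Can a general, reproducible data pipeline improve comparability across studies and the benchmarking of ICU machine learning models built on MIMIC-III?
RQ3: How do clinical aggregation, unit conversion, and outlier handling affect model robustness to temporal drift?
RQ4: Which prediction tasks (mortality, length of stay (LOS), hourly interventions) are feasible on the extracted data, and how do baseline models perform?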

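Key Findings

The pipeline outputs a default cohort of 34,472 patients with both static and time-varying data, suitable for multiple benchmark tasks
Two output formats are provided: raw item-level features and clinically aggregated features, the latter intended to improve robustness
Outlier detection and unit conversion are applied, with data cleaning guided by clinically informed thresholds
Hourly interventions for ventilation, vasopressors, and fluid therapy are included as time-varying signals
Benchmark tasks include mortality and LOS prediction, plus prediction of the start and end of hourly interventions, using several models (LR, RF, GRU-D)
GRU-D and RF generally achieve strong AUROC/AUPRC on most tasks, while F1 and accuracy patterns vary, suggesting task-specific model strengths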
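To make the hourly intervention signal concrete, the sketch below turns treatment intervals into a per-hour binary indicator and then lists the hours at which the signal switches on or off. It is an illustrative simplification, not the benchmark definition from the paper: the function names, the interval format, and the "start"/"stop" labels are assumptions introduced here.

```python
import math


def hourly_indicator(intervals, n_hours):
    """Mark each hour touched by any (start_hour, end_hour) interval with 1.

    Times are hours since ICU admission; an hour counts as "on" if the
    intervention was active during any part of it (an assumed convention).
    """
    signal = [0] * n_hours
    for start, end in intervals:
        first = max(0, math.floor(start))
        last = min(n_hours, math.ceil(end))
        for hour in range(first, last):
            signal[hour] = 1
    return signal


def transitions(signal):
    """Return (hour, label) pairs where the indicator changes value."""
    changes = []
    for hour in range(1, len(signal)):
        if signal[hour - 1] == 0 and signal[hour] == 1:
            changes.append((hour, "start"))
        elif signal[hour - 1] == 1 and signal[hour] == 0:
            changes.append((hour, "stop"))
    return changes


if __name__ == "__main__":
    ventilation = hourly_indicator([(1.5, 3.2)], n_hours=6)
    print(ventilation)
    print(transitions(ventilation))
```

A start/stop prediction task would typically predict such transitions some hours ahead from a window of past features; that windowing is omitted here to keep the example short.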

This review was generated by AI and reviewed by a human editor.