QUICK REVIEW

[Paper Review] A machine learning methodology for real-time forecasting of the 2019-2020 COVID-19 outbreak using Internet searches, news alerts, and estimates from mechanistic models

Dianbo Liu, Leonardo Clemente|PubMed|Apr 8, 2020

COVID-19 epidemiological studies | References: 35 | Cited by: 102

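One-Sentence Summary

This paper presents Augmented ARGONet, a real-time, two-day-ahead forecasting framework that combines China CDC reports, Baidu search activity, Media Cloud news coverage, and output from the GLEAM mechanistic model, together with clustering and data augmentation, to predict COVID-19 activity at the province level in China.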

ABSTRACT

We present a timely and novel methodology that combines disease estimates from mechanistic models with digital traces, via interpretable machine-learning methodologies, to reliably forecast COVID-19 activity in Chinese provinces in real-time. Specifically, our method is able to produce stable and accurate forecasts 2 days ahead of current time, and uses as inputs (a) official health reports from Chinese Center for Disease Control and Prevention (China CDC), (b) COVID-19-related internet search activity from Baidu, (c) news media activity reported by Media Cloud, and (d) daily forecasts of COVID-19 activity from GLEAM, an agent-based mechanistic model. Our machine-learning methodology uses a clustering technique that enables the exploitation of geo-spatial synchronicities of COVID-19 activity across Chinese provinces, and a data augmentation technique to deal with the small number of historical disease activity observations, characteristic of emerging outbreaks. Our model's predictive power outperforms a collection of baseline models in 27 out of the 32 Chinese provinces, and could be easily extended to other geographies currently affected by the COVID-19 outbreak to help decision makers.

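Research Motivation and Objectives

Motivate real-time forecasting for emerging outbreaks in which historical data are scarce.
Develop a data-driven, geo-spatially aware model that draws on multiple data streams.
Mitigate data scarcity through clustering and data augmentation, so that forecasting models can be trained per province cluster.
Assess the added value of incorporating mechanistic model estimates into a data-driven forecasting framework.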

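Proposed Method

Cluster provinces based on spatio-temporal patterns of COVID-19 activity, and retrain the models at every forecast date.
Augment the data by bootstrap resampling with added Gaussian noise, enlarging the training set available to each cluster.
Fit a LASSO multivariable linear model that uses past cases, Baidu searches, Media Cloud articles, deaths, and cumulative cases as inputs to predict case counts 2 days ahead.
Feed mechanistic model estimates (GLEAM) into Augmented ARGONet as an additional input, before the clustering and augmentation steps.
Compare against a persistence baseline, an autoregressive model, and ARGONet without mechanistic inputs to quantify the forecasting gain.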
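
The augmentation and LASSO steps listed above can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the function names (augment, lasso_fit, predict), the parameters n_copies, noise_sd, and lam, the feature layout, and all numbers in the demo are assumptions made for illustration, and the clustering step is omitted. The LASSO solver is a plain coordinate-descent routine minimizing (1/2n)·||y − b0 − Xb||² + lam·||b||₁.

```python
import random


def augment(rows, targets, n_copies, noise_sd, rng):
    """Bootstrap-resample (features, target) pairs and jitter the features.

    n_copies and noise_sd are illustrative knobs, not values from the paper.
    """
    aug_rows = [list(r) for r in rows]
    aug_targets = list(targets)
    for _ in range(n_copies):
        i = rng.randrange(len(rows))  # draw one observation with replacement
        aug_rows.append([x + rng.gauss(0.0, noise_sd) for x in rows[i]])
        aug_targets.append(targets[i])
    return aug_rows, aug_targets


def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_fit(X, y, lam, n_iter=200):
    """Coordinate descent for LASSO with an unpenalized intercept."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    intercept = sum(y) / n
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0   # correlation of feature j with the partial residual
            norm = 0.0  # squared norm of feature j
            for i in range(n):
                fitted_wo_j = intercept + sum(
                    X[i][k] * beta[k] for k in range(p) if k != j
                )
                rho += X[i][j] * (y[i] - fitted_wo_j)
                norm += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, lam) / (norm / n) if norm > 0 else 0.0
        # Re-center: the optimal intercept is the mean residual.
        intercept = sum(
            y[i] - sum(X[i][k] * beta[k] for k in range(p)) for i in range(n)
        ) / n
    return intercept, beta


def predict(intercept, beta, row):
    """Linear prediction for one feature row."""
    return intercept + sum(b * x for b, x in zip(beta, row))


if __name__ == "__main__":
    rng = random.Random(0)
    # Made-up rows: [cases today, search volume, news count, mechanistic estimate]
    X = [[10, 5, 2, 12], [14, 7, 3, 15], [19, 8, 3, 21], [25, 11, 5, 27]]
    y = [14, 19, 25, 33]  # made-up case counts two days later
    X_aug, y_aug = augment(X, y, n_copies=40, noise_sd=0.5, rng=rng)
    intercept, beta = lasso_fit(X_aug, y_aug, lam=0.1)
    print(predict(intercept, beta, [30, 13, 6, 33]))
```

In this sketch the L1 penalty is what drives uninformative inputs to exactly zero, which is one reason a LASSO fit stays interpretable when several data streams are combined. In the framework described above, the rows passed to augment would be pooled across the provinces in one cluster rather than taken from a single province.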

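Experimental Results

Research Questions

RQ1: Can multi-source data (official reports, Internet searches, news, and mechanistic forecasts) be used to forecast COVID-19 activity at the province level in near real time?
RQ2: When historical observations are limited, do clustering and data augmentation improve forecasting performance?
RQ3: What is the incremental value of including mechanistic model estimates in a data-driven forecasting framework?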

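Key Findings

Augmented ARGONet outperforms the persistence baseline on two-day-ahead forecasts in 27 of the 32 Chinese provinces.
Relative to the baseline, ARGONet with clustering and augmentation improves RMSE in 25 of the 32 provinces and improves correlation in 18 provinces.
Including mechanistic model estimates improves forecasting ability in most provinces.
Some regions (Taiwan, Hong Kong, Guangxi, Shanxi, Liaoning) show no RMSE improvement, possibly because of differences in their administrative or health systems.
Models that use only a province's own data generally do not beat the baseline, whereas ARGO-type models show limited improvement.
Overall, the method demonstrates real-time forecasting capability for emerging outbreaks with limited data.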
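
The comparisons above rest on a persistence baseline, RMSE, and correlation. As a rough illustration of how such a comparison could be computed, the sketch below assumes that persistence means "the forecast for day t+2 is the value observed on day t" and that correlation means Pearson correlation; the helper names (persistence_forecast, rmse, pearson, count_improved) and the numbers in the demo are hypothetical and not taken from the paper.

```python
import math


def persistence_forecast(series, horizon=2):
    """Forecast series[t + horizon] as series[t].

    Returns (predictions, actuals) aligned element by element.
    """
    if horizon < 1 or horizon >= len(series):
        raise ValueError("horizon must be between 1 and len(series) - 1")
    return series[:-horizon], series[horizon:]


def rmse(pred, actual):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(actual))


def pearson(pred, actual):
    """Pearson correlation coefficient (undefined if either input is constant)."""
    n = len(actual)
    mean_p, mean_a = sum(pred) / n, sum(actual) / n
    cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(pred, actual))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    return cov / math.sqrt(var_p * var_a)


def count_improved(model_rmse, baseline_rmse):
    """Count regions where the model's RMSE is strictly below the baseline's."""
    return sum(
        1 for region in model_rmse if model_rmse[region] < baseline_rmse[region]
    )


if __name__ == "__main__":
    cases = [12, 15, 21, 30, 41, 50, 58, 61]  # made-up daily counts
    preds, actuals = persistence_forecast(cases, horizon=2)
    print("persistence RMSE:", rmse(preds, actuals))
    print("persistence correlation:", pearson(preds, actuals))
```

A tally such as "27 of 32 provinces" would then correspond to calling count_improved with per-province RMSE values for the model and for the baseline.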

This review was generated by AI and reviewed by a human editor.