QUICK REVIEW

[Paper Review] Local calibration of verbal autopsy algorithms

Abhirup Datta, Jacob Fiksel | arXiv (Cornell University) | Oct 24, 2018

Machine Learning in Healthcare | References: 10 | Citations: 1

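Summary in brief

This paper proposes a hierarchical Bayesian transfer learning framework for locally calibrating verbal autopsy algorithms, with the goal of improving population-level estimates of cause-specific mortality fractions. By combining a shrinkage prior with a novel Gibbs sampler, it ensures that the calibrated estimates coincide with those of the baseline classifier when no local data are available, and it outperforms non-local training in small-sample settings.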

ABSTRACT

Computer-coded verbal autopsy (CCVA) algorithms predict cause of death from high-dimensional family questionnaire data (verbal autopsies) of a deceased individual. CCVA algorithms are typically trained on non-local data, then used to generate national and regional estimates of cause-specific mortality fractions. These estimates may be inaccurate if the non-local training data is different from the local population of interest. This problem is a special case of transfer learning. However, most transfer learning classification approaches are concerned with individual (e.g. a person's) classification within a target domain (e.g. a particular population) with training performed in data from a source domain. Epidemiologists are often more interested in estimating population-level etiological distributions, using datasets much smaller than those used in common transfer learning applications. We present a parsimonious hierarchical Bayesian transfer learning framework to directly estimate population-level class probabilities in a target domain. To address small sample sizes, we introduce a novel shrinkage prior for the transfer error rates guaranteeing that, in absence of any labeled target domain data or when the baseline classifier has zero transfer error, the calibrated estimate of class probabilities coincides with the naive estimates from the baseline classifier, thereby subsuming the default practice as a special case. A novel Gibbs sampler using data-augmentation enables fast implementation. We extend our approach to use not one, but an ensemble of baseline classifiers. Theoretical and empirical results demonstrate how the ensemble model favors the most accurate baseline classifier. We present extensions allowing class probabilities to vary with covariates, and an EM-algorithm-based MAP estimation. An R-package implementing this method is developed.
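
The special-case property stated in the abstract can be made concrete with a short derivation. The notation below (p, q, M, gamma, epsilon) is an assumption chosen for illustration, and the Dirichlet form of the shrinkage prior is one plausible choice rather than a quotation from the paper.

```latex
% Assumed notation: C causes of death;
% p_i    = P(true cause = i) in the target population (the quantity of interest),
% m_{ij} = P(classifier predicts j | true cause = i), the transfer error rates,
% q_j    = P(classifier predicts j) in the target population.
q_j = \sum_{i=1}^{C} p_i \, m_{ij}
\qquad \text{i.e.} \qquad
q = M^{\top} p .

% Zero transfer error means M = I, and the calibrated and naive estimates agree:
M = I \;\Longrightarrow\; p = q .

% One possible prior that shrinks row i of M toward row i of the identity
% (e_i is the i-th unit vector, \mathbf{1} the all-ones vector):
M_{i\cdot} \sim \mathrm{Dirichlet}\!\bigl(\gamma\,(e_i + \epsilon\,\mathbf{1})\bigr),
\qquad
\mathbb{E}[M_{i\cdot}] = \frac{e_i + \epsilon\,\mathbf{1}}{1 + C\epsilon}
\;\longrightarrow\; e_i \quad \text{as } \epsilon \to 0 .
```

Under these assumptions, with no labeled target data the prior alone determines M, and M stays close to the identity, so solving q = M^T p returns approximately the naive estimate q.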

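Motivation and objectives

Correct the inaccuracy in cause-specific mortality fraction estimates that arises when verbal autopsy algorithms trained on non-local data are applied to a local population.
Develop a method that calibrates population-level class probabilities using limited local data.
Guarantee that, when no local labels are available, the calibrated estimates default to the output of the baseline classifier.
Extend the framework to use an ensemble of baseline classifiers for greater robustness.
Allow class probabilities to depend on covariates, and enable MAP estimation via an EM algorithm.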

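Proposed method

A hierarchical Bayesian transfer learning model is proposed to estimate population-level cause-specific mortality fractions in the target domain.
A novel shrinkage prior on the transfer error rates guarantees that, when no local data exist, the calibration defaults to the baseline classifier.
A Gibbs sampler based on data augmentation enables efficient posterior computation even for high-dimensional inputs.
The framework is extended to an ensemble of baseline classifiers, with posterior weighting that favors the most accurate classifier.
Covariate-dependent class probabilities are incorporated to make the model more flexible.
An EM-algorithm-based MAP estimation procedure is developed for scalable inference.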
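
The calibration idea in the list above can be sketched in a few lines of Python. This is a simplified point-estimate version written for illustration, not the authors' implementation: it assumes the baseline predictions follow q = M^T p, estimates the misclassification matrix M as the posterior mean under an assumed Dirichlet prior centered on the identity, and then runs a plain EM loop for p (maximum likelihood, which equals MAP under a flat prior on p). The function names, the hyperparameters `gamma` and `eps`, and all counts are hypothetical.

```python
import numpy as np


def shrunk_confusion(labeled_counts, gamma=5.0, eps=1e-3):
    """Posterior-mean estimate of the misclassification matrix M.

    labeled_counts[i, j] = number of locally labeled deaths with true cause i
    that the baseline classifier assigned to cause j.
    Assumed prior: row i ~ Dirichlet(gamma * (e_i + eps)), which pulls the
    row toward the i-th row of the identity matrix.
    """
    counts = np.asarray(labeled_counts, dtype=float)
    k = counts.shape[0]
    posterior = counts + gamma * (np.eye(k) + eps)
    return posterior / posterior.sum(axis=1, keepdims=True)


def calibrate(pred_counts, m, n_iter=1000):
    """EM estimate of p in the mixture q = M^T p, given predicted-cause counts."""
    v = np.asarray(pred_counts, dtype=float)
    k = len(v)
    p = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        joint = p[:, None] * m                           # joint[i, j] = P(true=i, predicted=j)
        resp = joint / joint.sum(axis=0, keepdims=True)  # P(true=i | predicted=j)
        p = resp @ v / v.sum()                           # expected share of each true cause
    return p


# Made-up counts: unlabeled deaths assigned to 3 causes by a baseline classifier.
pred_counts = np.array([480, 280, 240])
naive = pred_counts / pred_counts.sum()

# No local labels: M is close to the identity, so calibration stays near the naive estimate.
p_no_labels = calibrate(pred_counts, shrunk_confusion(np.zeros((3, 3))))

# A small labeled local sample (rows = true cause, columns = predicted cause).
labeled = np.array([[16, 2, 2], [4, 14, 2], [2, 2, 16]])
p_calibrated = calibrate(pred_counts, shrunk_confusion(labeled))

print("naive:          ", naive.round(3))
print("no local labels:", p_no_labels.round(3))
print("calibrated:     ", p_calibrated.round(3))
```

Setting `eps` closer to zero makes the no-label result approach the naive estimate more closely, while a larger `gamma` makes the labeled counts move M away from the identity more slowly.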

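Experimental results

Research questions

RQ1: When the training data are non-local, can the transfer learning framework improve the accuracy of cause-specific mortality fraction estimates?
RQ2: How can calibration be guaranteed when local data are scarce or absent?
RQ3: Does using an ensemble of baseline classifiers improve estimation performance compared with a single classifier?
RQ4: Can the method adapt to covariate-dependent variation in the cause-of-death distribution?
RQ5: Why does the shrinkage prior keep the estimates consistent with the baseline classifier when no local labels are available?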

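Key findings

The shrinkage prior guarantees that the calibrated estimates coincide with those of the baseline classifier when no local data are available, so the default practice is retained as a special case.
The Gibbs sampler enables fast, scalable posterior computation even for high-dimensional verbal autopsy data.
Empirical results show that the method outperforms non-local training, particularly when the baseline classifier is imperfect.
The ensemble model favors the most accurate baseline classifier, which makes it more robust to poor source models.
The extension to covariate-dependent probabilities allows more refined, context-specific estimates of mortality fractions.
An R package implementing the method has been developed, making it available for practical use by epidemiologists.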
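
To illustrate how a data-augmentation Gibbs sampler can work for this kind of model, here is a minimal sketch. It assumes the model q = M^T p with Dirichlet priors on p and on each row of M, and augments the data with the latent true cause of each unlabeled death, which makes both updates conjugate. This is a standard augmentation for finite mixtures with the hyperparameters `gamma`, `eps`, and `delta` held fixed; the paper's sampler may differ in its details, and every name and count here is hypothetical.

```python
import numpy as np


def gibbs_calibration(pred_counts, labeled_counts, gamma=5.0, eps=1e-2,
                      delta=1.0, n_iter=2000, burn_in=500, seed=0):
    """Posterior mean of p under q = M^T p, via data-augmentation Gibbs sampling."""
    rng = np.random.default_rng(seed)
    v = np.asarray(pred_counts, dtype=int)        # unlabeled deaths per predicted cause
    t = np.asarray(labeled_counts, dtype=float)   # labeled: rows = true, cols = predicted
    k = len(v)
    prior_m = gamma * (np.eye(k) + eps)           # assumed prior, centered on the identity
    p = np.full(k, 1.0 / k)
    m = prior_m / prior_m.sum(axis=1, keepdims=True)
    draws = []
    for it in range(n_iter):
        # Augmentation: split the v[j] deaths predicted as cause j among latent true causes.
        b = np.zeros((k, k))
        for j in range(k):
            w = p * m[:, j]
            b[:, j] = rng.multinomial(v[j], w / w.sum())
        # Conjugate Dirichlet updates given the augmented counts.
        p = rng.dirichlet(delta + b.sum(axis=1))
        for i in range(k):
            m[i] = rng.dirichlet(prior_m[i] + t[i] + b[i])
        if it >= burn_in:
            draws.append(p)
    return np.mean(draws, axis=0)


# Made-up example: 5000 unlabeled predictions and 300 labeled deaths per true cause.
pred_counts = [2400, 1400, 1200]
labeled = [[240, 30, 30], [60, 210, 30], [30, 30, 240]]
post_mean = gibbs_calibration(pred_counts, labeled)
print("posterior mean of p:", post_mean.round(3))
```

Each sweep costs on the order of C^2 operations for C causes and works only with the predicted-cause counts, so the dimension of the underlying questionnaire does not enter the sampler in this sketch.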

This review was generated by AI and checked by a human editor.