QUICK REVIEW

[Paper Review] A Systematic Comparison of Bayesian Deep Learning Robustness in Diabetic Retinopathy Tasks

Angelos Filos, Sebastian Farquhar|arXiv (Cornell University)|Dec 22, 2019

Anomaly Detection Techniques and Applications | References: 47 | Citations: 73

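One-Line Summary

This paper proposes a real-world Bayesian deep learning (BDL) benchmark based on diabetic retinopathy and systematically compares several BDL methods (MC Dropout, MFVI, Deep Ensembles, and others) on an uncertainty-based referral task. Ensemble methods often outperform MFVI, and the results indicate that UCI benchmarks can give a misleading ranking of methods.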

ABSTRACT

Evaluation of Bayesian deep learning (BDL) methods is challenging. We often seek to evaluate the methods' robustness and scalability, assessing whether new tools give 'better' uncertainty estimates than old ones. These evaluations are paramount for practitioners when choosing BDL tools on top of which they build their applications. Current popular evaluations of BDL methods, such as the UCI experiments, are lacking: Methods that excel with these experiments often fail when used in application such as medical or automotive, suggesting a pertinent need for new benchmarks in the field. We propose a new BDL benchmark with a diverse set of tasks, inspired by a real-world medical imaging application on diabetic retinopathy diagnosis. Visual inputs (512x512 RGB images of retinas) are considered, where model uncertainty is used for medical pre-screening, i.e. to refer patients to an expert when model diagnosis is uncertain. Methods are then ranked according to metrics derived from expert-domain to reflect real-world use of model uncertainty in automated diagnosis. We develop multiple tasks that fall under this application, including out-of-distribution detection and robustness to distribution shift. We then perform a systematic comparison of well-tuned BDL techniques on the various tasks. From our comparison we conclude that some current techniques which solve benchmarks such as UCI 'overfit' their uncertainty to the dataset; when evaluated on our benchmark these underperform in comparison to simpler baselines. The code for the benchmark, its baselines, and a simple API for evaluating new BDL tools are made available at https://github.com/oatml/bdl-benchmarks.

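Motivation and Objectives

Motivates the need for realistic BDL benchmarks, focusing on medical diagnosis beyond toy datasets such as UCI, where model uncertainty triggers referral to an expert.
Develops a diabetic retinopathy benchmark that reproduces a downstream referral task for evaluating uncertainty under real-world constraints.
Compares well-tuned Bayesian deep learning techniques in both in-distribution and distribution-shift scenarios, assessing scalability and reliability.
Provides open-source benchmark code and an evaluation API to speed up the development of new BDL tools.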

Proposed Method

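Builds a diabetic retinopathy benchmark from the Kaggle DR Detection data, recasting it as a binary "sight-threatening DR" task; of 120,701 images, 35,126 are used for training and 53,576 for testing.
Preprocesses images to 512x512 RGB and applies normalization and affine data augmentation.
Defines a referral-based downstream task using an uncertainty threshold, simulating referral to an expert and the allocation of expert resources.
Measures uncertainty with predictive entropy and compares methods as a function of the fraction of data retained (the complement of the referral rate).
Implements and tunes the following baselines: Monte Carlo (MC) Dropout, mean-field variational inference (MFVI), Deep Ensembles, a deterministic baseline, and Ensemble MC Dropout.
Trains each method with 9 seeds for statistical stability and compares the methods on in-distribution Kaggle data and on out-of-distribution APTOS 2019 data.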
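The uncertainty-threshold referral described above can be sketched as follows. This is a minimal illustration and not the benchmark's own API: the function names `predictive_entropy` and `refer_to_expert`, the toy probabilities, and the threshold of 0.5 nats are all assumptions made for the example. It assumes each method yields T stochastic predictions of P(sight-threatening DR) per image, for instance from T dropout passes or T ensemble members.

```python
import numpy as np

def predictive_entropy(mc_probs):
    """Entropy (in nats) of the mean predicted probability, binary case.

    mc_probs: array of shape (T, N) with P(y=1 | x) from T stochastic
    forward passes (or T ensemble members) for N images.
    """
    # Average over the T samples, then clip to avoid log(0).
    p = np.clip(mc_probs.mean(axis=0), 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def refer_to_expert(mc_probs, threshold):
    """Boolean mask that is True where a case would be referred."""
    return predictive_entropy(mc_probs) > threshold

# Toy probabilities, invented for illustration (not data from the paper).
mc_probs = np.array([[0.95, 0.10, 0.55, 0.40],
                     [0.90, 0.05, 0.45, 0.60],
                     [0.97, 0.08, 0.60, 0.35]])

# Hypothetical threshold; the maximum binary entropy is log(2) ~ 0.693 nats.
referred = refer_to_expert(mc_probs, threshold=0.5)
print(referred)
```

In this toy case the two confident images are kept for automated diagnosis, and the two images whose mean probability sits near 0.5 are referred.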

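Experimental Results

Research Questions

RQ1: How do different Bayesian deep learning techniques calibrate and make use of predictive uncertainty in a medical DR diagnosis task?
RQ2: Do uncertainty-aware methods maintain performance when part of the data is referred to an expert, across different referral rates?
RQ3: How well does performance on popular BDL benchmarks (e.g., UCI) translate to a realistic medical benchmark with distribution shift?
RQ4: Which methods scale and generalize to out-of-distribution medical imaging data?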

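Key Findings

Ensemble-based methods and MC Dropout achieve consistently higher AUC and accuracy than MFVI and the deterministic baseline as the referral rate increases.
When 100% of the data is retained, all methods converge to similar performance; once cases are referred, Ensemble MC Dropout and the MC Dropout variants improve more steeply, indicating better uncertainty estimates.
On the held-out Kaggle test data, Ensemble MC Dropout achieves the highest AUC and accuracy at specific referral rates (e.g., at 50% retained: AUC 88.1±1.2, accuracy 92.4±0.9).
Mean-field VI (MFVI) degrades more readily under distribution shift (APTOS 2019) than MC Dropout and Deep Ensembles, highlighting differences in robustness.
The study argues that uncertainty estimates overfitted to simple benchmarks such as UCI can mislead the ranking of methods on large-scale real-world tasks.
The benchmark and API have been released publicly to support the evaluation of new BDL tools.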
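Why methods look alike at 100% retention yet separate under referral can be seen in a small sketch. It is illustrative only: `retained_accuracy` is a hypothetical helper, and the correctness flags and uncertainty scores are invented toy values, not results from the paper. Two uncertainty scores are paired with the same predictions; only the ranking of cases differs.

```python
import numpy as np

def retained_accuracy(correct, uncertainty, retain_fraction):
    """Accuracy over the `retain_fraction` of cases with lowest uncertainty."""
    n_keep = max(1, int(round(retain_fraction * len(correct))))
    # Keep the most certain cases; the rest are treated as referred.
    keep = np.argsort(uncertainty, kind="stable")[:n_keep]
    return float(correct[keep].mean())

# Hypothetical toy data: 1 marks a correct prediction, 0 an error.
correct = np.array([1, 1, 1, 0, 1, 0])
# "informative" gives the two errors the highest uncertainty;
# "uninformative" gives high uncertainty to correct cases instead.
informative = np.array([0.1, 0.2, 0.3, 0.9, 0.4, 0.8])
uninformative = np.array([0.9, 0.8, 0.1, 0.2, 0.7, 0.3])

for name, u in [("informative", informative), ("uninformative", uninformative)]:
    full = retained_accuracy(correct, u, 1.0)
    half = retained_accuracy(correct, u, 0.5)
    print(f"{name}: 100% retained = {full:.2f}, 50% retained = {half:.2f}")
```

With everything retained, the uncertainty ranking plays no role, so both scores give the same accuracy. At 50% retention the informative score refers the errors and accuracy rises, while the uninformative score refers correct cases and accuracy falls.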

This review was generated by AI and checked by a human editor.