QUICK REVIEW

[Paper Review] Benchmarking Uncertainty Disentanglement: Specialized Uncertainties for Specialized Tasks

Bálint Mucsányi, Michael Kirchhof|arXiv (Cornell University)|Feb 29, 2024

Complex Systems and Decision Making | Citations: 10

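One-line summary

This paper benchmarks 12 uncertainty estimators across seven practical tasks on ImageNet and evaluates whether uncertainty disentanglement based on information-theoretic and Bregman decompositions is achievable in practice. It concludes that disentanglement is generally not achieved and that task-specific performance takes priority when choosing an estimator.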

ABSTRACT

Uncertainty quantification, once a singular task, has evolved into a spectrum of tasks, including abstained prediction, out-of-distribution detection, and aleatoric uncertainty quantification. The latest goal is disentanglement: the construction of multiple estimators that are each tailored to one and only one source of uncertainty. This paper presents the first benchmark of uncertainty disentanglement. We reimplement and evaluate a comprehensive range of uncertainty estimators, from Bayesian over evidential to deterministic ones, across a diverse range of uncertainty tasks on ImageNet. We find that, despite recent theoretical endeavors, no existing approach provides pairs of disentangled uncertainty estimators in practice. We further find that specialized uncertainty tasks are harder than predictive uncertainty tasks, where we observe saturating performance. Our results provide both practical advice for which uncertainty estimators to use for which specific task, and reveal opportunities for future research toward task-centric and disentangled uncertainties. All our reimplementations and Weights & Biases logs are available at https://github.com/bmucsanyi/untangle.

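Motivation and objectives

Motivates the need to separate predictive uncertainty into task-specific components (aleatoric vs. epistemic uncertainty).
Evaluates whether current uncertainty estimators truly separate these components in large-scale practice.
Characterizes which estimators are best suited to different practical tasks (ID/OOD, abstention, etc.).
Provides guidance for developing task-centric uncertainty estimators and for reproducible benchmarks.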

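Proposed method

Reimplements 12 uncertainty estimators as plug-and-play modules and evaluates them on seven tasks on ImageNet-1k.
Classifies estimators as distributional (producing a distribution q(f) over class probabilities) or deterministic (producing a scalar uncertainty output).
Applies two disentanglement paradigms, the information-theoretic (IT) decomposition and the Bregman decomposition, to obtain pairs of aleatoric and epistemic components.
Aggregates distributional outputs with eight aggregation strategies to obtain scalar uncertainties where needed.
Repeats the experiments on CIFAR-10 to check robustness and the effect of dataset size.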
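
To make the information-theoretic (IT) decomposition above concrete, here is a minimal sketch in plain Python. It assumes the standard formulation, in which the entropy of the averaged prediction (total uncertainty) splits into the expected entropy of the individual predictions (aleatoric) plus their mutual information (epistemic). The function names and the input format (a list of probability vectors, one per ensemble member or dropout sample) are illustrative assumptions, not taken from the paper's code.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


def it_decomposition(member_probs):
    """Split the predictive uncertainty of one input into two parts.

    member_probs: list of M probability vectors, e.g. one per ensemble
    member or dropout sample (hypothetical input format).
    Returns (total, aleatoric, epistemic), where total = aleatoric + epistemic.
    """
    m = len(member_probs)
    n_classes = len(member_probs[0])
    # Average the member predictions class by class.
    mean_probs = [sum(p[c] for p in member_probs) / m for c in range(n_classes)]
    total = entropy(mean_probs)                             # entropy of the mean
    aleatoric = sum(entropy(p) for p in member_probs) / m   # expected entropy
    epistemic = total - aleatoric                           # mutual information
    return total, aleatoric, epistemic


# Members agree on an ambiguous prediction: all uncertainty is aleatoric.
agree = [[0.5, 0.5], [0.5, 0.5]]
# Members are individually confident but disagree: all uncertainty is epistemic.
disagree = [[1.0, 0.0], [0.0, 1.0]]

print(it_decomposition(agree))
print(it_decomposition(disagree))
```

The two toy inputs have the same total uncertainty but opposite splits, which is the behaviour a disentangled pair of estimators would need to show on real data.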

Figure 1 : Six out of seven distributional methods exhibit a severely high rank correlation between the information-theoretical aleatoric and epistemic components when evaluated on ImageNet-ReaL. These methods violate a necessary condition of uncertainty disentanglement.

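Experimental results

Research questions

RQ1: Do current uncertainty estimators provide truly disentangled aleatoric and epistemic components in practice?
RQ2: Which estimators are best suited to specific practical tasks (e.g., OOD detection, abstention, correctness prediction)?
RQ3: How do the IT and Bregman decompositions behave on large versus small datasets?
RQ4: Do conclusions about disentanglement transfer across datasets and tasks?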

Key findings

Method	Correctness	Abstention	Log Prob.	Brier	Aleatoric Uncertainty	ECE	OOD
Deep Ensemble	2	1	1	1	1	4	6
Dropout	1	2	2	2	2	3	2
Baseline	6	3	5	3	3	7	9
SNGP	5	8	4	6	7	1	4
GP	4	6	3	4	6	2	5
Mahalanobis	11	11	–	–	11	–	1
Shallow Ensemble	3	5	6	5	5	5	3
Laplace	7	7	7	7	9	6	11
Correctness Pred.	9	9	9	9	8	9	7
Loss Prediction	10	10	–	–	10	–	8

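Disentanglement is generally not achieved in practice: with most IT and Bregman decompositions, the aleatoric and epistemic estimates of the seven distributional methods are highly correlated on ImageNet-ReaL.
OOD detection benefits from specialized density-based methods (e.g., Mahalanobis), but these do not transfer to other tasks (aleatoric or predictive uncertainty).
Aleatoric uncertainty is hard to measure across methods; deep ensembles and dropout align comparatively better with ground-truth human uncertainty (ReaL) than many modern density-based methods.
Correctness prediction performance saturates across methods, with dropout and deep ensembles giving robust results, whereas specialized OOD detectors (e.g., Mahalanobis) generalize poorly to in-distribution correctness tasks.
No single method wins across tasks; when their cost is acceptable, dropout and deep ensembles are good general-purpose baselines.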
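
The correlation finding in the list above can be checked with a rank correlation between the per-sample aleatoric and epistemic scores. The following is a minimal sketch of Spearman's rank correlation in plain Python; the helper names and the toy score lists are hypothetical and do not come from the benchmark. A value near 1 means the two components order the samples almost identically, which violates a necessary condition for disentanglement.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-sample scores from one estimator.
aleatoric_scores = [0.10, 0.40, 0.90, 0.30]
epistemic_scores = [0.02, 0.05, 0.70, 0.04]
print(spearman(aleatoric_scores, epistemic_scores))
```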

Figure 2 : Mahalanobis—a direct OOD detector, dropout, and shallow ensembles distinguish ID and OOD samples considerably better (AUROC $\geq 0.728$ ) than the baseline (AUROC $=0.674$ ). OOD samples are perturbed by ImageNet-C corruptions of severity level two.
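
For readers unfamiliar with the AUROC figures quoted in the caption, a minimal sketch of how such a score can be computed from uncertainty values is given below. It uses the pairwise (Mann-Whitney) formulation: the fraction of ID/OOD pairs in which the OOD sample receives the higher uncertainty, with ties counted as one half. The function name and toy scores are assumptions for illustration; the benchmark's own evaluation code may differ.

```python
def auroc(id_scores, ood_scores):
    """AUROC for separating OOD from ID samples by an uncertainty score.

    Equals the probability that a randomly chosen OOD sample has a higher
    uncertainty than a randomly chosen ID sample (ties count as 0.5).
    Quadratic in the number of samples, so only suitable for small inputs.
    """
    wins = 0.0
    for o in ood_scores:
        for i in id_scores:
            if o > i:
                wins += 1.0
            elif o == i:
                wins += 0.5
    return wins / (len(id_scores) * len(ood_scores))


# Hypothetical uncertainty values: higher should indicate "more likely OOD".
id_uncertainty = [0.1, 0.2, 0.3, 0.6]
ood_uncertainty = [0.4, 0.5, 0.7, 0.9]
print(auroc(id_uncertainty, ood_uncertainty))
```

A score of 0.5 corresponds to chance and 1.0 to perfect separation.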


This review was generated by AI and checked by a human editor.