QUICK REVIEW

[Paper Review] Ensemble Distribution Distillation

Andrey Malinin, Bruno Mlodozeniec|arXiv (Cornell University)|Apr 30, 2019

Anomaly Detection Techniques and Applications | References: 37 | Cited by: 82

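One-Sentence Summary

Ensemble Distribution Distillation (EnD2) distills the distribution of an ensemble's predictions into a single Prior Network that parameterizes a Dirichlet distribution. The distilled model retains information about the ensemble's diversity, which improves uncertainty estimation and out-of-distribution (OOD) detection, and it approaches the ensemble's classification performance on CIFAR-10, CIFAR-100, and TinyImageNet.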

ABSTRACT

Ensembles of models often yield improvements in system performance. These ensemble approaches have also been empirically shown to yield robust measures of uncertainty, and are capable of distinguishing between different forms of uncertainty. However, ensembles come at a computational and memory cost which may be prohibitive for many applications. There has been significant work done on the distillation of an ensemble into a single model. Such approaches decrease computational cost and allow a single model to achieve an accuracy comparable to that of an ensemble. However, information about the diversity of the ensemble, which can yield estimates of different forms of uncertainty, is lost. This work considers the novel task of Ensemble Distribution Distillation (EnD2): distilling the distribution of the predictions from an ensemble, rather than just the average prediction, into a single model. EnD2 enables a single model to retain both the improved classification performance of ensemble distillation as well as information about the diversity of the ensemble, which is useful for uncertainty estimation. A solution for EnD2 based on Prior Networks, a class of models which allow a single neural network to explicitly model a distribution over output distributions, is proposed in this work. The properties of EnD2 are investigated on both an artificial dataset, and on the CIFAR-10, CIFAR-100 and TinyImageNet datasets, where it is shown that EnD2 can approach the classification performance of an ensemble, and outperforms both standard DNNs and Ensemble Distillation on the tasks of misclassification and out-of-distribution input detection.

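Motivation and Objectives

Explain why ensembles improve both accuracy and uncertainty estimation, yet carry a computational and memory cost that can be prohibitive.
Define Ensemble Distribution Distillation (EnD2): distilling an ensemble into a single model while preserving information about the ensemble's diversity.
Propose a solution based on Prior Networks, which let a single network model a distribution over output distributions.
Evaluate EnD2 on an artificial dataset and on standard vision datasets, comparing it with Ensemble Distillation (EnD) and Prior Network (PN) baselines.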

Proposed Method

Treat each ensemble member's predicted categorical distribution as a sample from an implicit distribution over output distributions.
Use a Prior Network to parameterize a Dirichlet distribution over categorical distributions; the network outputs the Dirichlet concentration parameters.
Train the EnD2 model on a transfer dataset of inputs paired with the ensemble members' predictions, using temperature annealing to stabilize optimization.
Minimize the negative log-likelihood of the ensemble members' predictions under the Dirichlet predicted by the model.
Compute total, data, and knowledge uncertainty from the Dirichlet output: total uncertainty is the entropy of the expected prediction, data uncertainty is the expected entropy, and knowledge uncertainty is their difference (the mutual information).
Optionally add auxiliary data to the transfer set so the model better captures the ensemble's diversity and its behavior on out-of-distribution inputs.
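
The training objective described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation. The function names (`dirichlet_log_prob`, `end2_nll`, `linear_temperature`), the exponential link from logits to concentration parameters, the way temperature is applied to both student and ensemble logits, and the linear schedule from 10 to 1 are all assumptions made for this example.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def dirichlet_log_prob(alphas, probs, eps=1e-8):
    """Log-density of a categorical distribution `probs` under Dir(alphas)."""
    alpha0 = sum(alphas)
    log_norm = math.lgamma(alpha0) - sum(math.lgamma(a) for a in alphas)
    return log_norm + sum((a - 1.0) * math.log(p + eps)
                          for a, p in zip(alphas, probs))

def end2_nll(student_logits, ensemble_logits, temperature=1.0):
    """Average negative log-likelihood of the ensemble members' predictions
    under the Dirichlet predicted by the student for one input.

    Assumption: concentrations are exp(logit / T), and the same temperature
    softens the ensemble members' predictions.
    """
    alphas = [math.exp(z / temperature) for z in student_logits]
    targets = [softmax(member, temperature) for member in ensemble_logits]
    return -sum(dirichlet_log_prob(alphas, t) for t in targets) / len(targets)

def linear_temperature(step, total_steps, start=10.0, end=1.0):
    """Hypothetical annealing schedule: start hot, cool linearly to `end`."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)

# Usage: two ensemble members that both favour class 0.
ensemble = [[2.0, 0.0, 0.0], [2.2, 0.1, 0.0]]
for step in (0, 50, 100):
    t = linear_temperature(step, 100)
    print(step, t, end2_nll([3.0, 1.0, 1.0], ensemble, temperature=t))
```

A student whose Dirichlet mean sits near the ensemble members' predictions receives a lower loss than one whose mass is on the wrong class. A high initial temperature flattens the targets, which is one plausible reason annealing helps early optimization.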

Experimental Results

Research Questions

RQ1: Can EnD2 accurately reproduce the predictive distribution of an ensemble, including the decomposition of its uncertainty into knowledge and data components?
RQ2: Does EnD2 retain the ensemble's classification performance and improve misclassification and OOD detection compared to standard distillation?
RQ3: What is the impact of auxiliary data on EnD2's calibration, negative log-likelihood (NLL), and OOD detection performance?
RQ4: How does EnD2 compare to traditional Ensemble Distillation and Prior Networks on CIFAR-10, CIFAR-100, and TinyImageNet?

Key Findings

EnD2 largely preserves the ensemble's classification performance and improves misclassification detection compared to individual models.
On in-domain data, EnD2 decomposes total uncertainty into data and knowledge uncertainty in a way that closely follows the ensemble.
When auxiliary data is used, EnD2 generally matches or exceeds Ensemble Distillation in OOD detection.
EnD2 often improves calibration as measured by negative log-likelihood (NLL) and expected calibration error (ECE), though the gains vary by dataset and by whether auxiliary data is used.
Prior Networks trained with auxiliary data alone underperform the ensemble-based methods in prediction rejection ratio (PRR) and in some calibration metrics, which highlights EnD2's advantage in capturing ensemble diversity.
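
The uncertainty decomposition referred to in these findings can be illustrated with a short sketch. It computes total, data, and knowledge uncertainty in two ways: from an ensemble's member predictions, and in closed form from Dirichlet concentration parameters. The function names and the small `digamma` helper (a recurrence plus asymptotic series, used only to avoid a dependency) are assumptions for this example rather than code from the paper.

```python
import math

def entropy(probs):
    """Shannon entropy in nats; zero-probability classes contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def ensemble_uncertainties(member_probs):
    """Return (total, data, knowledge) for a list of member predictions."""
    n = len(member_probs)
    mean = [sum(col) / n for col in zip(*member_probs)]
    total = entropy(mean)                             # entropy of the mean
    data = sum(entropy(p) for p in member_probs) / n  # mean of the entropies
    return total, data, total - data                  # difference = mutual information

def digamma(x):
    """Digamma for x > 0: shift x up to >= 6, then use the asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 * inv - series

def dirichlet_uncertainties(alphas):
    """Return (total, data, knowledge) for Dir(alphas) in closed form."""
    alpha0 = sum(alphas)
    mean = [a / alpha0 for a in alphas]
    total = entropy(mean)
    # Expected entropy of a categorical drawn from the Dirichlet.
    data = -sum(m * (digamma(a + 1.0) - digamma(alpha0 + 1.0))
                for m, a in zip(mean, alphas))
    return total, data, total - data

# Usage: confident but disagreeing members give pure knowledge uncertainty.
print(ensemble_uncertainties([[1.0, 0.0], [0.0, 1.0]]))
# A flat Dirichlet is uncertain about the prediction itself; a sharp one is not.
print(dirichlet_uncertainties([1.0, 1.0]))
print(dirichlet_uncertainties([1000.0, 1000.0]))
```

Both functions return the same three quantities, so a distilled model's Dirichlet output can be compared directly against the ensemble it was trained on.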


This review was generated by AI and checked by human editors.