QUICK REVIEW

[Paper Review] Hydra: Preserving Ensemble Diversity for Model Distillation

Linh Tran, Bastiaan S. Veeling|arXiv (Cornell University)|Jan 14, 2020

Adversarial Robustness in Machine Learning | References: 25 | Citations: 36

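One-Sentence Summary

Hydra distills an ensemble into a single shared body with multiple heads, preserving each member's predictions and uncertainty, and thereby improves both predictive performance and uncertainty quantification over standard distillation.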

ABSTRACT

Ensembles of models have been empirically shown to improve predictive performance and to yield robust measures of uncertainty. However, they are expensive in computation and memory. Therefore, recent research has focused on distilling ensembles into a single compact model, reducing the computational and memory burden of the ensemble while trying to preserve its predictive behavior. Most existing distillation formulations summarize the ensemble by capturing its average predictions. As a result, the diversity of the ensemble predictions, stemming from each member, is lost. Thus, the distilled model cannot provide a measure of uncertainty comparable to that of the original ensemble. To retain more faithfully the diversity of the ensemble, we propose a distillation method based on a single multi-headed neural network, which we refer to as Hydra. The shared body network learns a joint feature representation that enables each head to capture the predictive behavior of each ensemble member. We demonstrate that with a slight increase in parameter count, Hydra improves distillation performance on classification and regression settings while capturing the uncertainty behavior of the original ensemble over both in-domain and out-of-distribution tasks.

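Research Motivation and Objectives

The work is motivated by the need to retain the ensemble's uncertainty after distillation.
It proposes a multi-head distillation architecture that preserves the behavior of each ensemble member.
It evaluates Hydra against standard distillation and existing networks on classification and regression tasks.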

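Proposed Method

Introduces Hydra: one shared body and M heads, one head per ensemble member.
Each head mimics a specific ensemble member; the body provides a shared feature representation.
For classification, the average KL divergence between each head and its corresponding ensemble member is minimized (for regression, the KL divergence between Gaussian outputs is minimized).
A temperature T is used to "heat up" (soften) the distributions during training, improving cross-support between them.
Training proceeds in two stages: first a head is trained to mimic the ensemble average (the Hinton head), then all heads are trained to match the individual members.
Hydra is compared with Knowledge Distillation and Prior Networks across datasets, reporting NLL, Brier score (BS), accuracy (ACC), and model uncertainty (MU).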
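The classification objective described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the one-layer ReLU body, the linear heads, the KL direction (member to head), and every function name (`hydra_forward`, `hydra_loss`, `average_target_loss`) are hypothetical choices made for this example, since the review does not specify them.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a larger temperature gives a flatter distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def linear(x, weights, bias):
    """Dense layer; `weights` holds one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def hydra_forward(x, body, heads):
    """Shared body (assumed: one ReLU layer) followed by M linear heads."""
    features = [max(0.0, h) for h in linear(x, *body)]
    return [linear(features, *head) for head in heads]

def hydra_loss(head_logits, member_logits, temperature=1.0):
    """Mean over heads of KL(member_m || head_m), both softened by the temperature.
    The KL direction is an assumption of this sketch."""
    kls = [
        kl_divergence(softmax(t, temperature), softmax(s, temperature))
        for s, t in zip(head_logits, member_logits)
    ]
    return sum(kls) / len(kls)

def average_target_loss(single_head_logits, member_logits, temperature=1.0):
    """Stage-1 style target: one head matched to the ensemble-average distribution."""
    member_probs = [softmax(t, temperature) for t in member_logits]
    mean_probs = [sum(col) / len(col) for col in zip(*member_probs)]
    return kl_divergence(mean_probs, softmax(single_head_logits, temperature))

# Toy example: 2 input features, 2 classes, M = 2 heads (all numbers are made up).
body = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
heads = [
    ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]),
    ([[0.5, 0.5], [-0.5, -0.5]], [0.0, 0.0]),
]
member_logits = [[-1.0, 1.0], [0.0, 0.0]]  # stand-in teacher outputs, one per member

head_logits = hydra_forward([1.0, 2.0], body, heads)
loss = hydra_loss(head_logits, member_logits, temperature=2.0)
```

In this toy setup the first head already reproduces its member exactly, so the loss comes entirely from the second head. A two-stage schedule would first minimize something like `average_target_loss` for a single head and then switch to `hydra_loss` over all heads.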

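Experimental Results

Research Questions

RQ1: Does Hydra preserve ensemble diversity during distillation more faithfully than averaging-based distillation?
RQ2: Does Hydra improve predictive performance and uncertainty quantification on in-domain and out-of-domain data?
RQ3: What trade-off does Hydra make between parameter efficiency and fidelity to the ensemble's diversity?
RQ4: How does Hydra affect classification and regression tasks?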

Key Findings

Model	ACC (MNIST)	NLL (MNIST)	BS (MNIST)	MU (MNIST)	ACC (CIFAR-10)	NLL (CIFAR-10)	BS (CIFAR-10)	MU (CIFAR-10)
Ensemble (M=50)	0.9851	0.0439	-0.9780	9.97e-06	0.9226	0.2392	-0.9033	0.1055
Prior Networks	0.9842	0.0521	-0.9285	0.1158	0.8731	0.4392	-0.8231	0.0280
Knowledge distillation	0.9843	0.0497	-0.9764	N/A	0.8933	0.3598	-0.8373	N/A
Hydra (head=[100,100,10])	0.9857	0.0465	-0.9776	2.28e-05	0.8992	0.3179	-0.8468	0.0074

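Hydra approaches the ensemble's predictive performance on both datasets: it slightly exceeds the ensemble's accuracy on MNIST (0.9857 vs. 0.9851) but trails it on CIFAR-10 (0.8992 vs. 0.9226).
On MNIST, Hydra reaches NLL 0.0465 and Brier score −0.9776, close to the ensemble's NLL 0.0439 and Brier score −0.9780, with MU 2.28e-05 (ensemble: 9.97e-06).
On CIFAR-10, Hydra reaches ACC 0.8992 and NLL 0.3179, closer to the ensemble than the other distillation methods, with MU 0.0074.
Hydra is closer to the ensemble than Knowledge Distillation and Prior Networks on accuracy, NLL, and Brier score on both datasets. Its MU is far closer to the ensemble's than Prior Networks' on MNIST (2.28e-05 vs. 0.1158), although on CIFAR-10 Prior Networks' MU (0.0280) is nearer the ensemble's 0.1055 than Hydra's 0.0074.
Hydra offers a practical balance between a modest increase in parameter count and fidelity to the ensemble's diversity.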
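To make the NLL and MU columns more concrete, the sketch below computes both from per-head class probabilities. It rests on an assumption: the review does not define MU, so this example uses a common decomposition in which model uncertainty is the entropy of the averaged prediction minus the average entropy of the individual predictions. The function names and toy probabilities are hypothetical.

```python
import math

def entropy(p, eps=1e-12):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(pi * math.log(pi + eps) for pi in p)

def mean_prediction(head_probs):
    """Average the class probabilities over all heads (or ensemble members)."""
    return [sum(col) / len(col) for col in zip(*head_probs)]

def nll(head_probs, label, eps=1e-12):
    """Negative log-likelihood of the true label under the averaged prediction."""
    return -math.log(mean_prediction(head_probs)[label] + eps)

def model_uncertainty(head_probs):
    """Assumed definition: total uncertainty minus expected data uncertainty.
    It is near zero when heads agree and grows as they disagree."""
    total = entropy(mean_prediction(head_probs))
    expected = sum(entropy(p) for p in head_probs) / len(head_probs)
    return total - expected

# Two toy cases with M = 2 heads and 2 classes (made-up probabilities).
agree = [[0.9, 0.1], [0.9, 0.1]]      # heads agree: little model uncertainty
disagree = [[0.9, 0.1], [0.1, 0.9]]   # heads disagree: high model uncertainty

mu_agree = model_uncertainty(agree)
mu_disagree = model_uncertainty(disagree)
nll_agree = nll(agree, label=0)
```

Under this definition a model that outputs a single distribution has no disagreement to measure, which would be consistent with the N/A entries for Knowledge distillation in the table.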


This review was generated by AI and checked by a human editor.