QUICK REVIEW

[Paper Review] Towards Understanding Ensemble, Knowledge Distillation and Self-Distillation in Deep Learning

Zeyuan Allen-Zhu, Yuanzhi Li | arXiv (Cornell University) | 2020-12-17

Adversarial Robustness in Machine Learning | References: 92 | Citations: 151

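One-Line Summary

This paper shows, both theoretically and empirically, that an ensemble of neural networks with the same architecture, trained on the same data, can substantially improve test accuracy when the data has a multi-view structure, and that this gain can be distilled into a single model. It also analyzes self-distillation as an implicit combination of ensembling and distillation.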

ABSTRACT

We formally study how ensemble of deep learning models can improve test accuracy, and how the superior performance of ensemble can be distilled into a single model using knowledge distillation. We consider the challenging case where the ensemble is simply an average of the outputs of a few independently trained neural networks with the SAME architecture, trained using the SAME algorithm on the SAME data set, and they only differ by the random seeds used in the initialization. We show that ensemble/knowledge distillation in Deep Learning works very differently from traditional learning theory (such as boosting or NTKs, neural tangent kernels). To properly understand them, we develop a theory showing that when data has a structure we refer to as ``multi-view'', then ensemble of independently trained neural networks can provably improve test accuracy, and such superior test accuracy can also be provably distilled into a single model by training a single model to match the output of the ensemble instead of the true label. Our result sheds light on how ensemble works in deep learning in a way that is completely different from traditional theorems, and how the ``dark knowledge'' is hidden in the outputs of the ensemble and can be used in distillation. In the end, we prove that self-distillation can also be viewed as implicitly combining ensemble and knowledge distillation to improve test accuracy.

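Research Motivation and Goals

Explain why ensemble methods improve test accuracy in deep learning, going beyond traditional learning theory.
Introduce and formalize a multi-view data setting in which the benefit of ensembling can be proven.
Show that the ensemble's gain can be distilled into a single model trained on the same data.
Prove that self-distillation improves performance by effectively combining ensembling and distillation.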

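Proposed Method

Theoretical analysis of a two-layer convolutional neural network with a smoothed ReLU activation.
Definitions of the multi-view and single-view data distributions, together with the corresponding data-generating process.
Shows that a single model trained by gradient descent reaches perfect training accuracy, yet on the distribution D it still misclassifies roughly half of the single-view test examples, so it is close to random guessing on that portion of the data.
Proves that an ensemble of independently trained models markedly improves test accuracy.
Shows that training a new model to match the ensemble's outputs (knowledge distillation) yields a similar improvement in test accuracy.
Argues that self-distillation acts as an implicit ensemble plus distillation and yields an additional gain.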
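The two operations the method list relies on, averaging the outputs of independently trained models and training a student to match that average, can be sketched in a few lines of NumPy. This is a minimal illustration and not the paper's training setup: the logits are random placeholders, the helper names (`softmax`, `ensemble_logits`, `distillation_loss`) are hypothetical, and the temperature `tau` is an assumption added for illustration.

```python
import numpy as np

def softmax(logits, tau=1.0):
    """Row-wise softmax; tau is an illustrative temperature (an assumption)."""
    z = logits / tau
    z = z - z.max(axis=-1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ensemble_logits(member_logits):
    """Average the outputs of the ensemble members.

    member_logits has shape (num_models, num_examples, num_classes).
    """
    return np.mean(member_logits, axis=0)

def distillation_loss(student_logits, teacher_logits, tau=1.0):
    """Cross-entropy between the teacher's soft labels and the student's predictions."""
    p_teacher = softmax(teacher_logits, tau)
    log_p_student = np.log(softmax(student_logits, tau))
    return float(-(p_teacher * log_p_student).sum(axis=-1).mean())

# Placeholder outputs: 3 hypothetical models, 4 examples, 5 classes.
rng = np.random.default_rng(0)
member_logits = rng.normal(size=(3, 4, 5))
teacher = ensemble_logits(member_logits)   # ensemble output, shape (4, 5)
student = rng.normal(size=(4, 5))          # an untrained student's outputs

loss_random = distillation_loss(student, teacher)
loss_matched = distillation_loss(teacher, teacher)  # student that copies the teacher
print(loss_random, loss_matched)
```

A student that reproduces the ensemble's outputs exactly attains the smallest possible value of this loss, which is why minimizing it pushes the student toward the ensemble's soft labels instead of the one-hot true labels.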

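Experimental Results

Research Questions

RQ1: In the multi-view data setting, how does averaging the outputs of independently trained networks with the same architecture affect test accuracy?
RQ2: Can the ensemble's performance gain be reproduced by training a single model, on the same training data, to match the ensemble (knowledge distillation)?
RQ3: What is the mechanism behind dark knowledge, and what role does it play in distillation and self-distillation?
RQ4: How does self-distillation in deep learning relate to implicit ensembling and distillation?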

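Key Results

In the proposed setting, a single model reaches perfect training accuracy, but its test error lies between 0.49μ and 0.51μ, where μ is the fraction of single-view data.
An ensemble of L independently trained models achieves test error ≤ 0.01μ with high probability.
A separate model trained to match the ensemble's output (knowledge distillation) also achieves test error ≤ 0.01μ.
Self-distillation (distilling from another model of the same size) can achieve test error ≤ 0.26μ.
Knowledge distillation over random features (NTK) fails to reproduce the ensemble's gain, which highlights the gap between the NTK/random-feature view and the feature learning that occurs in actual deep learning.
Overall, the results stress that the benefit of ensembling and knowledge distillation in deep learning stems from feature-learning dynamics under multi-view data and is not explained by traditional ensemble theory alone.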
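The gap between the single-model and ensemble results can be illustrated with a deliberately crude toy simulation. It is a cartoon under stated assumptions, not the paper's construction or proof: each class is assumed to have exactly two views, each model is assumed to pick up one of them uniformly at random (standing in for the random seed), a μ fraction of test examples shows only one view, and the ensemble is counted as correct whenever at least one member learned the shown view. The function name `toy_test_error` and all parameter values are hypothetical.

```python
import numpy as np

def toy_test_error(num_models, mu=0.2, num_classes=10, num_test=20000, seed=0):
    """Test error of a toy 'ensemble' under the cartoon assumptions stated above."""
    rng = np.random.default_rng(seed)
    # learned[m, c]: which of the two views (0 or 1) model m picked up for class c.
    learned = rng.integers(0, 2, size=(num_models, num_classes))
    labels = rng.integers(0, num_classes, size=num_test)
    # A mu fraction of test examples is single-view; the rest show both views.
    single_view = rng.random(num_test) < mu
    # For single-view examples, the one view that is actually present.
    shown_view = rng.integers(0, 2, size=num_test)
    # A member recognises a single-view example only if it learned the shown view.
    member_hits = learned[:, labels] == shown_view  # shape (num_models, num_test)
    # Multi-view examples are always classified correctly in this cartoon.
    correct = ~single_view | member_hits.any(axis=0)
    return float(1.0 - correct.mean())

single_err = toy_test_error(num_models=1)     # expected to be close to 0.5 * mu
ensemble_err = toy_test_error(num_models=10)  # expected to be far smaller
print(single_err, ensemble_err)
```

With μ = 0.2, the single toy model misses about half of the single-view examples, so its error sits near 0.5μ = 0.1, in line with the 0.49μ to 0.51μ range above. The toy ensemble covers both views for almost every class and its error drops sharply. The specific constants in the paper's bounds (0.01μ, 0.26μ) come from its analysis and are not reproduced by this sketch.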

This review was generated by AI and reviewed by a human editor.