QUICK REVIEW

[Paper Review] Wasserstein Distance Rivals Kullback-Leibler Divergence for Knowledge Distillation

Jiayi Lv, Hee-Chul Yang | arXiv (Cornell University) | Dec 11, 2024

Brain Tumor Detection and Classification | Citations: 5

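One-line summary

This paper presents WKD, a Wasserstein distance (WD) based approach to knowledge distillation. It enables logit distillation that exploits interrelations (IRs) among categories (WKD-L) and continuous distribution matching on intermediate features (WKD-F), and it outperforms KL-Div variants and state-of-the-art distillation methods on ImageNet, CIFAR-100, and MS-COCO.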

ABSTRACT

Since pioneering work of Hinton et al., knowledge distillation based on Kullback-Leibler Divergence (KL-Div) has been predominant, and recently its variants have achieved compelling performance. However, KL-Div only compares probabilities of the corresponding category between the teacher and student while lacking a mechanism for cross-category comparison. Besides, KL-Div is problematic when applied to intermediate layers, as it cannot handle non-overlapping distributions and is unaware of geometry of the underlying manifold. To address these downsides, we propose a methodology of Wasserstein Distance (WD) based knowledge distillation. Specifically, we propose a logit distillation method called WKD-L based on discrete WD, which performs cross-category comparison of probabilities and thus can explicitly leverage rich interrelations among categories. Moreover, we introduce a feature distillation method called WKD-F, which uses a parametric method for modeling feature distributions and adopts continuous WD for transferring knowledge from intermediate layers. Comprehensive evaluations on image classification and object detection have shown (1) for logit distillation WKD-L outperforms very strong KL-Div variants; (2) for feature distillation WKD-F is superior to the KL-Div counterparts and state-of-the-art competitors. The source code is available at https://peihuali.org/WKD

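Motivation and objectives

Motivates improving knowledge distillation by going beyond category-wise KL divergence and exploiting cross-category IRs.
Proposes WD-based distillation methods for logits (WKD-L) and for intermediate features (WKD-F).
For logit distillation, models category IRs with Centered Kernel Alignment (CKA) and converts them into WD transport costs.
For feature distillation, models intermediate-layer feature distributions as Gaussians and computes the WD between them using a Riemannian metric.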
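
The IR-to-cost step described above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation. It uses linear CKA, assumes each category is summarized by a (feature_dim, n_images) matrix of teacher features so that rows are aligned across categories, and assumes the cost is simply 1 minus the similarity. The names linear_cka and ir_cost_matrix are hypothetical, and the toy features are random.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between x (n, p) and y (n, q); the n rows must be aligned."""
    x = x - x.mean(axis=0, keepdims=True)  # center each column
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, "fro") ** 2
    return cross / (np.linalg.norm(x.T @ x, "fro") * np.linalg.norm(y.T @ y, "fro"))

def ir_cost_matrix(category_feats):
    """Pairwise IR (similarity) matrix and a transport cost derived from it.

    category_feats: list of (feature_dim, n_images) arrays, one per category.
    Assumption: cost = 1 - IR; the paper's exact mapping may differ.
    """
    n = len(category_feats)
    ir = np.eye(n)  # CKA of a category with itself is 1
    for i in range(n):
        for j in range(i + 1, n):
            ir[i, j] = ir[j, i] = linear_cka(category_feats[i], category_feats[j])
    return ir, 1.0 - ir

# Toy data: categories 0 and 1 are near-duplicates, category 2 is unrelated.
rng = np.random.default_rng(0)
base = rng.normal(size=(64, 16))
feats = [base, base + 0.1 * rng.normal(size=(64, 16)), rng.normal(size=(64, 16))]
ir, cost = ir_cost_matrix(feats)
print(np.round(ir, 3))
print(np.round(cost, 3))
```

Related categories end up with a low transport cost and unrelated ones with a high cost, which is what lets a WD-based loss treat confusing a dog with another mammal differently from confusing it with a car.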

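Proposed method

Defines cross-category IRs via CKA computed on teacher features and converts them into WD transport costs for logit distillation.
Formulates the discrete WD between teacher and student logits as an entropy-regularized transport problem whose costs are derived from the IR-based similarities.
For logits, introduces a two-term loss that separates target from non-target categories, combining WD over the non-target categories with cross-entropy on the target.
For features, models the teacher and student distributions as Gaussians with a mean and covariance and uses the closed-form WD between Gaussians (the sum of a mean term and a covariance term).
For practicality, optionally applies spatial pyramiding, chooses between diagonal (Diag) and full covariance Gaussians, and balances the mean and covariance contributions with a gamma parameter.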
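
The entropy-regularized transport problem mentioned above is commonly solved with Sinkhorn iterations. The sketch below is a generic NumPy version under stated assumptions, not the paper's code: it works on full probability vectors rather than the target/non-target split, the 3-category cost matrix is invented for illustration, and eps and n_iters are arbitrary choices. The function names are hypothetical.

```python
import numpy as np

def sinkhorn_plan(p, q, cost, eps=0.1, n_iters=500):
    """Entropy-regularized transport plan between histograms p and q."""
    kernel = np.exp(-cost / eps)
    u = np.ones_like(p)
    v = np.ones_like(q)
    for _ in range(n_iters):
        u = p / (kernel @ v)      # match row marginals to p
        v = q / (kernel.T @ u)    # match column marginals to q
    return u[:, None] * kernel * v[None, :]

def discrete_wd(p, q, cost, **kwargs):
    """Transport cost <plan, cost> under the Sinkhorn plan."""
    return float(np.sum(sinkhorn_plan(p, q, cost, **kwargs) * cost))

def kl_div(p, q):
    return float(np.sum(p * np.log(p / q)))

# Hypothetical categories: dog, wolf, car. Dog and wolf are close, car is far.
cost = np.array([[0.0, 0.1, 1.0],
                 [0.1, 0.0, 1.0],
                 [1.0, 1.0, 0.0]])
teacher = np.array([0.8, 0.1, 0.1])
student_near = np.array([0.6, 0.3, 0.1])  # extra mass on wolf
student_far = np.array([0.6, 0.1, 0.3])   # extra mass on car

wd_near = discrete_wd(teacher, student_near, cost)
wd_far = discrete_wd(teacher, student_far, cost)
print("KL:", kl_div(teacher, student_near), kl_div(teacher, student_far))
print("WD:", wd_near, wd_far)
```

Both students have the same KL divergence to the teacher, because KL only compares matching categories. The transport cost is lower for the student whose error lands on a related category, which is the cross-category behavior the method relies on.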

(a) Real-world categories exhibit rich interrelations (IRs) in feature space, e.g., dog is near other mammal while far from artifact like car. We quantify pairwise IRs as feature similarities among categories. Best viewed by zooming in.

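Experimental results

Research questions

RQ1: Does WD-based distillation, by exploiting cross-category IRs, outperform KL-Div-based methods in logit distillation?
RQ2: Does modeling intermediate-layer features as Gaussians and applying WD improve knowledge transfer over KL-Div and non-parametric methods?
RQ3: How does the IR modeling method (CKA and various kernels) affect WKD-L performance?
RQ4: How do WKD-L and WKD-F perform, individually and combined, on image classification and object detection tasks?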

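Key results

WKD-L outperforms strong KL-Div variants in logit distillation on ImageNet and CIFAR-100.
WKD-F outperforms its KL-Div counterparts in feature distillation, and Gaussian (Diag) is often preferred for robustness and efficiency.
Modeling category IRs with CKA improves WD-based logit distillation, with gains especially when paired with an RBF or linear kernel.
Combining WKD-L and WKD-F yields further gains over using either alone on classification and detection tasks.
On MS-COCO object detection, WD-based distillation shows competitive gains over KL-Div-based methods.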
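
The efficiency difference between diagonal and full covariance Gaussians can be seen from the closed form of the squared 2-Wasserstein distance between two Gaussians. The sketch below is illustrative only. The function names are hypothetical, and the weighting gamma * mean_term + cov_term is an assumption, since this review does not state exactly how gamma enters the loss.

```python
import numpy as np

def _sqrt_psd(a):
    """Square root of a symmetric positive semi-definite matrix via eigh."""
    w, v = np.linalg.eigh(a)
    return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.T

def wd_gaussian_diag(mean_t, var_t, mean_s, var_s, gamma=1.0):
    """Squared W2 between Gaussians with diagonal covariances (variance vectors)."""
    mean_term = np.sum((mean_t - mean_s) ** 2)
    cov_term = np.sum((np.sqrt(var_t) - np.sqrt(var_s)) ** 2)
    return float(gamma * mean_term + cov_term)  # gamma placement is assumed

def wd_gaussian_full(mean_t, cov_t, mean_s, cov_s, gamma=1.0):
    """Squared W2 between Gaussians with full covariance matrices."""
    mean_term = np.sum((mean_t - mean_s) ** 2)
    root_t = _sqrt_psd(cov_t)
    cross = _sqrt_psd(root_t @ cov_s @ root_t)
    cov_term = np.trace(cov_t + cov_s - 2.0 * cross)
    return float(gamma * mean_term + cov_term)  # gamma placement is assumed

# Toy teacher/student statistics (made up for illustration).
mean_t, var_t = np.array([0.0, 0.0]), np.array([1.0, 4.0])
mean_s, var_s = np.array([3.0, 4.0]), np.array([4.0, 9.0])

d_diag = wd_gaussian_diag(mean_t, var_t, mean_s, var_s)
d_full = wd_gaussian_full(mean_t, np.diag(var_t), mean_s, np.diag(var_s))
print(d_diag, d_full)
```

The diagonal version needs only element-wise operations, while the full version requires two matrix square roots, so its cost grows much faster with the feature dimension. When the covariances are diagonal, the two functions agree.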

(b) For logit distillation, discrete WD performs cross-category comparison by exploiting pairwise IRs, in contrast to KL-Div that is a category-to-category measure and lacks a mechanism to use such IRs (cf. Figure 2). For feature distillation, we use Gaussians for distribution modeling and continuo

This review was generated by AI and reviewed by a human editor.