QUICK REVIEW

[Paper Review] Graph-based Knowledge Distillation by Multi-head Attention Network

Seunghyun Lee, Byung Cheol Song|arXiv (Cornell University)|2019. 07. 04.

Advanced Neural Network Applications|References 31|Citations 39

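One-line summary

Introduces MHGD, a graph-based knowledge distillation framework that uses multi-head attention to distill dataset-embedding knowledge from the teacher network and transfer it to the student network, improving performance on CIFAR100 and TinyImageNet.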

ABSTRACT

Knowledge distillation (KD) is a technique to derive optimal performance from a small student network (SN) by distilling knowledge of a large teacher network (TN) and transferring the distilled knowledge to the small SN. Since a role of convolutional neural network (CNN) in KD is to embed a dataset so as to perform a given task well, it is very important to acquire knowledge that considers intra-data relations. Conventional KD methods have concentrated on distilling knowledge in data units. To our knowledge, any KD methods for distilling information in dataset units have not yet been proposed. Therefore, this paper proposes a novel method that enables distillation of dataset-based knowledge from the TN using an attention network. The knowledge of the embedding procedure of the TN is distilled to a graph by multi-head attention (MHA), and multi-task learning is performed to give relational inductive bias to the SN. The MHA can provide clear information about the source dataset, which can greatly improve the performance of the SN. Experimental results show that the proposed method is 7.05% higher than the SN alone for CIFAR100, which is 2.46% higher than the state-of-the-art.

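Research motivation and goals

Argues that a CNN's knowledge of intra-data relations, not only its per-sample feature vectors, should be distilled, with the aim of improving how the dataset is embedded.
Proposes a graph-based distillation method that uses multi-head attention to capture the dataset embedding procedure.
Transfers the distilled knowledge through a transfer loss in a multi-task learning setup, so that the student network inherits a relational inductive bias.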

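Proposed method

Compress the feature maps taken at two points in the network into feature vectors using KD-SVD.
Compute the relations between the front-end and back-end feature vectors with a multi-head attention network (MHAN).
Train multiple attention heads to produce graph-based relations, distilling the embedding knowledge.
Transfer the distilled graph-based knowledge to the student through multi-task learning with a transfer loss.
Guide training with softly aligned attention maps and the KL divergence between the teacher and student graphs.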
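
The steps above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes scaled dot-product attention, takes queries from the back-end vectors and keys from the front-end vectors (an assumed direction), and uses hypothetical names throughout (`attention_graph`, `kl_transfer_loss`, and the per-head projection pairs `Wq`, `Wk`). KD-SVD compression is skipped, so the inputs are already feature vectors, one per sample in a batch.

```python
import math


def softmax(row):
    # Numerically stable softmax over one list of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, v):
    # Multiply matrix W (list of rows) by vector v.
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def attention_graph(front, back, heads):
    """Build one N x N relation graph per attention head.

    front, back: lists of N feature vectors (one per sample in the batch).
    heads: list of (Wq, Wk) projection pairs; these weights are hypothetical
           placeholders for whatever the trained attention network learns.
    Assumption: queries come from back-end vectors, keys from front-end ones.
    """
    graphs = []
    for Wq, Wk in heads:
        queries = [matvec(Wq, b) for b in back]
        keys = [matvec(Wk, f) for f in front]
        scale = math.sqrt(len(queries[0]))
        graph = [
            softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
            for q in queries
        ]
        graphs.append(graph)
    return graphs


def kl_transfer_loss(teacher_graphs, student_graphs, eps=1e-12):
    # Mean KL(teacher row || student row) over all heads and all rows.
    total, rows = 0.0, 0
    for g_t, g_s in zip(teacher_graphs, student_graphs):
        for p_t, p_s in zip(g_t, g_s):
            total += sum(
                p * math.log((p + eps) / (q + eps)) for p, q in zip(p_t, p_s)
            )
            rows += 1
    return total / rows


if __name__ == "__main__":
    # Toy batch of 3 samples with 2-dimensional feature vectors.
    t_front = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    t_back = [[0.5, 0.2], [0.1, 0.9], [0.7, 0.7]]
    s_front = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.9]]
    s_back = [[0.4, 0.3], [0.2, 0.7], [0.5, 0.8]]
    identity = [[1.0, 0.0], [0.0, 1.0]]
    swap = [[0.0, 1.0], [1.0, 0.0]]
    heads = [(identity, identity), (swap, identity)]
    teacher = attention_graph(t_front, t_back, heads)
    student = attention_graph(s_front, s_back, heads)
    print(kl_transfer_loss(teacher, student))
```

In a multi-task setup, a loss of this kind would be optimized alongside the student's ordinary task loss. How the two are balanced is not stated in this review, so the sketch leaves that out.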

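Experimental results

Research questions

RQ1: Can graph-based embedding knowledge captured by multi-head attention improve KD beyond existing feature-vector-based methods?
RQ2: How does the number of attention heads affect the quality of the distilled knowledge and the performance of the SN?
RQ3: Does graph-based knowledge transfer offer an advantage over initialization-based or single-task KD methods?

Key results

MHGD improves accuracy over the SN alone by about 7% on CIFAR100 and about 4% on TinyImageNet.
MHGD outperforms KD-SVD and state-of-the-art methods (e.g., with VGG and WResNet backbones).
Increasing the number of attention heads generally improves performance up to a point, after which the gains may saturate or decline because of excessive complexity.
Multi-task learning with graph-based knowledge transfer keeps its performance gain throughout training, whereas some initialization-based KD methods do not.
The method's benefit is architecture-agnostic: it improves SN performance across VGG, MobileNet, and ResNet backbones.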

This review was generated by AI and reviewed by a human editor.