QUICK REVIEW

[Paper Review] TransKD: Transformer Knowledge Distillation for Efficient Semantic Segmentation

Ruiping Liu, Kailun Yang|arXiv (Cornell University)|2022. 02. 27.

Advanced Neural Network Applications | Citations: 27

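One-Line Summary

TransKD distills both transformer patch embeddings and feature maps from a large teacher into a compact student model for efficient semantic segmentation, cutting FLOPs by more than 85% while keeping accuracy competitive.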

ABSTRACT

Semantic segmentation benchmarks in the realm of autonomous driving are dominated by large pre-trained transformers, yet their widespread adoption is impeded by substantial computational costs and prolonged training durations. To lift this constraint, we look at efficient semantic segmentation from a perspective of comprehensive knowledge distillation and aim to bridge the gap between multi-source knowledge extractions and transformer-specific patch embeddings. We put forward the Transformer-based Knowledge Distillation (TransKD) framework which learns compact student transformers by distilling both feature maps and patch embeddings of large teacher transformers, bypassing the long pre-training process and reducing the FLOPs by >85.0%. Specifically, we propose two fundamental modules to realize feature map distillation and patch embedding distillation, respectively: (1) Cross Selective Fusion (CSF) enables knowledge transfer between cross-stage features via channel attention and feature map distillation within hierarchical transformers; (2) Patch Embedding Alignment (PEA) performs dimensional transformation within the patchifying process to facilitate the patch embedding distillation. Furthermore, we introduce two optimization modules to enhance the patch embedding distillation from different perspectives: (1) Global-Local Context Mixer (GL-Mixer) extracts both global and local information of a representative embedding; (2) Embedding Assistant (EA) acts as an embedding method to seamlessly bridge teacher and student models with the teacher's number of channels. Experiments on Cityscapes, ACDC, NYUv2, and Pascal VOC2012 datasets show that TransKD outperforms state-of-the-art distillation frameworks and rivals the time-consuming pre-training method. The source code is publicly available at https://github.com/RuipingL/TransKD.

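Research Motivation and Goals

Promote efficient transformer-based semantic segmentation by reducing the reliance on long pre-training.
Develop a comprehensive distillation framework that uses transformer-specific patch embeddings as well as feature maps.
Bridge the teacher-student gap with new modules that enable multi-source knowledge transfer.
Show that distilling patch embeddings together with feature maps improves segmentation of hard examples.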

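Proposed Method

Distill both patch embeddings and feature maps from the teacher to the student across all four transformer stages.
Introduce Patch Embedding Alignment (PEA), a learnable projection that aligns the channel dimension of the embeddings.
Use the Global-Local Context Mixer (GL-Mixer) to capture the global and local context of the embeddings being distilled.
Apply Cross Selective Fusion (CSF) to fuse cross-stage feature maps via channel attention and perform relation-based feature map distillation.
Introduce the Embedding Assistant (EA), which builds a pseudo-assistant model without an additional training stage to bridge the channel gap between teacher and student.
Combine patch embedding distillation (PEA/GL-Mixer/EA) and feature map distillation (CSF) with cross-entropy into a unified loss.
Evaluate TransKD variants on Cityscapes, ACDC, NYUv2, and Pascal VOC2012, showing improvements over KD-based baselines and competitiveness with pre-training.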
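
The unified loss described above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the helper names (`project`, `mse`, `cross_entropy`, `combined_loss`), the weights `alpha` and `beta`, and the use of mean squared error as the distillation distance are all assumptions made for illustration. CSF, GL-Mixer, and EA are omitted; a plain linear projection stands in for the channel alignment that PEA performs, and the same projection is reused for feature maps purely to keep the example short.

```python
import math

def project(tokens, weight):
    """Map each token from student channels to teacher channels (tokens @ weight).

    `weight` has shape [student_channels][teacher_channels]; it stands in for
    a learnable projection (hypothetical, fixed here for illustration).
    """
    return [[sum(t * w for t, w in zip(token, col)) for col in zip(*weight)]
            for token in tokens]

def mse(a, b):
    """Mean squared error between two equally shaped token matrices."""
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)

def cross_entropy(logits, label):
    """Cross-entropy of one pixel's class logits against its integer label."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[label]

def combined_loss(logits, label, student_stages, teacher_stages, weights,
                  alpha=1.0, beta=1.0):
    """Cross-entropy + alpha * feature-map term + beta * patch-embedding term.

    Each stage is a dict with "embed" and "feat" token matrices; `weights`
    holds one projection per stage and per source. alpha and beta are
    assumed balancing factors, not values taken from the paper.
    """
    loss = cross_entropy(logits, label)
    for s, t, w in zip(student_stages, teacher_stages, weights):
        loss += beta * mse(project(s["embed"], w["embed"]), t["embed"])
        loss += alpha * mse(project(s["feat"], w["feat"]), t["feat"])
    return loss

# Toy usage: one stage, one token, 2 student channels -> 3 teacher channels.
W = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
student = [{"embed": [[1.0, 2.0]], "feat": [[1.0, 2.0]]}]
teacher = [{"embed": [[1.0, 2.0, 3.0]], "feat": [[1.0, 2.0, 4.0]]}]
total = combined_loss([0.0, 0.0], 0, student, teacher,
                      [{"embed": W, "feat": W}], alpha=0.5, beta=1.0)
```

In a real setup the projections would be trained jointly with the student and the stage list would cover all four transformer stages; the toy example uses a single stage only to keep the arithmetic easy to follow.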

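Experimental Results

Research Questions

RQ1: How can semantic segmentation knowledge be transferred from a large transformer teacher to a compact student transformer?
RQ2: Does distilling transformer-specific patch embeddings yield gains that feature map distillation alone cannot?
RQ3: Can cross-stage feature fusion and embedding alignment effectively close the teacher-student gap without long pre-training?
RQ4: How do TransKD variants perform across datasets (Cityscapes, ACDC, NYUv2, Pascal VOC2012) and backbone models?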

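Key Results

TransKD reduces FLOPs by more than 85% while maintaining competitive accuracy.
TransKD-Base improves mIoU by 5.18% over KR-based distillation with 0.21M additional trainable parameters.
On Cityscapes, it improves the mIoU of the non-pre-trained SegFormer-B0 by 13.12% and that of the pre-trained SegFormer-B0 by 2.09%.
TransKD consistently improves accuracy across transformer architectures and datasets.
The best TransKD variant reaches 75.74% mIoU with 3.72M parameters.
TransKD rivals the time-consuming pre-training approach in performance while being more efficient.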

This review was generated by AI and reviewed by a human editor.