QUICK REVIEW

[Paper Review] TransKD: Transformer Knowledge Distillation for Efficient Semantic Segmentation

Ruiping Liu, Kailun Yang|arXiv (Cornell University)|Feb 27, 2022

Advanced Neural Network Applications | Cited by 27

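One-line summary

TransKD distills both transformer patch embeddings and feature maps from a large teacher model to achieve efficient semantic segmentation. It cuts FLOPs by more than 85% while keeping accuracy competitive.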

ABSTRACT

Semantic segmentation benchmarks in the realm of autonomous driving are dominated by large pre-trained transformers, yet their widespread adoption is impeded by substantial computational costs and prolonged training durations. To lift this constraint, we look at efficient semantic segmentation from a perspective of comprehensive knowledge distillation and aim to bridge the gap between multi-source knowledge extractions and transformer-specific patch embeddings. We put forward the Transformer-based Knowledge Distillation (TransKD) framework which learns compact student transformers by distilling both feature maps and patch embeddings of large teacher transformers, bypassing the long pre-training process and reducing the FLOPs by >85.0%. Specifically, we propose two fundamental modules to realize feature map distillation and patch embedding distillation, respectively: (1) Cross Selective Fusion (CSF) enables knowledge transfer between cross-stage features via channel attention and feature map distillation within hierarchical transformers; (2) Patch Embedding Alignment (PEA) performs dimensional transformation within the patchifying process to facilitate the patch embedding distillation. Furthermore, we introduce two optimization modules to enhance the patch embedding distillation from different perspectives: (1) Global-Local Context Mixer (GL-Mixer) extracts both global and local information of a representative embedding; (2) Embedding Assistant (EA) acts as an embedding method to seamlessly bridge teacher and student models with the teacher's number of channels. Experiments on Cityscapes, ACDC, NYUv2, and Pascal VOC2012 datasets show that TransKD outperforms state-of-the-art distillation frameworks and rivals the time-consuming pre-training method. The source code is publicly available at https://github.com/RuipingL/TransKD.

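Motivation and objectives

Make transformer-based semantic segmentation efficient by reducing its dependence on long pre-training.
Develop a comprehensive distillation framework that exploits transformer-specific patch embeddings alongside feature maps.
Bridge the teacher-student gap through new modules that enable multi-source knowledge transfer.
Show that jointly distilling patch embeddings and feature maps improves segmentation of hard cases.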

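Proposed method

Distill both patch embeddings and feature maps from teacher to student across the four transformer stages.
Introduce Patch Embedding Alignment (PEA), which uses a learnable projection to match the channel dimensions of the embeddings.
Use the Global-Local Context Mixer (GL-Mixer) to capture both the global and the local context of the embeddings for distillation.
Apply Cross Selective Fusion (CSF), which fuses cross-stage feature maps via channel attention for relation-based feature map distillation.
Incorporate the Embedding Assistant (EA), which bridges teacher and student channels by forming a pseudo-assistant model without an additional training stage.
Combine patch embedding distillation (PEA/GL-Mixer/EA) and feature map distillation (CSF) with cross-entropy into a single loss.
Evaluate TransKD variants on Cityscapes, ACDC, NYUv2, and Pascal VOC2012, showing gains over KD baselines and competitiveness with pre-training.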
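
To make the channel-alignment and combined-loss steps above concrete, here is a minimal pure-Python sketch, not the authors' implementation. It assumes the alignment is a plain linear projection from student channels to teacher channels and that the patch embedding term is a mean squared error summed over stages. The function names, the loss weights `alpha` and `beta`, and the toy shapes are all hypothetical. The feature map term is passed in as a precomputed number because its exact form is not spelled out in this review.

```python
import random


def project(tokens, weight):
    """Linear projection: tokens is N x C_s, weight is C_s x C_t (assumed form of the alignment)."""
    return [
        [sum(tok[i] * weight[i][j] for i in range(len(weight)))
         for j in range(len(weight[0]))]
        for tok in tokens
    ]


def mse(a, b):
    """Mean squared error between two equally shaped N x C nested lists."""
    count = len(a) * len(a[0])
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / count


def embedding_distill_loss(student_stages, teacher_stages, weights):
    """Sum of per-stage MSE after projecting student embeddings to teacher width."""
    return sum(
        mse(project(s, w), t)
        for s, t, w in zip(student_stages, teacher_stages, weights)
    )


def total_loss(ce_loss, embed_loss, feat_loss, alpha=1.0, beta=1.0):
    """Cross-entropy plus weighted distillation terms; alpha and beta are hypothetical."""
    return ce_loss + alpha * embed_loss + beta * feat_loss


# Toy example: four stages, 3 tokens each, student width 2, teacher width 4.
rng = random.Random(0)
student = [[[rng.random() for _ in range(2)] for _ in range(3)] for _ in range(4)]
teacher = [[[rng.random() for _ in range(4)] for _ in range(3)] for _ in range(4)]
proj = [[[rng.random() for _ in range(4)] for _ in range(2)] for _ in range(4)]

embed = embedding_distill_loss(student, teacher, proj)
print(total_loss(ce_loss=0.7, embed_loss=embed, feat_loss=0.1))
```

In a real training loop the projection weights would be learned together with the student, and the teacher would stay frozen. The sketch only shows how the terms fit together.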

Figure 3: (a)-(c) Knowledge distillation in computer vision is split into three categories [ 16 ] : response-based knowledge distillation, feature-based knowledge distillation, and relation-based knowledge distillation. (d) TransKD extracts the relation-based knowledge of feature maps and transforme

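Experimental results

Research questions

RQ1: How can knowledge for semantic segmentation be transferred from a large transformer teacher to a compact student transformer?
RQ2: Does adding transformer-specific patch embedding distillation yield performance gains that feature map distillation alone cannot provide?
RQ3: Can cross-stage feature fusion and embedding alignment effectively bridge the teacher-student gap without long pre-training?
RQ4: How do TransKD variants perform across different datasets (Cityscapes, ACDC, NYUv2, Pascal VOC2012) and backbone models?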

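Key findings

TransKD reduces FLOPs by more than 85% while maintaining competitive accuracy.
TransKD-Base outperforms KR-based distillation by 5.18% mIoU while adding only 0.21M trainable parameters.
On Cityscapes, it improves the mIoU of the non-pre-trained SegFormer-B0 by 13.12% and that of the pre-trained one by 2.09%.
TransKD consistently improves accuracy across multiple transformer architectures and datasets.
The best TransKD variant reaches 75.74% mIoU with 3.72M parameters.
TransKD rivals the time-consuming pre-training method in performance while being more efficient.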
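
The findings above are all reported in mIoU. As a reminder of how that metric is commonly computed, here is a minimal sketch over flattened label lists. It assumes an ignore label of 255 (a common convention, not stated in this review) and skips classes that appear in neither the predictions nor the ground truth. The function names are hypothetical.

```python
def confusion_matrix(pred, target, num_classes, ignore_index=255):
    """Count (true class, predicted class) pairs, skipping ignored pixels."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        cm[t][p] += 1
    return cm


def mean_iou(cm):
    """Average per-class IoU = TP / (TP + FP + FN) over classes that occur."""
    ious = []
    for c in range(len(cm)):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                  # true class c, predicted otherwise
        fp = sum(row[c] for row in cm) - tp   # predicted c, true class otherwise
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: four pixels, two classes, last pixel ignored.
pred = [0, 0, 1, 1]
target = [0, 1, 1, 255]
print(mean_iou(confusion_matrix(pred, target, num_classes=2)))
```

Benchmark toolkits usually accumulate the confusion matrix over the whole validation set before averaging, rather than averaging per-image scores.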

Figure 4: Our knowledge distillation framework TransKD. It is divided into two parts: knowledge distillation of patch embeddings (red arrows and rectangles) and feature maps (green arrows and rectangles). The loss function consists of two distillation terms (HCL and MSE) and a cross-entropy term. (a


This review was generated by AI and checked by a human editor.