QUICK REVIEW

[Paper Review] Structural Knowledge Distillation for Object Detection

Philip de Rijk, Lukas Schneider|arXiv (Cornell University)|Nov 23, 2022

Advanced Neural Network Applications | Cited by 21

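One-sentence summary

This paper replaces pixel-wise Lp feature distillation with an SSIM-based loss that captures luminance, contrast, and structure, achieving consistent AP gains for RetinaNet and Faster R-CNN on MSCOCO and often outperforming state-of-the-art KD methods.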

ABSTRACT

Knowledge Distillation (KD) is a well-known training paradigm in deep neural networks where knowledge acquired by a large teacher model is transferred to a small student. KD has proven to be an effective technique to significantly improve the student's performance for various tasks including object detection. As such, KD techniques mostly rely on guidance at the intermediate feature level, which is typically implemented by minimizing an lp-norm distance between teacher and student activations during training. In this paper, we propose a replacement for the pixel-wise independent lp-norm based on the structural similarity (SSIM). By taking into account additional contrast and structural cues, feature importance, correlation and spatial dependence in the feature space are considered in the loss formulation. Extensive experiments on MSCOCO demonstrate the effectiveness of our method across different training schemes and architectures. Our method adds only little computational overhead, is straightforward to implement and at the same time it significantly outperforms the standard lp-norms. Moreover, more complex state-of-the-art KD methods using attention-based sampling mechanisms are outperformed, including a +3.5 AP gain using a Faster R-CNN R-50 compared to a vanilla model.

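Motivation and objectives

Narrow the knowledge gap between a large teacher and a compact student model without resorting to complex sampling schemes.
Introduce an SSIM-based feature distillation loss that captures the local mean, variance, and correlation between teacher and student features.
Demonstrate that SSIM-based KD delivers strong detection performance across multiple architectures and training settings on MSCOCO.
The proposed method is lightweight (a one-line code change) and can outperform state-of-the-art KD methods that rely on attention-based sampling.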

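Proposed method

Replace conventional Lp feature distillation with an SSIM-based loss that captures local luminance (mean), contrast (variance), and structure (correlation).
Compute the three SSIM components (luminance, contrast, structure) over 11x11 Gaussian-weighted patches and combine them into the loss L_ssim.
Before applying the distillation loss, normalize the teacher and student features and adapt them where needed (min-max normalization and a 1x1 convolution).
Combine the distillation loss with the original detection loss as L = lambda * L_feat + L_det, where lambda is a tunable weight.
Train and evaluate RetinaNet and Faster R-CNN with ResNet/ResNeXt backbones on MSCOCO using PyTorch/MMDetection2.
Run ablations covering the luminance/contrast/structure weights (alpha, beta, gamma), the patch size, and the presence or absence of the adaptation layer.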
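
The steps above can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it uses NumPy instead of PyTorch so that it stays self-contained, handles a single-channel 2-D feature map, omits the 1x1 adaptation convolution, and assumes the usual SSIM conventions that the review does not state (sigma = 1.5, C1 = 0.01^2, C2 = 0.03^2, C3 = C2 / 2, and the loss taken as 1 minus the mean SSIM). All function names are hypothetical.

```python
import numpy as np


def gaussian_window(size=11, sigma=1.5):
    """2-D Gaussian weights summing to 1. sigma=1.5 is an assumed default."""
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    w = np.outer(g, g)
    return w / w.sum()


def minmax_normalize(f, eps=1e-12):
    """Rescale a feature map to approximately [0, 1]."""
    return (f - f.min()) / (f.max() - f.min() + eps)


def local_mean(f, window):
    """Gaussian-weighted mean of every full patch (no padding)."""
    k = window.shape[0]
    patches = np.lib.stride_tricks.sliding_window_view(f, (k, k))
    return np.einsum("ijkl,kl->ij", patches, window)


def ssim_feature_loss(teacher, student, alpha=1.0, beta=1.0, gamma=1.0,
                      size=11, c1=0.01 ** 2, c2=0.03 ** 2):
    """1 - mean SSIM between two 2-D maps of at least size x size.

    Keep alpha, beta, gamma at 0 or 1 in this sketch: the structure term
    can be negative, so fractional exponents would produce NaN.
    """
    x, y = minmax_normalize(teacher), minmax_normalize(student)
    w = gaussian_window(size)
    mu_x, mu_y = local_mean(x, w), local_mean(y, w)
    # Local variances (clipped at 0 against rounding) and covariance.
    var_x = np.maximum(local_mean(x * x, w) - mu_x ** 2, 0.0)
    var_y = np.maximum(local_mean(y * y, w) - mu_y ** 2, 0.0)
    cov_xy = local_mean(x * y, w) - mu_x * mu_y
    sd_xy = np.sqrt(var_x * var_y)  # product of the two standard deviations
    c3 = c2 / 2.0
    lum = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)  # luminance
    con = (2 * sd_xy + c2) / (var_x + var_y + c2)                # contrast
    struct = (cov_xy + c3) / (sd_xy + c3)                        # structure
    ssim = lum ** alpha * con ** beta * struct ** gamma
    return float(1.0 - ssim.mean())


def total_loss(det_loss, feat_loss, lam):
    """L = lambda * L_feat + L_det, as stated in the review."""
    return lam * feat_loss + det_loss
```

Because both maps are min-max normalized first, a student map that matches the teacher up to a positive scale and offset gives a loss close to zero in this sketch. Setting alpha = beta = 0 and gamma = 1 gives a structure-only variant in the spirit of the ablation, though the paper's exact ablation setup may differ.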

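Experimental results

Research questions

RQ1: Does SSIM-based distillation outperform conventional Lp-based feature distillation for object detectors?
RQ2: How do the luminance, contrast, and structure components contribute to knowledge transfer and detection performance?
RQ3: Is SSIM-based KD robust across different detector architectures and training schedules?
RQ4: Can effective KD be implemented with a single-line code change, without complex sampling mechanisms?
RQ5: How does SSIM-based KD compare with state-of-the-art KD methods that rely on attention-based sampling?

Key findings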

Backbone / Method	AP	AP50	AP75	AP_S	AP_M	AP_L
RetinaNet-R50 (Ours, SSIM)	40.1	59.2	43.1	23.1	44.6	53.2
RetinaNet-R50 (L2)	36.8	55.7	39.1	20.6	40.5	47.3
RetinaNet-R50 (L1)	38.7	57.6	41.6	22.7	42.7	50.5
Faster R-CNN-R50 (Ours, SSIM)	40.9	61.0	44.9	23.7	44.5	53.5
Faster R-CNN-R50 (L2)	37.4	57.6	40.9	21.2	41.3	48.1
Faster R-CNN-R50 (L1)	38.6	58.8	42.1	21.8	42.1	49.9

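SSIM-based distillation outperforms Lp norms by up to 3.7 AP for RetinaNet and Faster R-CNN on MSCOCO.
Using SSIM yields an error signal that is distributed more broadly across the feature space, guiding the student toward the teacher more effectively.
The structure component (gamma) has the strongest effect: the gamma-only configuration achieves up to +3.2 AP.
Combining SSIM-based KD with various backbones and detectors improves AP consistently across small, medium, and large objects (AP_S, AP_M, AP_L).
The method matches or exceeds state-of-the-art KD methods such as Zhang and Ma and Kang et al. in AP gain, and often outperforms them on large objects (AP_L).
The adaptation layer helps when the teacher and student architectures differ, but may otherwise be optional.
The method can be applied to an existing pipeline with a one-line change (replacing L2 with L_ssim).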
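
The toy example below shows why a structure-aware term can behave differently from a pixel-wise norm. It is only an illustration under stated assumptions: the values are hand-picked rather than taken from the paper, the statistics are computed globally over a short 1-D signal instead of over 11x11 Gaussian patches, and C3 = 0.03^2 / 2 is an assumed constant. Two hypothetical student signals have the same mean squared error against the teacher, yet only one of them preserves the teacher's pattern.

```python
import math


def mse(x, y):
    """Pixel-wise mean squared error (an L2-style distance)."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


def structure(x, y, c3=0.03 ** 2 / 2):
    """Global SSIM structure term: (cov + C3) / (sd_x * sd_y + C3)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sd_x = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sd_y = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    return (cov + c3) / (sd_x * sd_y + c3)


# Teacher activations with a simple alternating pattern.
teacher = [0.4, 0.6, 0.4, 0.6, 0.4, 0.6, 0.4, 0.6]

# Student A: same pattern, constant offset of +0.2.
shifted = [t + 0.2 for t in teacher]

# Student B: errors of the same size, but with signs (+, +, -, -) that
# disturb the alternating pattern.
scrambled = [t + (0.2 if i % 4 < 2 else -0.2) for i, t in enumerate(teacher)]

for name, student in (("shifted", shifted), ("scrambled", scrambled)):
    print(name, round(mse(teacher, student), 4),
          round(structure(teacher, student), 3))
```

Both students sit at the same pixel-wise distance from the teacher, so an L2 loss cannot tell them apart, while the structure term stays near 1 for the shifted signal and drops well below 1 for the scrambled one.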

This review was generated by AI and checked by a human editor.