
[Paper Review] Structural Knowledge Distillation for Object Detection

Philip de Rijk, Lukas Schneider | arXiv (Cornell University) | Nov 23, 2022
Advanced Neural Network Applications | Cited by 21
One-sentence summary

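The paper replaces pixel-wise Lp feature distillation with an SSIM-based loss that captures luminance, contrast, and structure, yielding consistent AP gains for RetinaNet and Faster R-CNN on MSCOCO and often outperforming existing KD methods.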

ABSTRACT

Knowledge Distillation (KD) is a well-known training paradigm in deep neural networks where knowledge acquired by a large teacher model is transferred to a small student. KD has proven to be an effective technique to significantly improve the student's performance for various tasks including object detection. As such, KD techniques mostly rely on guidance at the intermediate feature level, which is typically implemented by minimizing an lp-norm distance between teacher and student activations during training. In this paper, we propose a replacement for the pixel-wise independent lp-norm based on the structural similarity (SSIM). By taking into account additional contrast and structural cues, feature importance, correlation and spatial dependence in the feature space are considered in the loss formulation. Extensive experiments on MSCOCO demonstrate the effectiveness of our method across different training schemes and architectures. Our method adds only little computational overhead, is straightforward to implement and at the same time it significantly outperforms the standard lp-norms. Moreover, more complex state-of-the-art KD methods using attention-based sampling mechanisms are outperformed, including a +3.5 AP gain using a Faster R-CNN R-50 compared to a vanilla model.

Motivation and objectives

  • Reduce the knowledge gap between large teachers and compact students in object detection without resorting to complex sampling schemes.
  • Introduce a feature-based distillation loss based on SSIM to capture local mean, variance, and cross-correlation between teacher and student features.
  • Demonstrate that SSIM-based KD yields superior detection performance across multiple architectures and training setups on MSCOCO.
  • Show that the proposed method is lightweight (one-line code change) and can outperform state-of-the-art KD methods relying on attention-based sampling.

Proposed method

  • Replace the conventional Lp feature distillation with an SSIM-based loss capturing local luminance (mean), contrast (variance), and structure (cross-correlation).
  • Compute three SSIM components (luminance, contrast, structure) over 11x11 Gaussian patches to form a combined loss L_ssim.
  • Normalize and optionally adapt teacher/student features (min-max normalization and a 1x1 conv) before applying the distillation loss.
  • Combine distillation loss with the original detection loss as L = lambda * L_feat + L_det, where lambda is a tunable weight.
  • Evaluate on MSCOCO using RetinaNet and Faster R-CNN with ResNet/ResNeXt backbones, training in PyTorch/MMDetection2.
  • Explore ablations including influence of luminance/contrast/structure (alpha, beta, gamma), patch size, and presence of adaptation layers.
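
The steps listed above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function names, the toy 12x12 single-channel "feature maps", the Gaussian sigma of 1.5, the stabilizing constants (the conventional SSIM defaults for data in [0, 1]), the (1 - SSIM) / 2 scaling of the loss, and the value of lambda are all assumptions made for illustration. The 1x1 adaptation convolution is omitted, and a real pipeline would run the same arithmetic on multi-channel PyTorch tensors.

```python
import math


def gaussian_window(size=11, sigma=1.5):
    """Return a normalized size x size Gaussian window (weights sum to 1)."""
    half = (size - 1) / 2.0
    g = [math.exp(-((i - half) ** 2) / (2.0 * sigma ** 2)) for i in range(size)]
    total = sum(g) ** 2
    return [[gi * gj / total for gj in g] for gi in g]


def min_max_normalize(fmap):
    """Rescale a 2D feature map to [0, 1]; a constant map becomes all zeros."""
    flat = [v for row in fmap for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0] * len(row) for row in fmap]
    return [[(v - lo) / (hi - lo) for v in row] for row in fmap]


def ssim_loss(teacher, student, size=11, sigma=1.5,
              alpha=1.0, beta=1.0, gamma=1.0, c1=0.01 ** 2, c2=0.03 ** 2):
    """Mean of (1 - SSIM) / 2 over every full window of two equally sized 2D maps.

    alpha, beta, gamma weight luminance, contrast, and structure. Keep them at
    integer values (e.g. 0 or 1): the structure term can be negative.
    """
    h, w = len(teacher), len(teacher[0])
    if h < size or w < size:
        raise ValueError("feature map is smaller than the window")
    win = gaussian_window(size, sigma)
    c3 = c2 / 2.0
    losses = []
    for top in range(h - size + 1):
        for left in range(w - size + 1):
            # (weight, teacher value, student value) for each cell in the window.
            cells = [(win[i][j], teacher[top + i][left + j], student[top + i][left + j])
                     for i in range(size) for j in range(size)]
            mu_t = sum(wt * t for wt, t, _ in cells)          # local means
            mu_s = sum(wt * s for wt, _, s in cells)
            var_t = sum(wt * (t - mu_t) ** 2 for wt, t, _ in cells)  # local variances
            var_s = sum(wt * (s - mu_s) ** 2 for wt, _, s in cells)
            cov = sum(wt * (t - mu_t) * (s - mu_s) for wt, t, s in cells)
            sd_t, sd_s = math.sqrt(var_t), math.sqrt(var_s)
            lum = (2 * mu_t * mu_s + c1) / (mu_t ** 2 + mu_s ** 2 + c1)
            con = (2 * sd_t * sd_s + c2) / (var_t + var_s + c2)
            struct = (cov + c3) / (sd_t * sd_s + c3)
            ssim = lum ** alpha * con ** beta * struct ** gamma
            losses.append((1.0 - ssim) / 2.0)
    return sum(losses) / len(losses)


# Toy single-channel 12x12 "feature maps" (hypothetical data).
raw = [[math.sin(0.5 * i) * math.cos(0.3 * j) for j in range(12)] for i in range(12)]
teacher = min_max_normalize(raw)
student = [[1.0 - v for v in row] for row in teacher]  # deliberately inverted structure

l_feat = ssim_loss(teacher, student)
l_det = 0.8   # placeholder for the detector's own loss
lam = 2.0     # hypothetical weight; the summary does not give a value
total = lam * l_feat + l_det   # L = lambda * L_feat + L_det
print(f"L_feat={l_feat:.4f}  L={total:.4f}")
```

Because each window's mean, variance, and covariance are computed jointly, a student map with the same per-pixel magnitude of error but the wrong spatial pattern is penalized more heavily than it would be under a pixel-wise Lp norm, which treats every location independently.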

Experimental results

Research questions

  • RQ1: Does SSIM-based distillation outperform traditional Lp-based feature distillation for object detectors?
  • RQ2: How do the luminance, contrast, and structure components contribute to knowledge transfer and detection performance?
  • RQ3: Is SSIM-based KD robust across different detector architectures and training schedules?
  • RQ4: Can a simple one-line code change implement effective KD without complex sampling mechanisms?
  • RQ5: How does SSIM-based KD compare to state-of-the-art KD methods that rely on attention-based sampling?

Key findings

Backbone / Method | AP | AP50 | AP75 | AP_S | AP_M | AP_L
RetinaNet-R50 (Ours, SSIM) | 40.1 | 59.2 | 43.1 | 23.1 | 44.6 | 53.2
RetinaNet-R50 (L2) | 36.8 | 55.7 | 39.1 | 20.6 | 40.5 | 47.3
RetinaNet-R50 (L1) | 38.7 | 57.6 | 41.6 | 22.7 | 42.7 | 50.5
Faster R-CNN-R50 (Ours, SSIM) | 40.9 | 61.0 | 44.9 | 23.7 | 44.5 | 53.5
Faster R-CNN-R50 (L2) | 37.4 | 57.6 | 40.9 | 21.2 | 41.3 | 48.1
Faster R-CNN-R50 (L1) | 38.6 | 58.8 | 42.1 | 21.8 | 42.1 | 49.9

  • SSIM-based distillation outperforms Lp norms by up to 3.7 AP on MSCOCO across RetinaNet and Faster R-CNN.
  • Using SSIM yields more distributed error signaling across the feature space, guiding the student toward the teacher more effectively.
  • The structure component (gamma) has the strongest positive impact, with gamma-only configurations achieving up to +3.2 AP.
  • Integrating SSIM-based KD with various backbones and detectors consistently improves AP across S, M, and L object sizes.
  • The method matches or surpasses state-of-the-art KD methods (e.g., Zhang and Ma; Kang et al.) in AP gains, often with better large-object performance (AP_L).
  • Adaptation layers are beneficial when teacher and student architectures differ; otherwise, they may be optional.
  • A single line change (replacing L2 with L_ssim) suffices to deploy the method in existing pipelines.
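
For reference, the alpha, beta, and gamma weights mentioned in the findings refer to the exponents in the standard SSIM decomposition, which can be written as below. The symbols (local means \mu, standard deviations \sigma, cross-covariance \sigma_{xy}, and small stabilizing constants C_1, C_2, C_3) follow the usual SSIM notation and are not taken from this summary; x and y stand for a teacher and a student feature window.

```latex
\begin{aligned}
l(x,y) &= \frac{2\mu_x\mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}, \\
c(x,y) &= \frac{2\sigma_x\sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}, \\
s(x,y) &= \frac{\sigma_{xy} + C_3}{\sigma_x\sigma_y + C_3}, \\
\mathrm{SSIM}(x,y) &= l(x,y)^{\alpha}\, c(x,y)^{\beta}\, s(x,y)^{\gamma}.
\end{aligned}
```

Under this reading, a "gamma-only" configuration corresponds to \alpha = \beta = 0, so that only the structure term s(x, y), a normalized cross-correlation between teacher and student features, drives the loss.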

This review was generated by AI and reviewed by human editors.