[Paper Interpretation] Regularizing Class-wise Predictions via Self-knowledge Distillation
Introduces class-wise self-knowledge distillation (CS-KD), a regularizer that aligns predictive distributions of same-class samples within a single network to improve generalization and calibration.
Deep neural networks with millions of parameters may suffer from poor generalization due to overfitting. To mitigate the issue, we propose a new regularization method that penalizes the predictive distribution between similar samples. In particular, we distill the predictive distribution between different samples of the same label during training. This results in regularizing the dark knowledge (i.e., the knowledge on wrong predictions) of a single network (i.e., a self-knowledge distillation) by forcing it to produce more meaningful and consistent predictions in a class-wise manner. Consequently, it mitigates overconfident predictions and reduces intra-class variations. Our experimental results on various image classification tasks demonstrate that the simple yet powerful method can significantly improve not only the generalization ability but also the calibration performance of modern convolutional neural networks.
Research Motivation and Objectives
- Motivate regularization to curb overfitting in large neural networks.
- Propose CS-KD to regularize dark knowledge within a single network.
- Show that class-wise distillation reduces intra-class variation and improves calibration.
- Evaluate CS-KD across CIFAR-100, TinyImageNet, and fine-grained datasets with CNNs.
Proposed Method
- Define a class-wise KL divergence loss that matches predictive distributions of two samples with the same label.
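- Compute the target distribution with a fixed copy of the network parameters, so the target is treated as a constant during backpropagation; this stabilizes training and makes the network its own teacher (self-distillation).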
- Add the class-wise KL term to the cross-entropy loss on the original samples; the KL term uses softmax temperature T and is weighted by lambda_cls.
- Train end-to-end with SGD and standard data augmentations; temperature and lambda_cls are hyperparameters.
- Optionally extend with an augmented-input loss CS-KD-E that adds a KL term between original and augmented samples.
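The loss described in the list above can be sketched in plain Python for a single pair of same-class samples. This is a minimal illustration, not the authors' implementation: the function names (`softmax`, `kl_divergence`, `cross_entropy`, `cs_kd_loss`, `sample_same_class_partner`) are hypothetical, the default values of `temperature` and `lambda_cls` are arbitrary placeholders, and the `temperature ** 2` factor follows the common knowledge-distillation convention and is an assumption here. In an autodiff framework the target distribution would be detached from the graph, which is what the "fixed copy" of the parameters amounts to.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def cross_entropy(logits, label):
    """Standard cross-entropy of one sample against its integer label."""
    return -math.log(softmax(logits)[label])


def sample_same_class_partner(labels, index, rng=random):
    """Pick the index of another sample sharing labels[index] (hypothetical helper)."""
    candidates = [j for j, y in enumerate(labels)
                  if y == labels[index] and j != index]
    # Fall back to the sample itself if its class has no other member.
    return rng.choice(candidates) if candidates else index


def cs_kd_loss(logits_x, logits_partner, label, temperature=4.0, lambda_cls=1.0):
    """Cross-entropy on x plus a weighted class-wise KL term.

    logits_partner are the logits of a different sample with the same label.
    They act as a fixed target: no gradient would flow through them.
    """
    target = softmax(logits_partner, temperature)  # treated as a constant
    pred = softmax(logits_x, temperature)
    class_wise_term = kl_divergence(target, pred)
    # The T^2 scaling is an assumption borrowed from standard distillation practice.
    return cross_entropy(logits_x, label) + lambda_cls * temperature ** 2 * class_wise_term
```

When the two samples already produce identical logits, the class-wise term vanishes and the loss reduces to plain cross-entropy; the further apart the two softened predictions are, the larger the penalty.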
Experimental Results
Research Questions
- RQ1: Can enforcing consistency between predictions of same-class samples within a single model improve generalization?
- RQ2: Does CS-KD reduce intra-class prediction variance and improve calibration?
- RQ3: How does CS-KD perform relative to other output regularizers and self-distillation methods on diverse datasets?
- RQ4: Can CS-KD complement Mixup and KD to further improve performance?
- RQ5: Is CS-KD scalable to large-scale datasets like ImageNet across multiple architectures?
Key Findings
| Method | Description | CIFAR-100 | TinyImageNet | CUB-200-2011 | Stanford Dogs | MIT67 |
|---|---|---|---|---|---|---|
| Cross-entropy | Baseline | 24.71 ± 0.24 | 43.53 ± 0.19 | 46.00 ± 1.43 | 36.29 ± 0.32 | 44.75 ± 0.80 |
| CS-KD (ours) | Class-wise self-knowledge distillation | 21.99 ± 0.13 | 41.62 ± 0.38 | 33.28 ± 0.99 | 30.85 ± 0.28 | 40.45 ± 0.45 |
- CS-KD consistently lowers top-1 errors compared with cross-entropy and other regularizers across multiple datasets.
- On CIFAR-100, CS-KD achieves 21.99% top-1 error versus 24.71% with cross-entropy for ResNet-18.
- CS-KD improves calibration as shown by lower ECE values and more reliable confidence estimates.
- Combining CS-KD with Mixup or KD yields additional gains (e.g., Mixup + CS-KD reduces top-1 error on CIFAR-100 to 20.40%).
- CS-KD reduces intra-class variations in feature space and yields more meaningful predictions, as shown by improved R@1 and t-SNE visualizations.
- On ImageNet, CS-KD provides consistent top-1 improvements across ResNet-50, ResNet-101, and ResNeXt-101-32x4d.
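The calibration finding above is reported in terms of expected calibration error (ECE). As a rough illustration of what that metric measures, the sketch below computes ECE with equal-width confidence bins. The function name `expected_calibration_error`, the default of 10 bins, and the toy inputs in the usage note are assumptions for illustration; the summary does not state the exact binning used in the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted average gap between confidence and accuracy per bin.

    confidences: max softmax probability of each prediction, in [0, 1].
    correct: booleans, True where the prediction matched the label.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Equal-width bins; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, h in members if h) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece
```

For example, four predictions made with confidence 0.9 of which only one is correct give an ECE of about 0.65, reflecting strong overconfidence, whereas predictions whose confidence matches their accuracy in every bin give an ECE of 0.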
This interpretation was generated by AI and reviewed by human editors.