[Paper Review] CompRess: Self-Supervised Learning by Compressing Representations
This paper proposes CompRess, a self-supervised model compression method that transfers knowledge from a large, pre-trained self-supervised teacher model (e.g., SimCLR ResNet-50x4) to a smaller student model (e.g., AlexNet) by mimicking the relative similarity rankings of data points in the teacher's embedding space. On ImageNet, the compressed AlexNet outperforms all previous self-supervised AlexNet results and even the fully supervised AlexNet, both on linear evaluation (59.0% vs. 56.5%) and on nearest neighbor evaluation (50.7% vs. 41.4%). According to the authors, this is the first time a self-supervised AlexNet has surpassed its supervised counterpart on ImageNet classification itself.
Self-supervised learning aims to learn good representations with unlabeled data. Recent works have shown that larger models benefit more from self-supervised learning than smaller models. As a result, the gap between supervised and self-supervised learning has been greatly reduced for larger models. In this work, instead of designing a new pseudo task for self-supervised learning, we develop a model compression method to compress an already learned, deep self-supervised model (teacher) to a smaller one (student). We train the student model so that it mimics the relative similarity between the data points in the teacher's embedding space. For AlexNet, our method outperforms all previous methods, including the fully supervised model, on ImageNet linear evaluation (59.0% compared to 56.5%) and on nearest neighbor evaluation (50.7% compared to 41.4%). To the best of our knowledge, this is the first time a self-supervised AlexNet has outperformed the supervised one on ImageNet classification. Our code is available here: https://github.com/UMBCvision/CompRess
Motivation & Objective
- To develop a model compression method that transfers knowledge from a large self-supervised teacher model to a smaller student model without requiring labels.
- To improve the performance of small, efficient models for downstream tasks like ImageNet classification by leveraging knowledge from deeper, self-supervised teachers.
- To enable privacy-preserving, on-device inference by compressing self-supervised models that generalize well without requiring data upload.
Proposed method
- The student model is trained to mimic the relative similarity rankings of data points in the teacher's embedding space, using a soft probability distribution derived from nearest neighbor distances.
- For each query image, the teacher computes distances to all anchor points in a memory bank, converts them to a probability distribution via a temperature-scaled softmax, and this distribution is used as a target for distillation.
- The method uses a momentum-based update for the memory bank in the 'Ours-2q' variant, improving stability and performance.
- The student model is trained using cross-entropy loss between its own similarity distribution and the teacher’s soft target distribution.
- The approach avoids direct contrastive learning or hard positive/negative pair supervision, instead focusing on preserving the relative ranking of similar and dissimilar samples.
- The method is evaluated using linear evaluation, nearest neighbor classification, and cluster alignment, with no hyperparameter tuning for the evaluation protocols.
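The distillation objective described in the list above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy embeddings, and the temperature value of 0.1 are all hypothetical, cosine similarity is assumed as the similarity measure, and a real training loop would use a tensor library and a large memory bank rather than three hand-written anchors.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def similarity_distribution(query, anchors, temperature):
    """Softmax over temperature-scaled cosine similarities of one query to all anchors."""
    q = l2_normalize(query)
    logits = []
    for anchor in anchors:
        a = l2_normalize(anchor)
        logits.append(sum(x * y for x, y in zip(q, a)) / temperature)
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - max_logit) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_probs, student_probs, eps=1e-12):
    """Cross-entropy between the teacher's soft target and the student's distribution.

    This differs from the KL divergence only by the teacher's entropy,
    which does not depend on the student.
    """
    return -sum(p * math.log(q + eps) for p, q in zip(teacher_probs, student_probs))


# Hypothetical toy data: the same three anchor images embedded by each network.
# Teacher and student embeddings may have different dimensionality.
teacher_anchors = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [-1.0, 0.0, 0.0]]
student_anchors = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
teacher_query = [1.0, 0.0, 0.0]
student_query = [0.5, 1.0]  # an untrained student that ranks the anchors differently

temperature = 0.1  # assumed value for illustration
teacher_probs = similarity_distribution(teacher_query, teacher_anchors, temperature)
student_probs = similarity_distribution(student_query, student_anchors, temperature)

loss_student = distillation_loss(teacher_probs, student_probs)
loss_self = distillation_loss(teacher_probs, teacher_probs)  # lower bound: student matches teacher
print(f"loss with mismatched student: {loss_student:.4f}")
print(f"loss if student matched teacher: {loss_self:.4f}")
```

In this toy setup the student ranks the second anchor highest while the teacher ranks the first anchor highest, so the loss is well above the value it would take if the two distributions agreed. Minimizing it pushes the student toward the teacher's neighborhood structure without using any labels.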
Experimental results
Research questions
- RQ1: Can knowledge distillation from a large self-supervised teacher model improve the performance of a smaller student model on downstream tasks like ImageNet classification?
- RQ2: Does compressing a self-supervised teacher model lead to better generalization than training a small model with supervised loss on the same data?
- RQ3: Can a self-supervised student model outperform a fully supervised model of the same architecture when evaluated on the ImageNet classification task itself?
- RQ4: How do hyperparameters like temperature and memory bank size affect the performance of the compressed student model?
- RQ5: Is the momentum update mechanism necessary for stable knowledge transfer in this compression setup?
Key findings
- The CompRess method achieves 59.0% top-1 accuracy on ImageNet linear evaluation using an AlexNet student model, outperforming the fully supervised AlexNet (56.5%).
- On nearest neighbor evaluation, the compressed AlexNet reaches 50.7% accuracy, significantly surpassing the supervised baseline (41.4%).
- These AlexNet results are obtained by compressing from a SimCLR ResNet-50x4 teacher, and they are state of the art among self-supervised AlexNet models.
- The ablation study shows that a small temperature (e.g., 0.1) and large memory bank size improve performance by focusing on local neighborhood structure.
- Caching the teacher’s features reduces training time by a factor of nearly 3 with only a 0.4% drop in nearest neighbor accuracy, making it practical for large-scale training.
- The method is robust to the removal of momentum in the memory bank update, with minimal performance drop, suggesting momentum is not essential for this setup.
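The temperature finding above can be illustrated with a small sketch: the same similarity scores are turned into a soft target at two temperatures, and the entropy of each target is compared. The similarity values and helper names are hypothetical and chosen only for illustration. They are not taken from the paper's ablation.

```python
import math


def softmax_with_temperature(similarities, temperature):
    """Convert similarity scores into a probability distribution."""
    scaled = [s / temperature for s in similarities]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - max_scaled) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; lower means the mass sits on fewer anchors."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical cosine similarities of one query to five anchors, sorted descending.
similarities = [0.9, 0.8, 0.3, 0.1, -0.2]

flat = softmax_with_temperature(similarities, temperature=1.0)
sharp = softmax_with_temperature(similarities, temperature=0.1)

for name, probs in (("temperature 1.0", flat), ("temperature 0.1", sharp)):
    top2_mass = probs[0] + probs[1]
    print(f"{name}: top-2 mass = {top2_mass:.3f}, entropy = {entropy(probs):.3f}")
```

With the smaller temperature, nearly all of the probability mass lands on the two closest anchors, so the student is supervised mostly on the query's local neighborhood. With a temperature of 1.0 the target is spread across all anchors, including distant ones.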
This review was created by AI and reviewed by human editors.