[Paper Review] Colorization as a Proxy Task for Visual Understanding
The paper presents self-supervised colorization as a drop-in substitute for ImageNet pretraining. It reports VOC results that are state-of-the-art among methods using no ImageNet labels, and it analyzes how loss formulation, architecture, and training choices affect the learned representations.
We investigate and improve self-supervision as a drop-in replacement for ImageNet pretraining, focusing on automatic colorization as the proxy task. Self-supervised training has been shown to be more promising for utilizing unlabeled data than other, traditional unsupervised learning methods. We build on this success and evaluate the ability of our self-supervised network in several contexts. On VOC segmentation and classification tasks, we present results that are state-of-the-art among methods not using ImageNet labels for pretraining representations. Moreover, we present the first in-depth analysis of self-supervision via colorization, concluding that formulation of the loss, training details and network architecture play important roles in its effectiveness. This investigation is further expanded by revisiting the ImageNet pretraining paradigm, asking questions such as: How much training data is needed? How many labels are needed? How much do features change when fine-tuned? We relate these questions back to self-supervision by showing that colorization provides a similarly powerful supervisory signal as various flavors of ImageNet pretraining.
Motivation & Objective
- Make the case for self-supervised learning as a way to exploit unlabeled data for visual understanding.
- Investigate colorization as a proxy task for learning transferable visual representations.
- Evaluate colorization-based pretraining on VOC classification and segmentation benchmarks.
- Analyze how loss formulation, architecture, and training details affect learned representations.
Proposed method
- Train a colorization network that predicts color from grayscale using L*a*b space and a histogram-based hue/chroma loss.
- Use hypercolumns with sparse training to learn representations efficiently.
- Pretrain on 3.7M unlabeled images (ImageNet + Places205) and transfer to downstream tasks.
- Systematically compare colorization pretraining to ImageNet pretraining across architectures and data regimes.
- Explore training details such as learning rate schedules, receptive field enlargement, and batch normalization handling.
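The hypercolumn and histogram-loss bullets above can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names, the nearest-neighbor feature lookup, the 32-bin default, and the assumption that hue and chroma values are already normalized to [0, 1) are all choices made for this example.

```python
import numpy as np

def sparse_hypercolumns(feature_maps, image_hw, num_samples, rng):
    """Gather multi-layer features at a random subset of pixel locations.

    feature_maps: list of arrays shaped (C_i, H_i, W_i), one per layer.
    Nearest-neighbor lookup is an assumption made to keep the sketch short.
    """
    height, width = image_hw
    ys = rng.integers(0, height, size=num_samples)
    xs = rng.integers(0, width, size=num_samples)
    columns = []
    for fmap in feature_maps:
        _, h, w = fmap.shape
        # Map image coordinates onto this layer's (possibly coarser) grid.
        columns.append(fmap[:, ys * h // height, xs * w // width].T)
    # One row per sampled pixel, with all layers' channels concatenated.
    return np.concatenate(columns, axis=1), ys, xs

def to_bin_targets(values, num_bins=32):
    """One-hot histograms for values assumed to lie in [0, 1)."""
    idx = np.clip((np.asarray(values) * num_bins).astype(int), 0, num_bins - 1)
    return np.eye(num_bins)[idx]

def histogram_loss(logits, targets):
    """Mean cross-entropy between softmax(logits) and target histograms."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-(targets * log_probs).sum(axis=-1).mean())
```

In this sketch only the sampled locations feed the loss, so the cost of building hypercolumns grows with `num_samples` rather than with the full image size. A hue/chroma variant would presumably call `histogram_loss` once for the hue bins and once for the chroma bins and combine the two terms; how the paper weights them is not stated in this review.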
Experimental results
Research questions
- RQ1: Can self-supervised colorization match or approach supervised ImageNet pretraining on VOC classification and segmentation?
- RQ2: How do loss formulation and architectural choices influence the quality of learned representations?
- RQ3: What is the impact of pretraining data size and label diversity on downstream performance?
- RQ4: How do colorization-derived representations shift during fine-tuning compared to those from purely supervised pretraining?
Key findings
- Colorization-based pretraining achieves 60.0% mIU on VOC 2012 Segmentation with ResNet-152 and extended field of view, the highest reported without ImageNet labels.
- For VOC 2007 Classification, colorization pretraining reaches 77.3% mAP, state-of-the-art among non-ImageNet methods.
- Predicting color histograms in hue/chroma space yields better downstream results (52.9% mIU) than regression on color values (48.0% mIU).
- Increasing model complexity (AlexNet → VGG-16 → ResNet-152) yields larger gains with colorization pretraining, especially in small-sample regimes.
- Colorization features shift substantially during fine-tuning, indicating that the learned representations are not merely a good initialization but are repurposed for downstream tasks.
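The segmentation numbers above are reported as mIU (mean intersection over union). As a reminder of what that metric measures, here is a minimal sketch that computes it from flat label arrays. The function name `mean_iu` is hypothetical, and the sketch omits details of the official VOC evaluation, such as ignoring void-labeled pixels.

```python
import numpy as np

def mean_iu(pred, target, num_classes):
    """Mean intersection-over-union over classes present in pred or target."""
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    # Confusion matrix: rows index the true class, columns the predicted class.
    conf = np.bincount(
        target * num_classes + pred, minlength=num_classes ** 2
    ).reshape(num_classes, num_classes)
    intersection = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - intersection
    valid = union > 0  # skip classes absent from both arrays
    return float((intersection[valid] / union[valid]).mean())
```

For example, with predictions `[0, 0, 1, 1]` and ground truth `[0, 1, 1, 1]`, class 0 has IU 1/2 and class 1 has IU 2/3, so the mean is 7/12.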
This review was created by AI and reviewed by human editors.