[Paper Review] Localizing Objects with Self-Supervised Transformers and no Labels
LOST localizes objects within a single image using patch-level self-supervised transformer features, without any labels, achieving state-of-the-art CorLoc in unsupervised object discovery and enabling unsupervised class-agnostic and class-aware detectors.
Localizing objects in image collections without supervision can help to avoid expensive annotation campaigns. We propose a simple approach to this problem, that leverages the activation features of a vision transformer pre-trained in a self-supervised manner. Our method, LOST, does not require any external object proposal nor any exploration of the image collection; it operates on a single image. Yet, we outperform state-of-the-art object discovery methods by up to 8 CorLoc points on PASCAL VOC 2012. We also show that training a class-agnostic detector on the discovered objects boosts results by another 7 points. Moreover, we show promising results on the unsupervised object discovery task. The code to reproduce our results can be found at https://github.com/valeoai/LOST.
Motivation & Objective
- Motivate object localization in image collections without annotations to reduce labeling costs.
- Leverage patch-level correlations from a self-supervised vision transformer to localize objects within a single image.
- Show that seed-based localization can outperform region proposals and enable downstream unsupervised detection tasks.
- Demonstrate that pseudo-labels from LOST can train class-agnostic and class-aware detectors without supervision.
Proposed method
- Use a vision transformer pretrained with DINO to extract patch-based features from a single image.
- Construct a patch similarity graph using positive correlations between patch features and identify an initial seed as the patch with the lowest degree in this graph.
- Expand the initial seed into a seed set: among the lowest-degree patches, keep those that are positively correlated with the initial seed.
- Compute a binary object mask by correlating image patches with the seeds and extract the object bounding box from the connected component of the mask that contains the initial seed.
- Train a class-agnostic detector on LOST boxes to obtain multi-object detections per image.
- Cluster CLS tokens of the discovered objects to obtain pseudo-labels for unsupervised class-aware detection, and use Hungarian matching to map clusters to real classes for evaluation.
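The seed selection, seed expansion, and box extraction steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `lost_box`, the default `k=100`, the use of 4-connectivity, and the `patch_size` scaling are all choices made for this sketch, and the patch features are expected to come from a DINO-pretrained transformer that is not shown here.

```python
from collections import deque

import numpy as np


def lost_box(feats, grid_hw, k=100, patch_size=16):
    """Sketch of single-image localization from patch features.

    feats: (N, d) array of patch features, with N = H * W patches.
    grid_hw: (H, W) shape of the patch grid.
    k: how many lowest-degree patches to consider as seeds (assumed value).
    Returns (seed index, component mask, box in pixels as x0, y0, x1, y1).
    """
    h, w = grid_hw
    sim = feats @ feats.T                    # pairwise patch correlations
    degree = (sim >= 0).sum(axis=1)          # number of positive correlations
    seed = int(np.argmin(degree))            # initial seed: lowest degree

    # Seed expansion: lowest-degree patches positively correlated with the seed.
    # A stable sort keeps the initial seed first among ties.
    candidates = np.argsort(degree, kind="stable")[:k]
    seeds = candidates[sim[seed, candidates] >= 0]

    # Binary mask: patches positively correlated, in aggregate, with the seeds.
    mask = (sim[:, seeds].sum(axis=1) >= 0).reshape(h, w)

    # Flood fill (4-connectivity) to get the component containing the seed.
    comp = np.zeros((h, w), dtype=bool)
    start = divmod(seed, w)
    comp[start] = True
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < h and 0 <= nc < w and mask[nr, nc] and not comp[nr, nc]:
                comp[nr, nc] = True
                queue.append((nr, nc))

    # Tight box around the component, converted from patch to pixel units.
    rows, cols = np.nonzero(comp)
    box = (
        int(cols.min()) * patch_size,
        int(rows.min()) * patch_size,
        (int(cols.max()) + 1) * patch_size,
        (int(rows.max()) + 1) * patch_size,
    )
    return seed, comp, box
```

Building the full similarity matrix is quadratic in the number of patches of one image, but each image is handled on its own, so no comparison across images is ever needed.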
Experimental results
Research questions
- RQ1: Can self-supervised transformer activations localize objects within a single image without any annotations?
- RQ2: How do seed selection and seed expansion based on patch correlations impact localization quality?
- RQ3: Can LOST-derived boxes train effective class-agnostic detectors and improve unsupervised object detection when combined with clustering-based pseudo-labels?
Key findings
- LOST outperforms state-of-the-art unsupervised object discovery methods in CorLoc on VOC07, VOC12, and COCO_20k, with a margin of up to 8 points on VOC12.
- Training a class-agnostic detector on LOST boxes further improves CorLoc by 4-7 points across evaluated datasets.
- Unsupervised class-aware detection trained with LOST boxes and clustering achieves competitive AP@0.5 on VOC07, and scores higher than weakly supervised methods for several classes (e.g., aeroplane, bus, dog, horse, train, cat).
- Training a detector on LOST pseudo-boxes significantly improves AP over using the initial pseudo-boxes directly.
- Backbone choice matters; ViT-S/16 with DINO features yields the best performance among tested backbones.
- LOST processes each image independently, with no inter-image search, so its total cost grows linearly with the number of images, which makes it suitable for large datasets.
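The CorLoc numbers above can be read with the following sketch in mind. It assumes the usual definition of the metric: the percentage of images in which the single predicted box overlaps at least one ground-truth box with an IoU above 0.5. The function names and the `(x_min, y_min, x_max, y_max)` box layout are choices made for this illustration, not taken from the paper's code.

```python
def iou(a, b):
    """Intersection over union of two boxes (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def corloc(predictions, ground_truths, threshold=0.5):
    """Percentage of images whose predicted box matches a ground-truth box.

    predictions: one predicted box per image.
    ground_truths: one list of ground-truth boxes per image.
    """
    hits = sum(
        any(iou(pred, gt) > threshold for gt in gts)
        for pred, gts in zip(predictions, ground_truths)
    )
    return 100.0 * hits / len(predictions)
```

Because only one box per image is scored, the metric rewards finding any one object correctly and says nothing about the objects that were missed, which is why the detector-based results are reported separately with AP@0.5.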
This review was created by AI and reviewed by human editors.