QUICK REVIEW

[Paper Review] Localizing Objects with Self-Supervised Transformers and no Labels

Oriane Siméoni, Gilles Puy|arXiv (Cornell University)|Sep 29, 2021

Advanced Neural Network Applications|77 references|107 citations

TL;DR

LOST localizes objects within a single image using patch-level self-supervised transformer features, without any labels, achieving state-of-the-art CorLoc in unsupervised object discovery and enabling unsupervised class-agnostic and class-aware detectors.

ABSTRACT

Localizing objects in image collections without supervision can help to avoid expensive annotation campaigns. We propose a simple approach to this problem, that leverages the activation features of a vision transformer pre-trained in a self-supervised manner. Our method, LOST, does not require any external object proposal nor any exploration of the image collection; it operates on a single image. Yet, we outperform state-of-the-art object discovery methods by up to 8 CorLoc points on PASCAL VOC 2012. We also show that training a class-agnostic detector on the discovered objects boosts results by another 7 points. Moreover, we show promising results on the unsupervised object detection task. The code to reproduce our results can be found at https://github.com/valeoai/LOST.

Motivation & Objective

Reduce labeling costs by localizing objects in image collections without any annotations.
Leverage patch-level correlations from a self-supervised vision transformer to localize objects within a single image.
Show that seed-based localization can outperform region proposals and enable downstream unsupervised detection tasks.
Demonstrate that pseudo-labels from LOST can train class-agnostic and class-aware detectors without supervision.

Proposed method

Use a vision transformer pretrained with DINO to extract patch-based features from a single image.
Construct a patch similarity graph using positive correlations between patch features and identify an initial seed as the patch with the lowest degree in this graph.
Expand the seed by adding the patches that are positively correlated with it and that belong to the set of k lowest-degree patches.
Compute a binary object mask by correlating every image patch with the set of seeds, then take the bounding box of the connected component of the mask that contains the initial seed.
Train a class-agnostic detector on LOST boxes to obtain multi-object detections per image.
Cluster CLS tokens of the discovered objects to obtain pseudo-labels for unsupervised class-aware detection, and use Hungarian matching to map clusters to real classes for evaluation.
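
The single-image localization steps above (similarity graph, lowest-degree seed, seed expansion, mask, box) can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the function name `lost_box`, the parameters `k` and `patch_size`, the zero threshold for "positive" correlation, the 4-connectivity, and the toy two-dimensional features are all assumptions made here. A real pipeline would use DINO patch features and a tensor library instead of plain lists.

```python
from collections import deque


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def lost_box(feats, grid_w, k=4, patch_size=16):
    """feats: patch feature vectors in row-major order; grid_w: patches per row.
    k and patch_size are illustrative defaults, not values taken from the review."""
    n = len(feats)
    grid_h = n // grid_w
    # Patch similarity graph: an edge links two patches whose features correlate positively.
    sim = [[dot(feats[p], feats[q]) for q in range(n)] for p in range(n)]
    degree = [sum(1 for s in row if s >= 0) for row in sim]
    # Initial seed: the patch with the lowest degree.
    seed = min(range(n), key=lambda p: degree[p])
    # Seed expansion: among the k lowest-degree patches, keep those correlated with the seed.
    lowest = sorted(range(n), key=lambda p: degree[p])[:k]
    seeds = [q for q in lowest if sim[seed][q] >= 0]
    # Binary mask: patches whose summed similarity to the seed set is non-negative.
    mask = [sum(sim[q][s] for s in seeds) >= 0 for q in range(n)]
    # Connected component of the mask containing the seed (4-connectivity assumed).
    component, queue = {seed}, deque([seed])
    while queue:
        r, c = divmod(queue.popleft(), grid_w)
        for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            q = rr * grid_w + cc
            if 0 <= rr < grid_h and 0 <= cc < grid_w and mask[q] and q not in component:
                component.add(q)
                queue.append(q)
    rows = [p // grid_w for p in component]
    cols = [p % grid_w for p in component]
    # Box in pixels as (x0, y0, x1, y1), with x1 and y1 exclusive.
    return (min(cols) * patch_size, min(rows) * patch_size,
            (max(cols) + 1) * patch_size, (max(rows) + 1) * patch_size)


# Toy 4x4 grid: a 2x2 "object" whose features oppose the background features.
bg, obj = (1.0, 0.0), (-1.0, 0.0)
feats = [obj if (r in (1, 2) and c in (1, 2)) else bg
         for r in range(4) for c in range(4)]
box = lost_box(feats, grid_w=4)
print(box)  # (16, 16, 48, 48)
```

In the toy grid the object patches correlate with only 4 patches while background patches correlate with 12, so the seed falls on the object. This mirrors the intuition that object patches have a lower degree than background patches.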

Experimental results

Research questions

RQ1: Can self-supervised transformer activations localize objects within a single image without any annotations?
RQ2: How do seed selection and seed expansion based on patch correlations impact localization quality?
RQ3: Can LOST-derived boxes train effective class-agnostic detectors and improve unsupervised object detection when combined with clustering-based pseudo-labels?

Key findings

LOST outperforms state-of-the-art unsupervised object discovery methods in CorLoc on VOC07, VOC12, and COCO_20k (by up to 8 CorLoc points on VOC12, per the abstract).
Training a class-agnostic detector on LOST boxes further improves CorLoc by 4-7 points across evaluated datasets.
Unsupervised class-aware detection trained with LOST boxes and clustering achieves competitive AP@0.5 on VOC07, and for several classes (e.g., aeroplane, bus, dog, horse, train, cat) it scores higher than weakly-supervised methods.
Training a detector on LOST pseudo-boxes significantly improves AP over the initial pseudo-boxes themselves.
Backbone choice matters; ViT-S/16 with DINO features yields the best performance among tested backbones.
LOST enables scalable, linear-complexity localization per image without inter-image search, suitable for large datasets.
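
The CorLoc numbers quoted above follow the usual convention for this task: an image counts as correctly localized when the predicted box overlaps a ground-truth box with an intersection-over-union above 0.5, and CorLoc is the percentage of such images. The sketch below illustrates that computation. The function names, the `(x0, y0, x1, y1)` box format, and the example boxes are assumptions made for illustration and are not taken from the paper's evaluation code.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x0, y0, x1, y1)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def corloc(predictions, ground_truths, threshold=0.5):
    """predictions: one box per image; ground_truths: a list of boxes per image.
    Returns the percentage of images whose predicted box matches any ground-truth box."""
    hits = sum(
        any(iou(pred, gt) > threshold for gt in gts)
        for pred, gts in zip(predictions, ground_truths)
    )
    return 100.0 * hits / len(predictions)


# Hypothetical example: the first image is localized correctly, the second is not.
preds = [(16, 16, 48, 48), (0, 0, 10, 10)]
gts = [[(16, 16, 48, 56)], [(20, 20, 40, 40)]]
score = corloc(preds, gts)
print(score)  # 50.0
```

Because only one box per image is scored, CorLoc rewards finding any one object well. It does not measure how many objects in the image are found.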

This review was created by AI and reviewed by human editors.