QUICK REVIEW

[Paper Review] Localizing Objects with Self-Supervised Transformers and no Labels

Oriane Siméoni, Gilles Puy | arXiv (Cornell University) | Sep 29, 2021
Advanced Neural Network Applications | 77 references | 107 citations
TL;DR

LOST localizes objects within a single image using patch-level self-supervised transformer features, without any labels, achieving state-of-the-art CorLoc in unsupervised object discovery and enabling unsupervised class-agnostic and class-aware detectors.

ABSTRACT

Localizing objects in image collections without supervision can help to avoid expensive annotation campaigns. We propose a simple approach to this problem, that leverages the activation features of a vision transformer pre-trained in a self-supervised manner. Our method, LOST, does not require any external object proposal nor any exploration of the image collection; it operates on a single image. Yet, we outperform state-of-the-art object discovery methods by up to 8 CorLoc points on PASCAL VOC 2012. We also show that training a class-agnostic detector on the discovered objects boosts results by another 7 points. Moreover, we show promising results on the unsupervised object detection task. The code to reproduce our results can be found at https://github.com/valeoai/LOST.

Motivation & Objective

  • Motivate object localization in image collections without annotations to reduce labeling costs.
  • Leverage patch-level correlations from a self-supervised vision transformer to localize objects within a single image.
  • Show that seed-based localization can outperform region proposals and enable downstream unsupervised detection tasks.
  • Demonstrate that pseudo-labels from LOST can train class-agnostic and class-aware detectors without supervision.

Proposed method

  • Use a vision transformer pretrained with DINO to extract patch-based features from a single image.
  • Construct a patch similarity graph using positive correlations between patch features and identify an initial seed as the patch with the lowest degree in this graph.
  • Expand the seed by adding patches that are positively correlated with it and belong to the set of lowest-degree patches.
  • Compute a binary object mask by correlating every image patch with the selected seeds, then take the bounding box of the connected component of the mask that contains the initial seed.
  • Train a class-agnostic detector on LOST boxes to obtain multi-object detections per image.
  • Cluster CLS tokens of the discovered objects to obtain pseudo-labels for unsupervised class-aware detection, and use Hungarian matching to map clusters to real classes for evaluation.
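
The localization steps above can be sketched in plain Python on a toy patch grid. This is a minimal illustration, not the authors' implementation (see the repository linked in the abstract). The function name `lost_box`, the parameter `k` (size of the lowest-degree set), the `>= 0` test used for "positive correlation", and the 4-connectivity used for the connected component are assumptions made for this example. Real patch features would come from a DINO-pretrained vision transformer rather than hand-written vectors.

```python
from collections import deque


def dot(a, b):
    """Inner product of two feature vectors given as lists of floats."""
    return sum(x * y for x, y in zip(a, b))


def lost_box(feats, grid_h, grid_w, k):
    """Return (row_min, col_min, row_max, col_max) in patch coordinates.

    feats: grid_h * grid_w feature vectors, one per patch, in row-major order.
    k: size of the lowest-degree set used for seed expansion (assumed name).
    """
    n = len(feats)
    # Patch similarity graph: two patches are linked when their correlation is >= 0.
    sim = [[dot(feats[p], feats[q]) for q in range(n)] for p in range(n)]
    degree = [sum(1 for q in range(n) if sim[p][q] >= 0) for p in range(n)]

    # Initial seed: the patch with the lowest degree.
    seed = min(range(n), key=lambda p: degree[p])

    # Seed expansion: lowest-degree patches that correlate positively with the seed.
    lowest = sorted(range(n), key=lambda p: degree[p])[:k]
    seeds = [p for p in lowest if sim[seed][p] >= 0]

    # Binary mask: patches whose summed correlation with the seeds is non-negative.
    mask = [sum(sim[q][s] for s in seeds) >= 0 for q in range(n)]

    # Connected component of the mask that contains the initial seed (BFS).
    seen = {seed}
    queue = deque([seed])
    while queue:
        r, c = divmod(queue.popleft(), grid_w)
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            q = nr * grid_w + nc
            if 0 <= nr < grid_h and 0 <= nc < grid_w and mask[q] and q not in seen:
                seen.add(q)
                queue.append(q)

    rows = [p // grid_w for p in seen]
    cols = [p % grid_w for p in seen]
    return min(rows), min(cols), max(rows), max(cols)


# Toy 4x4 grid: a 2x2 "object" in the middle plus one stray object-like patch.
OBJ, BG = [-1.0, 0.2], [1.0, 0.1]
toy = [OBJ if (1 <= i // 4 <= 2 and 1 <= i % 4 <= 2) else BG for i in range(16)]
toy[15] = OBJ  # stray patch in the corner, not adjacent to the object
print(lost_box(toy, 4, 4, k=5))  # prints (1, 1, 2, 2)
```

In the toy example the stray corner patch ends up in the mask but is dropped by the connected-component step, so the box covers only the central object. The box is in patch units; multiplying by the patch size (16 pixels for ViT-S/16) would give pixel coordinates.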

Experimental results

Research questions

  • RQ1: Can self-supervised transformer activations localize objects within a single image without any annotations?
  • RQ2: How do seed selection and seed expansion based on patch correlations affect localization quality?
  • RQ3: Can LOST-derived boxes train effective class-agnostic detectors and improve unsupervised object detection when combined with clustering-based pseudo-labels?

Key findings

  • LOST outperforms state-of-the-art unsupervised object discovery methods in CorLoc on VOC07, VOC12, and COCO_20k by substantial margins.
  • Training a class-agnostic detector on LOST boxes further improves CorLoc by 4-7 points across evaluated datasets.
  • Unsupervised class-aware detection trained with LOST boxes and clustering achieves competitive AP@0.5 on VOC07 and outperforms weakly supervised methods on several classes (e.g., aeroplane, bus, dog, horse, train, cat).
  • Training a detector on LOST pseudo-boxes significantly improves AP over using the pseudo-boxes directly as detections.
  • Backbone choice matters; ViT-S/16 with DINO features yields the best performance among tested backbones.
  • LOST processes each image independently, with no inter-image search, so its cost scales linearly with the number of images and suits large datasets.
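
CorLoc, the metric behind several of these findings, is commonly defined as the percentage of images in which the predicted box overlaps at least one ground-truth box with an intersection over union (IoU) above 0.5. The sketch below assumes one predicted box per image, boxes given as (x_min, y_min, x_max, y_max), and a strict inequality at the threshold. The function names `iou` and `corloc` are illustrative and not taken from the paper's code.

```python
def iou(a, b):
    """Intersection over union of two boxes (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def corloc(predictions, ground_truths, threshold=0.5):
    """Percentage of images whose predicted box matches any ground-truth box.

    predictions: one box per image.
    ground_truths: one list of boxes per image, in the same order.
    """
    hits = sum(
        1
        for pred, gts in zip(predictions, ground_truths)
        if any(iou(pred, gt) > threshold for gt in gts)
    )
    return 100.0 * hits / len(predictions)


# Two images: the first prediction matches its ground truth, the second misses.
preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
gts = [[(1, 1, 10, 10)], [(20, 20, 30, 30)]]
print(corloc(preds, gts))  # prints 50.0
```

Because only one box per image is scored, CorLoc measures whether an object was found rather than whether all objects were found, which is why the class-agnostic detector results are reported separately with AP.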

This review was created by AI and reviewed by human editors.