QUICK REVIEW

[Paper Review] Automatic Data Curation for Self-Supervised Learning: A Clustering-Based Approach

Huy V. Vo, Vasil Khalidov|arXiv (Cornell University)|May 24, 2024

Data Mining Algorithms and Applications|Citations: 5

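One-Line Summary

This paper proposes a data curation pipeline based on hierarchical k-means for building large, diverse, and balanced pre-training datasets for self-supervised learning (SSL), and shows that it improves SSL representations across multiple domains.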

ABSTRACT

Self-supervised features are the cornerstone of modern machine learning systems. They are typically pre-trained on data collections whose construction and curation typically require extensive human effort. This manual process has some limitations similar to those encountered in supervised learning, e.g., the crowd-sourced selection of data is costly and time-consuming, preventing scaling the dataset size. In this work, we consider the problem of automatic curation of high-quality datasets for self-supervised pre-training. We posit that such datasets should be large, diverse and balanced, and propose a clustering-based approach for building ones satisfying all these criteria. Our method involves successive and hierarchical applications of $k$-means on a large and diverse data repository to obtain clusters that distribute uniformly among data concepts, followed by a hierarchical, balanced sampling step from these clusters. Extensive experiments on three different data domains including web-based images, satellite images and text show that features trained on our automatically curated datasets outperform those trained on uncurated data while being on par or better than ones trained on manually curated data. Code is available at https://github.com/facebookresearch/ssl-data-curation.

Motivation and Objectives

Define what makes a good pre-training dataset for self-supervised learning (size, diversity, balance).
Develop an automatic, task-agnostic data curation pipeline to achieve these properties without labels.
Show that curated datasets yield better SSL representations than uncurated data and are competitive with manually curated data.
Demonstrate the method’s effectiveness across multiple data domains (web images, text, satellite imagery).

Proposed Method

Propose a hierarchical k-means approach to partition a large data pool into clusters that distribute uniformly over concepts.
Introduce a resampling-clustering step to approximate sampling from a uniform distribution over the data support.
Use centroids from successive k-means levels to form a tree; sample leaves to build a balanced dataset (flat vs hierarchical sampling).
Demonstrate that centroids of higher-level clusters approximate a uniform distribution over the data support (Lemma 1 and related discussion).
Provide sampling strategies (random, closest, furthest) and discuss cluster counts per level to balance concepts and sub-concepts.
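
The clustering steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation (that lives in the repository linked in the abstract): it uses plain Lloyd's k-means on toy 2-D points, omits the resampling-clustering step, and the names `kmeans`, `hierarchical_kmeans`, `ks`, and the toy data are all hypothetical choices made for this example.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(p, centroids):
    """Index of the centroid closest to point p."""
    return min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))

def kmeans(points, k, iters=10, seed=0):
    """Plain Lloyd's k-means; returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k data points
    for _ in range(iters):
        assign = [nearest(p, centroids) for p in points]
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, [nearest(p, centroids) for p in points]

def hierarchical_kmeans(points, ks, seed=0):
    """Level 1 clusters the points; each later level clusters the
    centroids produced by the level below it (assumed simplification)."""
    levels, current = [], points
    for level, k in enumerate(ks):
        centroids, assign = kmeans(current, k, seed=seed + level)
        levels.append({"centroids": centroids, "assign": assign})
        current = centroids
    return levels

# Toy, deliberately imbalanced pool: one dense blob and one sparse blob.
rng = random.Random(0)
dense = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(450)]
sparse = [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(50)]
points = dense + sparse

levels = hierarchical_kmeans(points, ks=[10, 2])
leaf_of_point = levels[0]["assign"]   # leaf cluster index of every point
top_of_leaf = levels[1]["assign"]     # top-level cluster index of every leaf
top_of_point = [top_of_leaf[leaf] for leaf in leaf_of_point]
```

Following `leaf_of_point` and then `top_of_leaf` walks a point up the tree, which is the structure a balanced sampler would later traverse from the top down.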

Experimental Results

Research Questions

RQ1: Can hierarchical k-means with resampling produce data subsets that are close to uniform over the data support?
RQ2: Does automatically curated data improve self-supervised representations across images and text compared to uncurated data?
RQ3: Is the curated data competitive with manually curated datasets across multiple domains?
RQ4: How do sampling strategies and cluster counts affect the balance and quality of the curated dataset?

Key Findings

Features trained on the automatically curated datasets outperform those trained on raw, uncurated data across benchmarks.
Features trained on curated data are on par with or better than those trained on manually curated datasets.
The method improves robustness, out-of-distribution generalization, and long-tail performance in SSL.
The approach is demonstrated across web-based images, text, and satellite imagery domains.
Hierarchical sampling ensures balance among high-level concepts and sub-concepts in the curated data.
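
To make the balancing effect of hierarchical sampling concrete, here is a small sketch that assumes a two-level cluster tree has already been built. The concept names are purely illustrative (real clusters carry no labels), `hierarchical_sample` is a hypothetical helper, and two simplifications are assumed: the budget is split evenly at each level, and a leaf smaller than its quota simply contributes everything it has, with no redistribution of the leftover budget.

```python
import random
from collections import Counter

def hierarchical_sample(tree, budget, seed=0):
    """Split the budget evenly over top-level clusters, then evenly over
    each cluster's leaves, and draw uniformly at random inside each leaf."""
    rng = random.Random(seed)
    picked = []
    per_top = budget // len(tree)
    for top, leaves in tree.items():
        per_leaf = per_top // len(leaves)
        for leaf, items in leaves.items():
            n = min(per_leaf, len(items))  # small leaves give all they have
            picked.extend((top, leaf, x) for x in rng.sample(items, n))
    return picked

# Toy tree: item ids grouped into leaves, leaves grouped into top clusters.
# The pool is heavily skewed toward a single leaf.
tree = {
    "animals": {"dogs": list(range(900)), "birds": list(range(900, 950))},
    "vehicles": {"cars": list(range(950, 990)), "boats": list(range(990, 1000))},
}

sample = hierarchical_sample(tree, budget=40)
top_counts = Counter(top for top, _, _ in sample)
leaf_counts = Counter(leaf for _, leaf, _ in sample)
print(top_counts)
print(leaf_counts)
```

In this toy pool 90% of the items sit in one leaf, yet the sample gives each top-level cluster 20 items and each leaf 10. Drawing 40 items uniformly from the whole pool would instead be dominated by that one leaf.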


This review was generated by AI and checked by a human editor.