QUICK REVIEW

[Paper Review] Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture

Mahmoud Assran, Quentin Duval | arXiv (Cornell University) | Jan 19, 2023

Domain Adaptation and Few-Shot Learning | Citations: 16

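One-line summary

I-JEPA learns semantic image representations without hand-crafted augmentations. It is a masking-based joint-embedding predictive framework that predicts the representations of target blocks from a context block. It scales efficiently on ViT backbones, matches view-invariance methods on semantic tasks, and outperforms them on low-level tasks.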

ABSTRACT

This paper demonstrates an approach for learning highly semantic image representations without relying on hand-crafted data-augmentations. We introduce the Image-based Joint-Embedding Predictive Architecture (I-JEPA), a non-generative approach for self-supervised learning from images. The idea behind I-JEPA is simple: from a single context block, predict the representations of various target blocks in the same image. A core design choice to guide I-JEPA towards producing semantic representations is the masking strategy; specifically, it is crucial to (a) sample target blocks with sufficiently large scale (semantic), and to (b) use a sufficiently informative (spatially distributed) context block. Empirically, when combined with Vision Transformers, we find I-JEPA to be highly scalable. For instance, we train a ViT-Huge/14 on ImageNet using 16 A100 GPUs in under 72 hours to achieve strong downstream performance across a wide range of tasks, from linear classification to object counting and depth prediction.

Motivation and objectives

Motivate learning semantic image representations without hand-crafted view augmentations.
Propose a non-generative, joint-embedding predictive architecture (I-JEPA) for images.
Investigate masking strategies that yield semantic targets and informative contexts.
Demonstrate scalability and efficiency of I-JEPA on large Vision Transformers.
Evaluate I-JEPA across linear probing, semi-supervised, and transfer tasks.

Proposed method

Use a ViT context encoder to process a single context block.
Predict the representation of each target block with a predictor that takes the context encoder output and is conditioned on positional tokens for the target locations.
Represent targets via a target encoder whose weights are updated as an exponential moving average of the context encoder.
Train by minimizing L2 distance between predicted and actual target representations in embedding space.
Sample target blocks from the image with a multi-block masking strategy to ensure semantic targets and informative context.
Compare I-JEPA to MAE, data2vec, and view-invariance methods under various settings (linear probing, 1% labels, transfer).
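
The training objective and the target-encoder update listed above can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' implementation: the function names `l2_loss` and `ema_update`, the use of the squared L2 distance averaged over patches, and the momentum value in the usage lines are all assumptions made for the example. Encoder weights are stood in for by flat lists of floats.

```python
from typing import List

Vector = List[float]


def l2_loss(predicted: List[Vector], target: List[Vector]) -> float:
    """Average squared L2 distance between predicted and target patch embeddings.

    Assumption: the squared form, averaged over patches, is used here for illustration.
    """
    if len(predicted) != len(target):
        raise ValueError("predicted and target must contain the same number of patches")
    total = 0.0
    for p, t in zip(predicted, target):
        total += sum((pi - ti) ** 2 for pi, ti in zip(p, t))
    return total / len(predicted)


def ema_update(target_weights: Vector, context_weights: Vector, momentum: float) -> Vector:
    """Return new target-encoder weights as an exponential moving average
    of the context-encoder weights (hypothetical flat-list representation)."""
    return [
        momentum * tw + (1.0 - momentum) * cw
        for tw, cw in zip(target_weights, context_weights)
    ]


# Usage with toy values: two patches, two embedding dimensions.
pred = [[1.0, 2.0], [0.0, 0.0]]
tgt = [[1.0, 0.0], [0.0, 1.0]]
loss = l2_loss(pred, tgt)  # (4 + 1) / 2 == 2.5

# The target encoder moves slowly toward the context encoder.
new_target = ema_update([1.0, 0.0], [0.0, 1.0], momentum=0.9)
```

In this sketch the loss is computed entirely in embedding space, so no pixel values are ever reconstructed, and the target weights receive no gradient: they change only through `ema_update`.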

Experimental results

Research questions

RQ1: Can semantic image representations be learned without hand-crafted augmentations by predicting embeddings across image blocks?
RQ2: What masking strategy (target size, context informativeness) yields the most semantic representations?
RQ3: How does I-JEPA scale in compute and model size compared to reconstruction and augmentation-based methods?
RQ4: Do the learned representations transfer effectively to classification and dense/low-level prediction tasks?
RQ5: Is predicting in representation space more effective than pixel-space reconstruction for semantic quality?

Key findings

I-JEPA achieves strong linear-probing performance on ImageNet without view augmentations and can surpass MAE and data2vec under similar compute.
With larger models and higher input resolution, I-JEPA matches or exceeds view-invariance methods on semantic tasks.
I-JEPA improves low-level tasks (object counting and depth prediction) on Clevr compared with some view-based methods.
I-JEPA is more compute-efficient than competing methods, requiring fewer pretraining iterations to reach strong performance, especially with ViT-H/14 and resolution boosts.
Predicting in representation space (not pixel space) is crucial for maintaining semantic quality; pixel-space targets degrade performance.
A multi-block masking strategy that combines informative context with large semantic targets yields better representations than rasterized or single-block masking.
I-JEPA benefits from larger and more diverse pretraining data (ImageNet-22k), and increasing model size further improves results on semantic tasks.
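
The multi-block masking strategy named in the findings above can be illustrated with a small sampler over a grid of patches. This is a hedged sketch under stated assumptions, not the authors' code: the helper names `sample_block` and `multi_block_mask`, the 16 x 16 grid, the number of target blocks, and the scale and aspect-ratio ranges are illustrative defaults chosen for the example and are not taken from this review. The sketch samples several fairly large target blocks, samples one large context block, and then removes every target patch from the context so the prediction task is not trivial.

```python
import math
import random
from typing import List, Set, Tuple

Patch = Tuple[int, int]  # (row, col) index in the patch grid


def sample_block(grid: int, scale: Tuple[float, float],
                 aspect: Tuple[float, float], rng: random.Random) -> Set[Patch]:
    """Sample one rectangular block covering roughly a `scale` fraction of the grid."""
    area = rng.uniform(*scale) * grid * grid
    ratio = rng.uniform(*aspect)  # height / width
    h = max(1, min(grid, round(math.sqrt(area * ratio))))
    w = max(1, min(grid, round(math.sqrt(area / ratio))))
    top = rng.randint(0, grid - h)   # randint is inclusive on both ends
    left = rng.randint(0, grid - w)
    return {(r, c) for r in range(top, top + h) for c in range(left, left + w)}


def multi_block_mask(grid: int = 16,
                     num_targets: int = 4,
                     target_scale: Tuple[float, float] = (0.15, 0.2),
                     target_aspect: Tuple[float, float] = (0.75, 1.5),
                     context_scale: Tuple[float, float] = (0.85, 1.0),
                     seed: int = 0) -> Tuple[Set[Patch], List[Set[Patch]]]:
    """Return (context patches, list of target blocks). All defaults are assumptions."""
    rng = random.Random(seed)
    targets = [sample_block(grid, target_scale, target_aspect, rng)
               for _ in range(num_targets)]
    # One large, square context block (aspect ratio fixed at 1).
    context = sample_block(grid, context_scale, (1.0, 1.0), rng)
    # Drop patches that overlap any target so the context cannot see them.
    for block in targets:
        context -= block
    return context, targets


context, targets = multi_block_mask(seed=0)
print(len(context), [len(t) for t in targets])
```

Target blocks are allowed to overlap one another in this sketch; only the context is constrained to be disjoint from them.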

This review was generated by AI and checked by a human editor.