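[Paper Review] Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture
I-JEPA learns semantic image representations without hand-crafted augmentations, using a masking-based joint-embedding predictive framework that predicts the representations of target blocks from a context block. It scales efficiently on ViT backbones, matches view-invariance methods on semantic tasks, and outperforms them on low-level tasks.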
This paper demonstrates an approach for learning highly semantic image representations without relying on hand-crafted data-augmentations. We introduce the Image-based Joint-Embedding Predictive Architecture (I-JEPA), a non-generative approach for self-supervised learning from images. The idea behind I-JEPA is simple: from a single context block, predict the representations of various target blocks in the same image. A core design choice to guide I-JEPA towards producing semantic representations is the masking strategy; specifically, it is crucial to (a) sample target blocks with sufficiently large scale (semantic), and to (b) use a sufficiently informative (spatially distributed) context block. Empirically, when combined with Vision Transformers, we find I-JEPA to be highly scalable. For instance, we train a ViT-Huge/14 on ImageNet using 16 A100 GPUs in under 72 hours to achieve strong downstream performance across a wide range of tasks, from linear classification to object counting and depth prediction.
Motivation and Objectives
- Motivate learning semantic image representations without hand-crafted view augmentations.
- Propose a non-generative, joint-embedding predictive architecture (I-JEPA) for images.
- Investigate masking strategies that yield semantic targets and informative contexts.
- Demonstrate scalability and efficiency of I-JEPA on large Vision Transformers.
- Evaluate I-JEPA across linear probing, semi-supervised, and transfer tasks.
Proposed Method
- Use a ViT context encoder to process a single context block.
- Predict the target block representations with a predictor conditioned on positional tokens.
- Represent targets via a target encoder whose weights are updated as an exponential moving average of the context encoder.
- Train by minimizing L2 distance between predicted and actual target representations in embedding space.
- Sample target blocks from the image with a multi-block masking strategy to ensure semantic targets and informative context.
- Compare I-JEPA to MAE, data2vec, and view-invariance methods under various settings (linear probing, 1% labels, transfer).
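The training objective and the target-encoder update listed above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function names `l2_loss` and `ema_update`, the use of plain Python lists in place of tensors, the choice of the squared L2 distance averaged over patches, and the `momentum` parameter are all assumptions made for illustration.

```python
from typing import List

Vector = List[float]


def l2_loss(predicted: List[Vector], target: List[Vector]) -> float:
    """Average squared L2 distance between predicted and target patch embeddings.

    Assumption: the distance is squared and averaged over patches.
    """
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same number of patches")
    total = 0.0
    for p, t in zip(predicted, target):
        # Sum of squared differences for one patch embedding.
        total += sum((pi - ti) ** 2 for pi, ti in zip(p, t))
    return total / len(predicted)


def ema_update(target_weights: Vector, context_weights: Vector,
               momentum: float) -> Vector:
    """Move the target-encoder weights toward the context-encoder weights.

    Each weight becomes momentum * target + (1 - momentum) * context.
    No gradient flows into the target encoder; it only tracks the context encoder.
    """
    return [momentum * t + (1.0 - momentum) * c
            for t, c in zip(target_weights, context_weights)]
```

For example, `l2_loss([[1.0, 2.0], [0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])` averages the per-patch squared distances 4 and 1 and gives 2.5. In a real training loop the loss would be backpropagated through the predictor and context encoder only, and `ema_update` would be applied to the target encoder after each optimizer step.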
Experimental Results
Research Questions
- RQ1: Can semantic image representations be learned without hand-crafted augmentations by predicting embeddings across image blocks?
- RQ2: What masking strategy (target size, context informativeness) yields the most semantic representations?
- RQ3: How does I-JEPA scale in compute and model size compared to reconstruction and augmentation-based methods?
- RQ4: Do the learned representations transfer effectively to classification and dense/low-level prediction tasks?
- RQ5: Is predicting in representation space more effective than pixel-space reconstruction for semantic quality?
Key Findings
- I-JEPA achieves strong linear-probing performance on ImageNet without view augmentations and can surpass MAE and data2vec under similar compute.
- With larger models and higher input resolution, I-JEPA matches or exceeds view-invariance methods on semantic tasks.
- I-JEPA improves performance on low-level Clevr tasks (object counting and depth prediction) compared with some view-invariance methods.
- I-JEPA is more compute-efficient than competing methods, requiring fewer pretraining iterations to reach strong performance, especially with ViT-H/14 and resolution boosts.
- Predicting in representation space (not pixel space) is crucial for maintaining semantic quality; pixel-space targets degrade performance.
- A multi-block masking strategy that combines informative context with large semantic targets yields better representations than rasterized or single-block masking.
- I-JEPA benefits from larger and more diverse pretraining data (ImageNet-22k) and scales better with model size for semantic tasks.
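The multi-block masking strategy referred to in the findings can be sketched as follows. This is a hedged illustration rather than the paper's code: the names `sample_block` and `sample_masks`, the 14 x 14 patch grid, the number of target blocks, and the scale and aspect-ratio ranges are assumptions chosen so that targets are fairly large and the context is spatially spread out. The one design point taken from the review is that the context should be informative while the targets are large; removing target patches from the context is an assumption about how to keep the prediction task non-trivial.

```python
import math
import random
from typing import List, Set, Tuple

Patch = Tuple[int, int]  # (row, col) index in the patch grid


def sample_block(grid: int, scale: Tuple[float, float],
                 aspect: Tuple[float, float], rng: random.Random) -> Set[Patch]:
    """Sample one rectangular block of patches inside a grid x grid layout."""
    area = rng.uniform(*scale) * grid * grid   # block area in patches
    ratio = rng.uniform(*aspect)               # height / width
    h = max(1, min(grid, round(math.sqrt(area * ratio))))
    w = max(1, min(grid, round(math.sqrt(area / ratio))))
    top = rng.randint(0, grid - h)
    left = rng.randint(0, grid - w)
    return {(r, c) for r in range(top, top + h) for c in range(left, left + w)}


def sample_masks(grid: int = 14, num_targets: int = 4,
                 seed: int = 0) -> Tuple[Set[Patch], List[Set[Patch]]]:
    """Return (context patches, list of target blocks) for one image."""
    rng = random.Random(seed)
    # Assumed ranges: each target covers roughly 15-20% of the image.
    targets = [sample_block(grid, (0.15, 0.2), (0.75, 1.5), rng)
               for _ in range(num_targets)]
    # Assumed range: one large square context block covering 85-100% of the image.
    context = sample_block(grid, (0.85, 1.0), (1.0, 1.0), rng)
    # Drop every target patch from the context so targets are never visible.
    for block in targets:
        context -= block
    return context, targets
```

Calling `sample_masks(seed=0)` returns one context set and four target sets of `(row, col)` patch indices. Target blocks may overlap each other, but none of their patches remain in the context.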
This review was generated by AI and checked by a human editor.