QUICK REVIEW

[Paper Review] A Lightweight Library for Energy-Based Joint-Embedding Predictive Architectures

Basile Terver, Randall Balestriero|arXiv (Cornell University)|Feb 3, 2026

Generative Adversarial Networks and Image Synthesis | Cited by 0

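One-line summary

EB-JEPA is an open-source library that implements JEPA-based models for image representation learning, video prediction, and action-conditioned world modeling. Each example trains on a single GPU, and the library comes with comprehensive ablations and tutorials.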

ABSTRACT

We present EB-JEPA, an open-source library for learning representations and world models using Joint-Embedding Predictive Architectures (JEPAs). JEPAs learn to predict in representation space rather than pixel space, avoiding the pitfalls of generative modeling while capturing semantically meaningful features suitable for downstream tasks. Our library provides modular, self-contained implementations that illustrate how representation learning techniques developed for image-level self-supervised learning can transfer to video, where temporal dynamics add complexity, and ultimately to action-conditioned world models, where the model must additionally learn to predict the effects of control inputs. Each example is designed for single-GPU training within a few hours, making energy-based self-supervised learning accessible for research and education. We provide ablations of JEA components on CIFAR-10. Probing these representations yields 91% accuracy, indicating that the model learns useful features. Extending to video, we include a multi-step prediction example on Moving MNIST that demonstrates how the same principles scale to temporal modeling. Finally, we show how these representations can drive action-conditioned world models, achieving a 97% planning success rate on the Two Rooms navigation task. Comprehensive ablations reveal the critical importance of each regularization component for preventing representation collapse. Code is available at https://github.com/facebookresearch/eb_jepa.

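Motivation and objectives

Provide accessible, modular JEPA implementations for image representation learning, video prediction, and action-conditioned planning.
Show that regularized JEPA training prevents representation collapse and yields useful representations.
Provide comprehensive ablations and a practical hyperparameter guide for small-scale and educational use.
Support rapid experimentation with JEPA principles, and a clearer understanding of them, through clear documentation.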

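Proposed method

Define a unified JEPA energy objective that combines a prediction loss with a regularizer that prevents collapse.
Instantiate three settings: Image-JEPA (view-invariant representations), Video-JEPA (temporal prediction in latent space), and AC-Video-JEPA (action-conditioned world modeling).
Train in an energy-based fashion, with the regularizer (VICReg or SIGReg) applied through a learned projector on the representation space.
Incorporate a multi-step rollout loss so that training better matches autoregressive inference.
Extend the action-conditioned model with additional regularizers (temporal consistency, inverse dynamics) and a planning objective optimized with MPPI or CEM.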
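The regularized energy objective described in this section can be illustrated with a small, dependency-free sketch. It assumes a VICReg-style regularizer (a hinge on the per-dimension standard deviation plus an off-diagonal covariance penalty) added to a mean-squared prediction loss. The function names, weights, and toy batches are hypothetical and are not taken from the EB-JEPA code base.

```python
import math
import random


def mse(a, b):
    """Mean squared error between two batches of vectors (lists of lists)."""
    n = len(a) * len(a[0])
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / n


def variance_term(z, target_std=1.0, eps=1e-4):
    """Hinge that penalizes embedding dimensions whose std falls below target_std."""
    n, d = len(z), len(z[0])
    total = 0.0
    for j in range(d):
        col = [row[j] for row in z]
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / (n - 1)
        total += max(0.0, target_std - math.sqrt(var + eps))
    return total / d


def covariance_term(z):
    """Sum of squared off-diagonal covariance entries, divided by the dimension."""
    n, d = len(z), len(z[0])
    means = [sum(row[j] for row in z) / n for j in range(d)]
    total = 0.0
    for i in range(d):
        for j in range(d):
            if i != j:
                cov = sum((row[i] - means[i]) * (row[j] - means[j]) for row in z) / (n - 1)
                total += cov ** 2
    return total / d


def jepa_energy(pred, target, proj, w_var=1.0, w_cov=1.0):
    """Prediction loss plus anti-collapse regularization on projector outputs.

    The weights w_var and w_cov are illustrative assumptions, not tuned values.
    """
    return mse(pred, target) + w_var * variance_term(proj) + w_cov * covariance_term(proj)


rng = random.Random(0)
# A collapsed batch: every sample maps to the same embedding.
collapsed = [[0.0] * 4 for _ in range(256)]
# A spread-out batch: independent Gaussian embeddings with std 1.5.
spread = [[rng.gauss(0.0, 1.5) for _ in range(4)] for _ in range(256)]

print("collapsed energy:", jepa_energy(collapsed, collapsed, collapsed))
print("spread energy:   ", jepa_energy(spread, spread, spread))
```

In both toy batches the prediction error is zero, so the prediction loss alone cannot tell them apart. Only the regularizer assigns the collapsed batch a higher energy, which is the role the review attributes to it.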

Figure 1: EB-JEPA is a modular code base and tutorial, providing self-contained implementations of Joint-Embedding Predictive Architecture for (a) self-supervised image representation learning (b) video prediction in latent space, and (c) action-conditioned world models that enable goal-directed pla

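Experimental results

Research questions

RQ1: Can JEPA-based representations trained with regularization avoid collapse across image, video, and action-conditioned tasks?
RQ2: How do the choice of regularizer (VICReg vs. SIGReg) and the projector design affect representation quality on standard benchmarks such as CIFAR-10?
RQ3: Does multi-step rollout training improve long-horizon prediction and downstream planning performance in AC-Video-JEPA?
RQ4: How do the additional regularizers (temporal consistency, inverse dynamics) affect stability and planning success in randomized environments?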
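The multi-step rollout training that RQ3 asks about can be sketched as follows. This is a toy illustration under stated assumptions: latents are plain lists of floats, the predictor is a hypothetical function with a constant bias, and the loss is a mean squared error averaged over the rollout steps. None of these names come from the library.

```python
def rollout_loss(predictor, latents, actions, horizon):
    """Average squared error over `horizon` autoregressive prediction steps.

    latents: target embeddings [z_0, ..., z_T], each a list of floats
    actions: control inputs [a_0, ..., a_{T-1}]
    """
    z = latents[0]
    total = 0.0
    for k in range(horizon):
        # Feed the model's own prediction back in, as at inference time.
        z = predictor(z, actions[k])
        target = latents[k + 1]
        total += sum((p - t) ** 2 for p, t in zip(z, target)) / len(target)
    return total / horizon


def biased_predictor(z, a):
    """Hypothetical predictor: the true dynamics are z + a, here with a 0.1 bias."""
    return [v + a + 0.1 for v in z]


latents = [[0.0], [1.0], [2.0], [3.0]]  # targets generated by z_{t+1} = z_t + a_t
actions = [1.0, 1.0, 1.0]

one_step = rollout_loss(biased_predictor, latents, actions, horizon=1)
three_step = rollout_loss(biased_predictor, latents, actions, horizon=3)
print(one_step, three_step)
```

With this toy predictor the one-step loss is about 0.01, while the three-step loss is about 0.047 because the bias compounds at every step. A rollout loss exposes this accumulated error during training, whereas a one-step loss does not.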

Key findings

Method	Best acc.	Average acc.	w/o Projector	Hyperparams	Best projector
SIGReg	91.02%	89.22%	-3.3 points	1	2048 × 128
VICReg	90.12%	84.90%	-2.9 points	2	2048 × 1024

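Image-JEPA reaches roughly 90–91% linear-probing accuracy on CIFAR-10 with a ResNet-18; the best run is 91.02% for SIGReg and 90.12% for VICReg.
Regularizing through a learned projector improves accuracy by about 3 points over regularizing the encoder output directly (3.3 points for SIGReg, 2.9 for VICReg).
Video-JEPA trained with multi-step rollouts keeps prediction quality high and improves downstream average precision on Moving MNIST.
AC-Video-JEPA reaches a 97% planning success rate on Two Rooms with MPPI; in the ablations the inverse dynamics model (IDM) is critical, and the variance, covariance, and temporal regularizers each contribute substantially to performance.
The regularization components (variance, covariance, temporal similarity, inverse dynamics) are essential for preventing collapse and enabling effective planning.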
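The planning result above relies on sampling-based optimization of action sequences through the world model. The sketch below shows the simpler of the two planners named in the review, the cross-entropy method (CEM), not the MPPI planner behind the 97% figure. The one-dimensional latent, the `step` dynamics, and the `cost` function are hypothetical stand-ins for a learned predictor and a distance to a goal embedding.

```python
import random
import statistics


def cem_plan(step, cost, z0, horizon=5, samples=200, elites=20, iters=10, seed=0):
    """Cross-entropy method: refit a Gaussian over action sequences to the best samples."""
    rng = random.Random(seed)
    mean = [0.0] * horizon
    std = [1.0] * horizon
    for _ in range(iters):
        scored = []
        for _ in range(samples):
            actions = [rng.gauss(m, s) for m, s in zip(mean, std)]
            z = z0
            for a in actions:
                z = step(z, a)  # roll the model forward in latent space
            scored.append((cost(z), actions))
        scored.sort(key=lambda pair: pair[0])
        best = [actions for _, actions in scored[:elites]]
        # Refit the sampling distribution to the elite action sequences.
        mean = [statistics.fmean(col) for col in zip(*best)]
        std = [statistics.pstdev(col) + 1e-3 for col in zip(*best)]
    return mean


GOAL = 3.0  # hypothetical goal embedding


def step(z, a):
    """Toy latent dynamics standing in for a learned action-conditioned predictor."""
    return z + a


def cost(z):
    """Squared distance between the predicted final latent and the goal."""
    return (z - GOAL) ** 2


plan = cem_plan(step, cost, z0=0.0)
final = 0.0
for a in plan:
    final = step(final, a)
print("final latent:", final)
```

In an actual world-model setting, `step` would be the trained predictor and `cost` would compare predicted and goal embeddings, so planning quality depends directly on how well the latent space is regularized.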

Figure 2: Hyperparameter sensitivity comparison between SIGReg and VICReg on CIFAR-10. SIGReg demonstrates greater stability across different hyperparameter configurations, while VICReg achieves similar peak performance but requires more careful tuning.

This review was generated by AI and checked by a human editor.