QUICK REVIEW

[Paper Review] A Lightweight Library for Energy-Based Joint-Embedding Predictive Architectures

Basile Terver, Randall Balestriero|arXiv (Cornell University)|2026. 02. 03.

Generative Adversarial Networks and Image Synthesis | Citations: 0

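One-Line Summary

EB-JEPA is an open-source library implementing JEPA-based models for image representation learning, video prediction, and action-conditioned world modeling; each example trains on a single GPU and comes with full ablations and tutorials.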

ABSTRACT

We present EB-JEPA, an open-source library for learning representations and world models using Joint-Embedding Predictive Architectures (JEPAs). JEPAs learn to predict in representation space rather than pixel space, avoiding the pitfalls of generative modeling while capturing semantically meaningful features suitable for downstream tasks. Our library provides modular, self-contained implementations that illustrate how representation learning techniques developed for image-level self-supervised learning can transfer to video, where temporal dynamics add complexity, and ultimately to action-conditioned world models, where the model must additionally learn to predict the effects of control inputs. Each example is designed for single-GPU training within a few hours, making energy-based self-supervised learning accessible for research and education. We provide ablations of JEA components on CIFAR-10. Probing these representations yields 91% accuracy, indicating that the model learns useful features. Extending to video, we include a multi-step prediction example on Moving MNIST that demonstrates how the same principles scale to temporal modeling. Finally, we show how these representations can drive action-conditioned world models, achieving a 97% planning success rate on the Two Rooms navigation task. Comprehensive ablations reveal the critical importance of each regularization component for preventing representation collapse. Code is available at https://github.com/facebookresearch/eb_jepa.

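Research Motivation and Goals

Provide accessible, modular JEPA implementations for image representation learning, video prediction, and action-conditioned planning.
Demonstrate that regularized JEPA training prevents collapse and produces useful representations.
Provide comprehensive ablations and practical hyperparameter guidance for small-scale and educational use.
Support rapid experimentation with, and understanding of, JEPA principles through clear documentation.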

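Proposed Method

Define a unified JEPA energy objective that combines a prediction loss with regularization to prevent collapse.
Instantiate three settings: Image-JEPA (view-invariant representations), Video-JEPA (temporal prediction in latent space), and AC-Video-JEPA (action-conditioned world modeling).
Use energy-based training with a regularizer (VICReg or SIGReg) applied to the output of a learned projector.
Include a multi-step rollout loss to better align training with autoregressive inference.
Augment the action-conditioned model with additional regularizers (temporal similarity, inverse dynamics) and a planning objective optimized with MPPI/CEM.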
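To make the unified objective more concrete, here is a minimal NumPy sketch of an energy that adds a prediction term to VICReg-style variance and covariance penalties computed on projector outputs. It is an illustration under stated assumptions, not the library's API: the function names, the weights `w_pred`, `w_var`, `w_cov`, and the hinge target `gamma` are all hypothetical choices made for this example.

```python
import numpy as np


def prediction_loss(pred, target):
    """Mean squared error between predicted and target embeddings."""
    return float(np.mean((pred - target) ** 2))


def variance_loss(z, gamma=1.0, eps=1e-4):
    """Hinge penalty that pushes each dimension's std above gamma."""
    std = np.sqrt(z.var(axis=0) + eps)
    return float(np.mean(np.maximum(0.0, gamma - std)))


def covariance_loss(z):
    """Sum of squared off-diagonal covariance entries, divided by dimension."""
    n, d = z.shape
    zc = z - z.mean(axis=0)
    cov = zc.T @ zc / (n - 1)
    off_diag = cov - np.diag(np.diag(cov))
    return float(np.sum(off_diag ** 2) / d)


def jepa_energy(z_a, z_b, w_pred=1.0, w_var=1.0, w_cov=1.0):
    """Prediction term plus anti-collapse regularization on both branches.

    z_a, z_b: (batch, dim) projector outputs for two views (assumed inputs).
    The weights are illustrative placeholders, not tuned values.
    """
    pred = prediction_loss(z_a, z_b)
    var = variance_loss(z_a) + variance_loss(z_b)
    cov = covariance_loss(z_a) + covariance_loss(z_b)
    return w_pred * pred + w_var * var + w_cov * cov


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    healthy = rng.standard_normal((256, 8))   # spread-out embeddings
    collapsed = np.ones((256, 8))             # every sample maps to one point
    print("healthy energy:  ", jepa_energy(healthy, healthy))
    print("collapsed energy:", jepa_energy(collapsed, collapsed))
```

In this sketch a collapsed encoder gets a perfect prediction term but is still penalized by the variance hinge, which is the intuition behind pairing the prediction loss with a regularizer.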

Figure 1: EB-JEPA is a modular code base and tutorial, providing self-contained implementations of Joint-Embedding Predictive Architecture for (a) self-supervised image representation learning (b) video prediction in latent space, and (c) action-conditioned world models that enable goal-directed pla

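Experimental Results

Research Questions

RQ1: Can JEPA-based representations trained with regularization avoid collapse across image, video, and action-conditioned tasks?
RQ2: How do the regularization method (VICReg vs. SIGReg) and projector design affect representation quality on standard benchmarks such as CIFAR-10?
RQ3: Does multi-step rollout training improve long-horizon prediction and downstream planning performance for AC-Video-JEPA?
RQ4: How do the additional regularizers (temporal similarity, inverse dynamics) affect stability and planning success in randomized environments?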
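As a rough illustration of what multi-step rollout training in RQ3 refers to, the sketch below computes a k-step loss in which the predictor is fed its own previous output, as it would be at inference time. The function name, argument layout, and the toy predictors are assumptions made for this example; they are not taken from the library.

```python
import numpy as np


def multistep_rollout_loss(predictor, latents, actions, n_steps):
    """Average squared error of autoregressive predictions against target latents.

    latents: (T + 1, d) array of target embeddings for consecutive frames.
    actions: (T, a) array of actions taken between consecutive frames.
    predictor: callable (z, a) -> next z; it consumes its own output
               for n_steps steps starting from each valid time index.
    """
    T = len(actions)
    losses = []
    for start in range(T - n_steps + 1):
        z = latents[start]
        for k in range(n_steps):
            z = predictor(z, actions[start + k])
            losses.append(np.mean((z - latents[start + k + 1]) ** 2))
    return float(np.mean(losses))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    actions = rng.standard_normal((10, 2))
    # Toy ground truth: each latent is the running sum of actions.
    latents = np.vstack([np.zeros((1, 2)), np.cumsum(actions, axis=0)])

    def biased(z, a):
        # Hypothetical predictor with a small constant error per step.
        return z + a + 0.1

    print("1-step loss:", multistep_rollout_loss(biased, latents, actions, 1))
    print("3-step loss:", multistep_rollout_loss(biased, latents, actions, 3))
```

With the biased toy predictor the per-step error compounds over the rollout, so the 3-step loss exposes drift that the 1-step loss does not measure.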

Key Results

Method	Best acc.	Average acc.	w/o Projector	Hyperparams	Best projector
SIGReg	91.02%	89.22%	-3.3 points	1	2048 × 128
VICReg	90.12%	84.90%	-2.9 points	2	2048 × 1024

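Image-JEPA achieves roughly 90-91% linear probing accuracy on CIFAR-10 with a ResNet-18; SIGReg peaks at 91.02% and VICReg at 90.12%.
Using a learned projector improves accuracy by about 3 points compared with regularizing the encoder outputs directly.
Video-JEPA with multi-step rollouts maintains higher prediction quality and improves downstream Average Precision on Moving MNIST.
AC-Video-JEPA reaches a 97% planning success rate on Two Rooms with MPPI; ablations confirm that the IDM is important and that variance, covariance, and temporal regularization contribute substantially to performance.
The regularization components (variance, covariance, temporal similarity, inverse dynamics) are essential for preventing collapse and enabling effective planning.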
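To show how a latent world model can drive planning, the following sketch runs the cross-entropy method (CEM) over action sequences, scoring each sequence by how close its rolled-out final latent lands to a goal latent. Note that this uses CEM rather than MPPI, which is the planner behind the reported 97% figure. The toy predictor `toy_predictor`, the action bounds, and all hyperparameters are hypothetical stand-ins for a learned predictor and its settings.

```python
import numpy as np


def rollout(predictor, z0, actions):
    """Autoregressively apply the predictor: z_{t+1} = predictor(z_t, a_t)."""
    z = z0
    for a in actions:
        z = predictor(z, a)
    return z


def cem_plan(predictor, z0, z_goal, horizon=5, action_dim=2,
             n_samples=200, n_elites=20, n_iters=10, seed=0):
    """Return a mean action sequence that steers the latent toward z_goal."""
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(n_iters):
        # Sample candidate action sequences and clip to assumed bounds [-1, 1].
        noise = rng.standard_normal((n_samples, horizon, action_dim))
        samples = np.clip(mean + std * noise, -1.0, 1.0)
        # Cost: squared distance between final predicted latent and the goal.
        costs = np.array([
            np.sum((rollout(predictor, z0, s) - z_goal) ** 2) for s in samples
        ])
        # Refit the sampling distribution to the lowest-cost sequences.
        elites = samples[np.argsort(costs)[:n_elites]]
        mean = elites.mean(axis=0)
        std = elites.std(axis=0) + 1e-6
    return mean


def toy_predictor(z, a):
    """Hypothetical latent dynamics used only for this illustration."""
    return z + 0.5 * a


if __name__ == "__main__":
    z0 = np.zeros(2)
    z_goal = np.array([1.0, -1.0])
    plan = cem_plan(toy_predictor, z0, z_goal)
    print("final latent:", rollout(toy_predictor, z0, plan))
```

Because the cost is evaluated entirely in latent space, the planner never needs to decode pixels; the quality of the plan then depends on how well the predictor's rollouts match the true dynamics.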

Figure 2: Hyperparameter sensitivity comparison between SIGReg and VICReg on CIFAR-10. SIGReg demonstrates greater stability across different hyperparameter configurations, while VICReg achieves similar peak performance but requires more careful tuning.


This review was generated by AI and reviewed by a human editor.