QUICK REVIEW

[Paper Review] VJEPA: Variational Joint Embedding Predictive Architectures as Probabilistic World Models

Yongchao Huang|arXiv (Cornell University)|Jan 20, 2026
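Adversarial Robustness in Machine Learning | Cited by 0
One-Sentence Summary

VJEPA extends JEPA with a probabilistic predictive model over future latent states, which makes planning uncertainty-aware and connects JEPA to Bayesian filtering and Predictive State Representations, all without reconstructing observations.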

ABSTRACT

Joint Embedding Predictive Architectures (JEPA) offer a scalable paradigm for self-supervised learning by predicting latent representations rather than reconstructing high-entropy observations. However, existing formulations rely on \textit{deterministic} regression objectives, which mask probabilistic semantics and limit its applicability in stochastic control. In this work, we introduce \emph{Variational JEPA (VJEPA)}, a \textit{probabilistic} generalization that learns a predictive distribution over future latent states via a variational objective. We show that VJEPA unifies representation learning with Predictive State Representations (PSRs) and Bayesian filtering, establishing that sequential modeling does not require autoregressive observation likelihoods. Theoretically, we prove that VJEPA representations can serve as sufficient information states for optimal control without pixel reconstruction, while providing formal guarantees for collapse avoidance. We further propose \emph{Bayesian JEPA (BJEPA)}, an extension that factorizes the predictive belief into a learned dynamics expert and a modular prior expert, enabling zero-shot task transfer and constraint (e.g. goal, physics) satisfaction via a Product of Experts. Empirically, through a noisy environment experiment, we demonstrate that VJEPA and BJEPA successfully filter out high-variance nuisance distractors that cause representation collapse in generative baselines. By enabling principled uncertainty estimation (e.g. constructing credible intervals via sampling) while remaining likelihood-free regarding observations, VJEPA provides a foundational framework for scalable, robust, uncertainty-aware planning in high-dimensional, noisy environments.

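Motivation and Objectives

  • Motivate and formalize JEPA as a probabilistic predictive state-space model.
  • Show that JEPA representations can serve as sufficient information states for optimal control without pixel reconstruction.
  • Unify JEPA with Predictive State Representations (PSRs) and Bayesian filtering.
  • Introduce Bayesian JEPA (BJEPA) to enable modular priors and zero-shot task transfer.
  • Demonstrate collapse avoidance and uncertainty-aware prediction in noisy environments.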

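Proposed Method

  • Introduce p_phi(Z_T | Z_C, xi_T) as a learned predictive distribution over future latent states.
  • Use an approximate inference distribution q_theta'(Z_T | x_T) produced by a target encoder whose parameters are updated by an exponential moving average (EMA).
  • Train with the variational objective L_VJEPA = E[-log p_phi(Z_T|Z_C,xi_T)] + beta E[KL(q_theta'(Z_T|x_T) || p(Z_T))].
  • Keep the JEPA structure, with context Z_C = f_theta(x_C) and target structure xi_T.
  • Optionally include an observation model p_psi(x_T|Z_T) but do not optimize it; learning relies on prediction in representation space.
  • Provide prediction and uncertainty propagation in latent space for planning.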
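
The objective listed above can be sketched for a single sample. This is a minimal illustration and not the paper's implementation: it assumes that both q_theta' and p_phi are diagonal Gaussians, that the prior p(Z_T) is a standard normal, and that the expectation in the first term is taken over Z_T sampled from q_theta'. The function names (`gaussian_nll`, `kl_to_standard_normal`, `vjepa_loss`) are hypothetical.

```python
import math
import random


def gaussian_nll(z, mean, log_var):
    """Negative log-density of z under a diagonal Gaussian N(mean, exp(log_var))."""
    return sum(
        0.5 * (math.log(2.0 * math.pi) + lv + (zi - m) ** 2 / math.exp(lv))
        for zi, m, lv in zip(z, mean, log_var)
    )


def kl_to_standard_normal(mean, log_var):
    """KL(N(mean, exp(log_var)) || N(0, I)) for a diagonal Gaussian."""
    return sum(
        0.5 * (math.exp(lv) + m ** 2 - 1.0 - lv)
        for m, lv in zip(mean, log_var)
    )


def vjepa_loss(q_mean, q_log_var, p_mean, p_log_var, beta=1.0, rng=None):
    """Single-sample Monte Carlo estimate of L_VJEPA.

    q_* : parameters of q_theta'(Z_T | x_T), e.g. from the EMA target encoder.
    p_* : parameters of p_phi(Z_T | Z_C, xi_T), e.g. from the predictor.
    Assumption: p(Z_T) is N(0, I) and the expectation is over Z_T ~ q.
    """
    if rng is None:
        rng = random.Random(0)
    # Reparameterized sample Z_T = mean + std * eps with eps ~ N(0, 1).
    z = [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(q_mean, q_log_var)
    ]
    nll = gaussian_nll(z, p_mean, p_log_var)          # E[-log p_phi(...)]
    kl = kl_to_standard_normal(q_mean, q_log_var)     # KL(q || p(Z_T))
    return nll + beta * kl
```

With a fixed seed, a predictor whose mean matches the target distribution yields a lower loss than one whose mean is far away, and changing `beta` only rescales the KL term. No observation model appears anywhere in the computation, which mirrors the likelihood-free property described in the list.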
Figure 3: Performance metrics across noise scales. Top Row: Training set $R^{2}$ . Bottom Row: Test set $R^{2}$ (Generalization). The generative models (VAE, AR) degrade linearly as noise increases, tracking the distractor (Bottom Right). The JEPA-based models (Blue/Cyan/Purple) maintain high signal

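Experimental Results

Research Questions

  • RQ1: What probabilistic objective does deterministic JEPA implicitly optimize, and how can it be generalized to handle uncertainty?
  • RQ2: Can JEPA be formalized as a latent dynamical system whose learned representations are sufficient information states for optimal control without reconstructing observations?
  • RQ3: How does JEPA relate to Bayesian filtering and Predictive State Representations, and can structural priors be injected through Bayesian factorization?
  • RQ4: Does introducing temporal structure force an autoregressive observation likelihood, or can the model remain likelihood-free while avoiding irrelevant distractors?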

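Key Findings

  • VJEPA provides a probabilistic predictive model over future latent states, enabling uncertainty estimation and multimodal futures.
  • The framework unifies JEPA with Predictive State Representations and Bayesian filtering without requiring observation reconstruction.
  • BJEPA extends VJEPA by factorizing the predictive belief into a dynamics expert and a modular prior expert, and achieves constraint satisfaction and zero-shot task transfer via a Product of Experts.
  • Training with the variational objective provides collapse-avoidance guarantees under target-diversity and non-triviality conditions.
  • Empirical toy results show that VJEPA and BJEPA filter out high-variance distractor noise and support uncertainty-aware planning in noisy environments.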
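
The Product of Experts step mentioned for BJEPA has a simple closed form when both experts are Gaussian. The sketch below assumes diagonal Gaussian experts, which is an illustrative choice and not confirmed by the summary; the function name `product_of_gaussian_experts` is hypothetical. The fused belief is a precision-weighted average of the two experts, so a tight prior expert (for example, a goal constraint) pulls the prediction toward itself.

```python
def product_of_gaussian_experts(mean_a, var_a, mean_b, var_b):
    """Fuse two diagonal Gaussian experts by multiplying their densities.

    The normalized product of N(mean_a, var_a) and N(mean_b, var_b) is a
    Gaussian whose precision is the sum of the two precisions and whose mean
    is the precision-weighted average of the two means.
    """
    fused_mean, fused_var = [], []
    for ma, va, mb, vb in zip(mean_a, var_a, mean_b, var_b):
        precision = 1.0 / va + 1.0 / vb
        v = 1.0 / precision
        fused_var.append(v)
        fused_mean.append(v * (ma / va + mb / vb))
    return fused_mean, fused_var


# Hypothetical example: a dynamics expert centered at 0 and a prior expert
# centered at 2 with four times the precision.
dyn_mean, dyn_var = [0.0], [1.0]
prior_mean, prior_var = [2.0], [0.25]
belief_mean, belief_var = product_of_gaussian_experts(
    dyn_mean, dyn_var, prior_mean, prior_var
)
```

Because the prior expert enters only through this product, it can in principle be swapped at test time without retraining the dynamics expert, which is one way to read the modular, zero-shot transfer claim above.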
Figure 4: Latent Reconstructions at varying noise scales. At $\sigma=8.0$ (Right), the VAE and AR reconstructions (dashed lines) track the high-frequency noise. In contrast, BJEPA and VJEPA (solid lines) successfully filter the noise and track the underlying true signal (black line).


This review was generated by AI and reviewed by human editors.