QUICK REVIEW

[Paper Review] US-JEPA: A Joint Embedding Predictive Architecture for Medical Ultrasound

Ashwath Radhachandran, Vedrana Ivezić|arXiv (Cornell University)|Feb 22, 2026

Ultrasound Imaging and Elastography|Cited by 0

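One-Sentence Summary

US-JEPA applies the static-teacher SALT framework to ultrasound imaging, learning latent representations through masked prediction in embedding space, and achieves strong linear-probing performance on the eight UltraBench tasks.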

ABSTRACT

Ultrasound (US) imaging poses unique challenges for representation learning due to its inherently noisy acquisition process. The low signal-to-noise ratio and stochastic speckle patterns hinder standard self-supervised learning methods relying on a pixel-level reconstruction objective. Joint-Embedding Predictive Architectures (JEPAs) address this drawback by predicting masked latent representations rather than raw pixels. However, standard approaches depend on hyperparameter-brittle and computationally expensive online teachers updated via exponential moving average. We propose US-JEPA, a self-supervised framework that adopts the Static-teacher Asymmetric Latent Training (SALT) objective. By using a frozen, domain-specific teacher to provide stable latent targets, US-JEPA decouples student-teacher optimization and pushes the student to expand upon the semantic priors of the teacher. In addition, we provide the first rigorous comparison of all publicly available state-of-the-art ultrasound foundation models on UltraBench, a public dataset benchmark spanning multiple organs and pathological conditions. Under linear probing for diverse classification tasks, US-JEPA achieves performance competitive with or superior to domain-specific and universal vision foundation model baselines. Our results demonstrate that masked latent prediction provides a stable and efficient path toward robust ultrasound representations.

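Motivation and Objectives

Address the robustness and data-efficiency problems in ultrasound representation learning that arise from noise and speckle artifacts.
Develop a JEPA self-supervised framework that operates in latent space and uses a frozen, domain-specific teacher.
Reduce reliance on pixel-level reconstruction by focusing on latent, semantic prediction.
Benchmark all publicly available ultrasound foundation models on UltraBench under linear probing to standardize evaluation.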

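Proposed Method

Adopt SALT: freeze a domain-specific teacher (URFM) to provide stable latent targets.
Use a masked latent prediction objective that predicts target embeddings from context patches of the same image.
Introduce Ultrasound Region-Conditioning (USrc), which restricts masking to the valid ultrasound region and avoids non-anatomical content.
Train a context encoder (ViT-B/16) and a predictor to minimize the Smooth L1 distance between their predictions and the frozen teacher's embeddings.
Pretrain on a large public ultrasound corpus (about 4.73 million frames from 49 datasets).
Evaluate with standardized UltraBench linear probes on eight classification tasks.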
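The masked latent objective described above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names, the `beta` threshold, and the toy embeddings are assumptions made for illustration, and the summary only states that a Smooth L1 distance to the frozen teacher's embeddings is minimized.

```python
def smooth_l1(pred, target, beta=1.0):
    """Mean Smooth L1 distance between two equal-length vectors.

    beta is the quadratic/linear switch point; its value here is an assumption.
    """
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        if diff < beta:
            total += 0.5 * diff * diff / beta  # quadratic near zero
        else:
            total += diff - 0.5 * beta         # linear for large errors
    return total / len(pred)


def masked_latent_loss(predicted, teacher_targets, masked_idx, beta=1.0):
    """Average Smooth L1 over the masked patch positions only.

    predicted: per-patch embeddings produced by the student + predictor.
    teacher_targets: per-patch embeddings from the frozen teacher, treated
    as constants (no gradient would flow into the teacher).
    """
    losses = [smooth_l1(predicted[i], teacher_targets[i], beta) for i in masked_idx]
    return sum(losses) / len(losses)


# Toy example: 4 patches, 2-dimensional embeddings, patches 1 and 3 masked.
teacher = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
student_pred = [[9.0, 9.0], [1.5, 1.0], [2.0, 2.0], [3.0, 5.0]]
loss = masked_latent_loss(student_pred, teacher, masked_idx=[1, 3])
print(loss)
```

Only the masked positions contribute, so the large error on patch 0 is ignored. In an actual training loop the same quantity would be computed on tensors and back-propagated through the student and predictor only.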

Figure 1 : USrc-JEPA framework. Here we show the model training framework with USrc. URFM is the frozen teacher that extracts target embeddings. The student and predictor are jointly optimized with $\mathcal{L}_{US-JEPA}$ to align with the target.
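The region conditioning named in Figure 1 can be sketched as follows. This is a simplified illustration under stated assumptions: the valid region is given as a patch-level 0/1 grid, targets are drawn uniformly at random rather than as contiguous blocks, and `sample_region_conditioned_masks` is a hypothetical helper, not a function from the paper's code.

```python
import random


def sample_region_conditioned_masks(valid, num_targets, seed=0):
    """Split valid patch indices into target and context sets.

    valid: 2D list of 0/1 flags, 1 = patch lies inside the ultrasound region.
    Returns (target_idx, context_idx) as lists of flat patch indices.
    Patches flagged 0 are never used as targets or as context.
    """
    width = len(valid[0])
    valid_idx = [r * width + c
                 for r, row in enumerate(valid)
                 for c, flag in enumerate(row) if flag]
    if num_targets >= len(valid_idx):
        raise ValueError("need at least one valid patch left for context")
    rng = random.Random(seed)
    target_idx = sorted(rng.sample(valid_idx, num_targets))
    target_set = set(target_idx)
    context_idx = [i for i in valid_idx if i not in target_set]
    return target_idx, context_idx


# 4x4 patch grid; zeros mark corners outside the imaging sector (hypothetical).
valid_grid = [
    [0, 1, 1, 0],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [0, 1, 1, 0],
]
targets, context = sample_region_conditioned_masks(valid_grid, num_targets=4)
print(targets, context)
```

The design point is that padding, text overlays, and other non-anatomical patches never enter the prediction task, so the student is not rewarded for modeling them.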

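Experimental Results

Research Questions

RQ1: Does the static-teacher SALT framework improve ultrasound representations in latent space relative to EMA-based JEPA and domain-specific baselines?
RQ2: How does US-JEPA perform under few-shot linear probing across diverse ultrasound tasks?
RQ3: Is the learned latent space robust to domain-specific artifacts and the corruptions common in ultrasound imaging?
RQ4: Does restricting targets and context to the valid ultrasound region (USrc) improve representation quality?
RQ5: How do publicly available ultrasound foundation models compare under linear probing on the standardized UltraBench benchmark?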

Key Findings

Model	AUL (Macro F1)	BUSBRA (Macro F1)	BUTTERFLY (Macro F1)	FATTY LIVER (Macro F1)	GBCU (Macro F1)	MMOTU (Macro F1)	POCUS (Macro F1)	TN5000 (Macro F1)
DINOv3	64.3 ± 0.6	70.9 ± 1.7	91.7 ± 0.4	55.8 ± 5.5	61.7 ± 0.5	37.2 ± 0.6	91.4 ± 0.4	67.5 ± 0.4
I-JEPA	61.5 ± 1.1	71.2 ± 4.0	90.5 ± 0.6	54.8 ± 1.6	53.7 ± 0.4	35.3 ± 0.6	88.1 ± 0.4	68.9 ± 0.2
UltraSAM	62.6 ± 3.1	70.2 ± 3.1	89.6 ± 2.4	66.9 ± 3.3	43.5 ± 4.9	39.7 ± 1.8	87.3 ± 2.1	63.9 ± 2.0
SAMUS	40.2 ± 0.9	65.9 ± 0.3	91.5 ± 0.0	42.1 ± 0.0	48.8 ± 0.3	20.4 ± 0.2	76.2 ± 0.1	51.7 ± 0.0
EchoCare	49.2 ± 2.4	64.4 ± 0.0	84.1 ± 0.7	42.1 ± 0.0	36.2 ± 0.5	21.1 ± 0.1	73.8 ± 3.9	49.8 ± 3.8
USF-MAE	58.1 ± 1.4	62.9 ± 0.5	91.1 ± 0.3	42.1 ± 0.0	45.9 ± 0.3	28.7 ± 0.3	90.1 ± 0.0	56.3 ± 1.1
USFM	61.6 ± 1.2	74.6 ± 0.5	92.4 ± 0.3	73.6 ± 8.8	67.4 ± 0.6	33.8 ± 0.3	85.7 ± 0.5	65.0 ± 2.6
URFM	71.5 ± 1.1	69.5 ± 2.2	92.1 ± 0.4	82.6 ± 6.0	59.1 ± 1.7	42.7 ± 0.4	91.7 ± 0.3	77.4 ± 0.4
US-JEPA	69.6 ± 1.5	73.8 ± 1.1	90.8 ± 0.3	82.5 ± 1.1	67.0 ± 1.4	52.2 ± 0.2	93.1 ± 0.0	73.1 ± 0.7
USrc-JEPA	67.6 ± 0.5	76.0 ± 1.2	91.5 ± 0.6	89.2 ± 0.9	70.2 ± 0.5	46.8 ± 0.2	92.5 ± 0.1	70.8 ± 1.3
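Every column in the table is a macro F1 score, which averages the per-class F1 with equal weight so that rare classes count as much as common ones. The following is a minimal sketch of that metric in plain Python; the labels are toy data, not results from the paper.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy imbalanced example: four samples of class 0, two of class 1.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
score = macro_f1(y_true, y_pred)
print(score)
```

Here class 0 scores 0.75 and class 1 scores 0.5, giving a macro F1 of 0.625 even though overall accuracy is about 0.67, which shows how the metric penalizes weak minority-class performance.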

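Between them, US-JEPA and USrc-JEPA achieve the best linear-probing result on five of the eight UltraBench tasks.
On MMOTU, the eight-class ovarian tumor task, US-JEPA reaches a macro F1 of 52.2%, 9.5 percentage points above URFM.
US-JEPA and USrc-JEPA are robust to domain-specific corruptions and outperform the baselines particularly at high levels of speckle-noise corruption.
In the few-shot setting, with less than 10% of the labeled data, US-JEPA's macro F1 exceeds that of URFM and USFM by up to 18%.
US-JEPA is competitive with, and often exceeds, domain-specific and general-purpose baselines, and the study also standardizes evaluation on a public benchmark.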
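The first two findings can be checked directly against the results table. The sketch below copies the mean macro F1 values from the table and compares means only; it ignores the reported ± spread, so near-ties such as BUTTERFLY should not be over-interpreted.

```python
TASKS = ["AUL", "BUSBRA", "BUTTERFLY", "FATTY LIVER",
         "GBCU", "MMOTU", "POCUS", "TN5000"]

# Mean macro F1 per task, in TASKS order, copied from the table.
SCORES = {
    "DINOv3":    [64.3, 70.9, 91.7, 55.8, 61.7, 37.2, 91.4, 67.5],
    "I-JEPA":    [61.5, 71.2, 90.5, 54.8, 53.7, 35.3, 88.1, 68.9],
    "UltraSAM":  [62.6, 70.2, 89.6, 66.9, 43.5, 39.7, 87.3, 63.9],
    "SAMUS":     [40.2, 65.9, 91.5, 42.1, 48.8, 20.4, 76.2, 51.7],
    "EchoCare":  [49.2, 64.4, 84.1, 42.1, 36.2, 21.1, 73.8, 49.8],
    "USF-MAE":   [58.1, 62.9, 91.1, 42.1, 45.9, 28.7, 90.1, 56.3],
    "USFM":      [61.6, 74.6, 92.4, 73.6, 67.4, 33.8, 85.7, 65.0],
    "URFM":      [71.5, 69.5, 92.1, 82.6, 59.1, 42.7, 91.7, 77.4],
    "US-JEPA":   [69.6, 73.8, 90.8, 82.5, 67.0, 52.2, 93.1, 73.1],
    "USrc-JEPA": [67.6, 76.0, 91.5, 89.2, 70.2, 46.8, 92.5, 70.8],
}
PROPOSED = {"US-JEPA", "USrc-JEPA"}

# Best model per task by mean score.
best = {task: max(SCORES, key=lambda m: SCORES[m][i])
        for i, task in enumerate(TASKS)}
wins = [task for task, model in best.items() if model in PROPOSED]

# Margin of US-JEPA over its frozen teacher URFM on MMOTU, in points.
mmotu = TASKS.index("MMOTU")
margin = round(SCORES["US-JEPA"][mmotu] - SCORES["URFM"][mmotu], 1)

print(wins, margin)
```

The proposed models lead on BUSBRA, FATTY LIVER, GBCU, MMOTU, and POCUS, while URFM leads on AUL and TN5000 and USFM on BUTTERFLY; the MMOTU margin over URFM is 9.5 points.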

Figure 2 : Distribution of pretraining data. To characterize the dataset composition at the organ level, we report the distribution of a. temporal sequences, including videos and volumes ( $n_{v}$ ), and b. individual static frames ( $n_{f}$ ).


This review was generated by AI and reviewed by a human editor.