QUICK REVIEW

[Paper Review] Time Series, Vision, and Language: Exploring the Limits of Alignment in Contrastive Representation Spaces

Pratham Yashwante, Rose Yu|arXiv (Cornell University)|Feb 22, 2026

Language and cultural evolution | Cited by 0

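One-Sentence Summary

This paper systematically studies the alignment of time series, vision, and language under contrastive learning, revealing cross-modal asymmetry and saturation effects as well as the influence of information density and visual grounding.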

ABSTRACT

The Platonic Representation Hypothesis posits that learned representations from models trained on different modalities converge to a shared latent structure of the world. However, this hypothesis has largely been examined in vision and language, and it remains unclear whether time series participate in such convergence. We first examine this in a trimodal setting and find that independently pretrained time series, vision, and language encoders exhibit near-orthogonal geometry in the absence of explicit coupling. We then apply post-hoc alignment by training projection heads over frozen encoders using contrastive learning, and analyze the resulting representations with respect to geometry, scaling behavior, and dependence on information density and input modality characteristics. Our investigation reveals that overall alignment in contrastive representation spaces improves with model size, but this alignment is asymmetric: time series align more strongly with visual representations than with text, and images can act as effective intermediaries between time series and language. We further see that richer textual descriptions improve alignment only up to a threshold; training on denser captions does not lead to further improvement. Analogous effects are observed for visual representations. Our findings shed light on considerations for building multimodal systems involving non-conventional data modalities beyond vision and language.

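Motivation and Objectives

Assess whether time series representations can align with vision and language in a shared latent space.
Characterize the geometry and scaling behavior of trimodal representations under contrastive learning.
Identify the factors that drive or limit cross-modal alignment across modalities and datasets.
Examine the roles of information density, grounding, and modality complementarity in alignment.
Inform design principles for multimodal systems that involve time series data.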

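Proposed Method

Use a CLIP-like framework: keep the unimodal encoders (time series, image, text) frozen and train projection heads that map each modality into a shared space.
Apply a symmetric cross-modal InfoNCE loss to every modality pair (TS–IMG, TS–TXT, IMG–TXT) and evaluate with multiple metrics.
Scale model capacity across 34 configurations and 26 encoder combinations to study alignment trends.
Vary information density through caption variants of the text to assess the effect of semantic explicitness.
Evaluate on CaTS-Bench and on additional datasets (TRUCE, MIMIC, PTB-XL) to test robustness and the effect of indirect text supervision.
Analyze alignment with cosine margin, Recall@k, Procrustes disparity, CKA, and mutual k-NN overlap.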
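
The contrastive objective and the Recall@k metric listed above can be illustrated with a minimal, dependency-free Python sketch. It assumes the projection heads have already produced paired embeddings (row i of one modality matches row i of the other) and that no embedding is the zero vector. The function names and the temperature of 0.07 are illustrative assumptions, not values taken from the paper; a real implementation would likely use a tensor library and train the projection heads by minimizing this loss.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (assumes a nonzero vector)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def similarity_matrix(emb_a, emb_b, temperature):
    """Cosine similarity between every row of emb_a and emb_b, divided by temperature."""
    a = [l2_normalize(v) for v in emb_a]
    b = [l2_normalize(v) for v in emb_b]
    return [[sum(x * y for x, y in zip(u, v)) / temperature for v in b] for u in a]


def diagonal_cross_entropy(logits):
    """Mean of -log softmax(row_i)[i]: the matching pair is the positive class."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_info_nce(emb_a, emb_b, temperature=0.07):
    """Average of the A-to-B and B-to-A InfoNCE terms for one modality pair."""
    logits = similarity_matrix(emb_a, emb_b, temperature)
    logits_t = [list(column) for column in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits_t))


def recall_at_k(emb_a, emb_b, k):
    """Fraction of rows in emb_a whose true match is among the k most similar rows of emb_b."""
    sims = similarity_matrix(emb_a, emb_b, 1.0)
    hits = 0
    for i, row in enumerate(sims):
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        hits += i in top_k
    return hits / len(sims)


# Toy usage: two hypothetical "time series" and "image" embeddings.
ts_emb = [[1.0, 0.0], [0.0, 1.0]]
img_emb = [[0.9, 0.1], [0.2, 0.8]]
loss = symmetric_info_nce(ts_emb, img_emb)
top1 = recall_at_k(ts_emb, img_emb, k=1)
```

In a trimodal setup, one plausible design is to compute this loss once per modality pair (TS–IMG, TS–TXT, IMG–TXT) and sum the three terms; how the paper weights them is not stated in this review.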

Figure 1: Trimodal projections of a shared temporal process. A latent process $Z$ gives rise to a numeric time series, a visual line plot, and a textual description, each representing the same signal in values, geometry, and language. Modality-specific encoders $f_{\text{ts}}$, $f_{\text{img}}$,

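Experimental Results

Research Questions

RQ1: Does alignment across time series, vision, and language improve uniformly as model size grows?
RQ2: How does the asymmetry between time series–vision and time series–language alignment manifest, and what causes it?
RQ3: How does textual information density affect cross-modal alignment, and does it saturate?
RQ4: How do indirect text supervision and linguistic variation affect alignment?
RQ5: Can richer visual inputs or a trimodal setup mitigate weak alignment?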

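Key Findings

Alignment improves as model size grows, but convergence is asymmetric: TS–IMG alignment is stronger than TS–TXT, and overall neighborhood-level alignment remains weak.
Jointly pretrained VL models achieve strong IMG–TXT alignment and transfer to the trimodal setting with little dependence on scale.
Increasing textual information density improves alignment up to a threshold; beyond it, further increases yield limited gains.
CaTS, whose captions relate directly to signal structure, shows stronger text alignment than MIMIC; indirect text supervision reduces alignment, most noticeably for TS–TXT and IMG–TXT.
Adding the image modality substantially improves TS–TXT alignment, whereas adding a third modality to the already strong TS–IMG pair may reduce performance because of optimization complexity.
Richer visual inputs (such as annotated TRUCE plots) consistently improve TS–IMG alignment, and larger models amplify these gains.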
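
Neighborhood-level alignment, mentioned in the first finding, is typically measured with a mutual k-NN overlap. The following is a minimal sketch of one common formulation: for each sample, take its k nearest neighbors by cosine similarity within each modality's embedding space and measure the fraction the two neighbor sets share. The exact variant used in the paper is not given in this review, so the function names and the choice of cosine similarity are assumptions. The sketch requires k to be at most the number of samples minus one.

```python
import math


def cosine(u, v):
    """Cosine similarity between two nonzero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


def knn_indices(embeddings, i, k):
    """Indices of the k samples most similar to sample i, excluding i itself."""
    others = [j for j in range(len(embeddings)) if j != i]
    others.sort(key=lambda j: cosine(embeddings[i], embeddings[j]), reverse=True)
    return set(others[:k])


def mutual_knn_overlap(space_a, space_b, k):
    """Average fraction of shared k-nearest neighbors across two embedding spaces.

    space_a[i] and space_b[i] must describe the same underlying sample.
    Returns 1.0 when neighborhoods agree everywhere and 0.0 when they never do.
    """
    n = len(space_a)
    total = 0.0
    for i in range(n):
        shared = knn_indices(space_a, i, k) & knn_indices(space_b, i, k)
        total += len(shared) / k
    return total / n


# Toy usage: four samples embedded in two hypothetical 2-D spaces.
ts_space = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
# Same points rotated by 90 degrees: neighborhoods are preserved.
rotated_space = [[0.0, 1.0], [-0.1, 0.9], [-1.0, 0.0], [-0.9, 0.1]]
# Same points assigned to different samples: neighborhoods no longer match.
shuffled_space = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]

overlap_rotated = mutual_knn_overlap(ts_space, rotated_space, k=1)
overlap_shuffled = mutual_knn_overlap(ts_space, shuffled_space, k=1)
```

Because the metric depends only on neighbor rankings inside each space, it does not require the two spaces to share a dimensionality or orientation, which makes it usable on encoders that were never trained together.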

Figure 2: Mean angular deviation between pretrained cross-modal representations on CaTS shows little inherent alignment.
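
As a rough illustration of the quantity named in the Figure 2 caption, the sketch below computes the mean angle between paired embeddings. It assumes the two sets of embeddings are already expressed in a common dimension and paired by index; how the paper makes differently sized pretrained encoders comparable is not described in this review, and the function names here are hypothetical. Under this reading, a mean near 90 degrees corresponds to the near-orthogonal geometry the abstract reports, and a mean near 0 degrees would indicate strong alignment.

```python
import math


def angle_degrees(u, v):
    """Angle between two nonzero vectors, in degrees."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    cos = max(-1.0, min(1.0, dot / norm))  # clamp against floating-point drift
    return math.degrees(math.acos(cos))


def mean_angular_deviation(space_a, space_b):
    """Average angle between index-matched embeddings of two modalities."""
    angles = [angle_degrees(u, v) for u, v in zip(space_a, space_b)]
    return sum(angles) / len(angles)


# Toy usage with hypothetical paired embeddings that are mutually orthogonal.
ts_emb = [[1.0, 0.0], [0.0, 2.0]]
txt_emb = [[0.0, 3.0], [1.0, 0.0]]
deviation = mean_angular_deviation(ts_emb, txt_emb)
```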

This review was generated by AI and reviewed by human editors.