QUICK REVIEW

[Paper Digest] Unified Vision-Language Modeling via Concept Space Alignment

Yifu Qiu, Paul-Ambroise Duquenne|arXiv (Cornell University)|Mar 1, 2026

Multimodal Machine Learning Applications | Cited by 0

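One-Sentence Summary

The paper proposes v-Sonar, which aligns a vision encoder post hoc to the Sonar multilingual embedding space to support vision-language tasks, and introduces v-LCM, which performs latent-diffusion vision-language modeling in that space.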

ABSTRACT

We introduce V-SONAR, a vision-language embedding space extended from the text-only embedding space SONAR (Omnilingual Embeddings Team et al., 2026), which supports 1500 text languages and 177 speech languages. To construct V-SONAR, we propose a post-hoc alignment pipeline that maps the representations of an existing vision encoder into the SONAR space. We thoroughly evaluate V-SONAR and show that its embeddings achieve competitive performance on text-to-video retrieval. Equipped with the OMNISONAR text decoder, V-SONAR further surpasses state-of-the-art vision-language models on video captioning tasks, including DREAM-1K (BLEU 23.9 vs. 19.6) and PE-VIDEO (BLEU 39.0 vs. 30.0). Leveraging V-SONAR, we first demonstrate that the Large Concept Model (LCM; LCM team et al. 2024) operating in SONAR and trained with English text only, can perform both single- and multi-visual concept understanding in a zero-shot manner. Finally, we introduce V-LCM, which extends the LCM with vision-language instruction tuning. V-LCM encodes vision and language inputs into a unified sequence of latent embeddings via V-SONAR and SONAR, and it is trained with the same latent diffusion objective for next-embedding prediction as in LCM's text-only pre-training. Experiments on a large-scale multilingual and -modal instruction-tuning data mixture highlight the potential of V-LCM: V-LCM matches state-of-the-art vision-language models on tasks covering image/video captioning and question answering, while significantly outperforming them across 61 rich- to low-resource languages out of all 62 tested languages.

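Motivation and Goals

Create a vision-language embedding space that extends Sonar to the image and video modalities.
Align the vision encoder with Sonar through a coarse-to-fine curriculum, enabling zero-shot and multilingual vision-language tasks.
Show that the LCM can perform vision-language reasoning and instruction tuning in the Sonar/v-Sonar latent space.
Show that the resulting models reach competitive or state-of-the-art results on video retrieval, captioning, and multilingual tasks.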

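Proposed Method

Post-hoc alignment of the Perception Encoder to Sonar through a lightweight projector.
Three-stage curriculum: 12M image-caption pairs for coarse-grained grounding, 2M pseudo video-caption pairs for temporal adaptation, and 200K high-quality video captions for fine-grained alignment.
Align vision and text embeddings in the Sonar space with a mean squared error (MSE) loss, keeping Sonar frozen and updating only the projector/vision encoder.
Compare linear projection against full fine-tuning, using a progressive training setup with architecture and data ablations.
Extend Sonar to OmniSONAR and show improved multilingual embedding quality; evaluate properties of the embedding space (trace, log-determinant).
The LCM operates in the Sonar space and uses a diffusion objective to predict the next embedding conditioned on the context embeddings (two-tower variant).
Introduce v-LCM through vision-language instruction tuning: v-Sonar visual embeddings are concatenated with Sonar text embeddings, and the model predicts the next embedding in the latent space.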
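The alignment step described above (an MSE loss toward frozen text embeddings, trained stage by stage) can be illustrated with a small toy sketch. Everything here is an assumption made for illustration and is not the authors' code: the projector is a plain linear map, the "vision features" and "frozen text embeddings" are synthetic, all three stages draw from the same toy generator, and the stage sizes (120, 20, 2) only mimic the 12M : 2M : 200K ratio. The names `project`, `mse_loss`, `train_stage`, and `make_pairs` are hypothetical.

```python
import random

def project(W, x):
    """Apply a linear projector W (one row per output dim) to a feature vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def mse_loss(W, pairs):
    """Mean squared error between projected features and their frozen targets."""
    total, count = 0.0, 0
    for x, t in pairs:
        y = project(W, x)
        total += sum((yi - ti) ** 2 for yi, ti in zip(y, t))
        count += len(t)
    return total / count

def train_stage(W, pairs, lr, epochs):
    """One curriculum stage: per-sample SGD on the MSE loss. Only W changes;
    the targets (stand-ins for frozen text embeddings) are never updated."""
    for _ in range(epochs):
        for x, t in pairs:
            y = project(W, x)
            for i, row in enumerate(W):
                grad_y = 2.0 * (y[i] - t[i]) / len(t)  # d(loss)/d(y_i)
                for j in range(len(row)):
                    row[j] -= lr * grad_y * x[j]
    return W

def make_pairs(rng, W_true, n, d_in):
    """Toy data: random 'vision features' whose targets come from a hidden map."""
    pairs = []
    for _ in range(n):
        x = [rng.uniform(-1.0, 1.0) for _ in range(d_in)]
        pairs.append((x, project(W_true, x)))
    return pairs

rng = random.Random(0)
D_IN, D_OUT = 4, 3
W_true = [[rng.uniform(-1.0, 1.0) for _ in range(D_IN)] for _ in range(D_OUT)]
W = [[0.0] * D_IN for _ in range(D_OUT)]  # trainable projector, zero-initialised

held_out = make_pairs(rng, W_true, 50, D_IN)
initial_loss = mse_loss(W, held_out)

# Scaled-down stand-ins for the three curriculum stages (coarse to fine).
stages = [
    ("image-caption", 120),
    ("pseudo video-caption", 20),
    ("high-quality video caption", 2),
]
stage_losses = []
for name, n_pairs in stages:
    train_stage(W, make_pairs(rng, W_true, n_pairs, D_IN), lr=0.2, epochs=3)
    stage_losses.append((name, mse_loss(W, held_out)))

final_loss = stage_losses[-1][1]
print(initial_loss, stage_losses)
```

In this sketch the stages differ only in how much data they see; in the paper they also differ in modality and caption quality, which a toy generator cannot reproduce.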

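Experimental Results

Research Questions

RQ1: Can a vision encoder be effectively aligned post hoc to a language-agnostic embedding space (Sonar) to support vision-language tasks?
RQ2: Does the three-stage curriculum improve alignment quality and downstream retrieval and captioning performance on multilingual data?
RQ3: Can the Large Concept Model (LCM) operate zero-shot in the Sonar latent space, and does vision-language instruction tuning (v-LCM) yield strong multilingual VLM performance?
RQ4: How does v-LCM compare with state-of-the-art vision-language models on image/video captioning, VQA, and multilingual benchmarks?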

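Key Findings

v-Sonar achieves competitive zero-shot text-to-video retrieval on PE-Video, Vatex, and Dream-1k.
Combined with the OmniSONAR text decoder, v-Sonar achieves state-of-the-art or strong video captioning results on PE-Video, Dream-1k, and Vatex, including Dream-1k (BLEU 23.9 vs. 19.6) and PE-Video (BLEU 39.0 vs. 30.0).
With vision inputs aligned to Sonar, the LCM, trained on English text only, performs single- and multi-visual-concept understanding in a zero-shot setting.
v-LCM, trained on vision-language instruction-tuning data (M3IT), matches or exceeds several VLM baselines on captioning and question answering and leads in 61 of the 62 languages in the multilingual evaluation.
Across the 62 languages of M3IT, v-LCM outperforms Qwen2.5-VL-7B and PLM-8B in most languages, with notable gains on mid- and low-resource languages.
v-LCM performs strongly on video question answering (IVQA, ActivityNetQA, MSRVTT-QA) and is competitive on video captioning and summarization.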
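To make the text-to-video retrieval finding concrete, the sketch below scores retrieval in a shared embedding space. It is an illustration under stated assumptions: cosine similarity and recall@k are common choices for retrieval but are not confirmed by this digest as the paper's metric, the embeddings are hand-made toy vectors, and the names `cosine` and `recall_at_k` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def recall_at_k(text_embs, video_embs, k):
    """Fraction of text queries whose paired video (same index) is in the top k."""
    hits = 0
    for i, query in enumerate(text_embs):
        ranked = sorted(
            range(len(video_embs)),
            key=lambda j: cosine(query, video_embs[j]),
            reverse=True,
        )
        if i in ranked[:k]:
            hits += 1
    return hits / len(text_embs)

# Toy embeddings: text_embs[i] is the caption paired with video_embs[i].
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
video_embs = [[0.9, 0.1, 0.0], [0.1, 0.6, 0.8], [0.0, 0.7, 0.7]]

r1 = recall_at_k(text_embs, video_embs, k=1)
r2 = recall_at_k(text_embs, video_embs, k=2)
print(r1, r2)
```

Here only the first caption retrieves its own video at rank 1, while the other two find theirs at rank 2, so recall@1 is 1/3 and recall@2 is 1.0. Because text and video share one space, no cross-modal scoring model is needed at query time.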


This digest was generated by AI and reviewed by human editors.