QUICK REVIEW

[Paper Review] Large Concept Models: Language Modeling in a Sentence Representation Space

the KSS Cave Studies Team, Loïc Barrault|arXiv (Cornell University)|Dec 11, 2024

Topic Modeling | Cited by 10

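One-Sentence Summary

This paper proposes Large Concept Models (LCMs), which generate autoregressively in a fixed sentence embedding space (SONAR), enabling language- and modality-agnostic reasoning with strong zero-shot multilingual generalization. It compares base (MSE regression), diffusion-based, and quantized variants, scales one architecture to 7B parameters, evaluates it across languages, and releases the training code; the SONAR encoders and decoders are also publicly available.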

ABSTRACT

LLMs have revolutionized the field of artificial intelligence and have emerged as the de-facto tool for many tasks. The current established technology of LLMs is to process input and generate output at the token level. This is in sharp contrast to humans who operate at multiple levels of abstraction, well beyond single words, to analyze information and to generate creative content. In this paper, we present an attempt at an architecture which operates on an explicit higher-level semantic representation, which we name a concept. Concepts are language- and modality-agnostic and represent a higher level idea or action in a flow. Hence, we build a "Large Concept Model". In this study, as proof of feasibility, we assume that a concept corresponds to a sentence, and use an existing sentence embedding space, SONAR, which supports up to 200 languages in both text and speech modalities. The Large Concept Model is trained to perform autoregressive sentence prediction in an embedding space. We explore multiple approaches, namely MSE regression, variants of diffusion-based generation, and models operating in a quantized SONAR space. These explorations are performed using 1.6B parameter models and training data in the order of 1.3T tokens. We then scale one architecture to a model size of 7B parameters and training data of about 2.7T tokens. We perform an experimental evaluation on several generative tasks, namely summarization and a new task of summary expansion. Finally, we show that our model exhibits impressive zero-shot generalization performance to many languages, outperforming existing LLMs of the same size. The training code of our models is freely available.

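Research Motivation and Objectives

Move beyond token-level processing by reasoning at a higher level of abstraction in a language- and modality-agnostic embedding space.
Assess whether sentence representations can support coherent long-form generation and cross-lingual transfer.
Demonstrate the feasibility of autoregressive generation over SONAR embeddings across several architectures.
Evaluate zero-shot generalization across the languages and modalities supported by the SONAR encoders and decoders.
Provide open-source training code and components to encourage further research on concept-based modeling.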

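Proposed Method

Use SONAR as an off-the-shelf, language- and modality-agnostic sentence embedding space, representing the input as a sequence of concepts (sentences).
Train Large Concept Models (LCMs) to autoregressively predict the next concept in the embedding space, using an MSE or diffusion-based objective, and explore quantized variants.
Study three LCM variants: a Base-LCM with a standard decoder-only Transformer architecture; a One-Tower diffusion LCM with a single backbone; and a Two-Tower diffusion LCM that combines a contextualizer with a denoiser.
Study several noise schedules for diffusion, including cosine, quadratic, and a newly introduced sigmoid schedule, and apply classifier-free guidance and epsilon scaling at inference time.
Evaluate stopping criteria and decoding through the SONAR decoder, which yields multilingual and multimodal output without retraining the LCM.
Release open-source code for LCM training and for the SONAR encoders and decoders.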
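
The training objectives listed above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, embeddings are plain Python lists rather than real SONAR vectors, and the cosine schedule follows the widely used Nichol and Dhariwal form, which may differ from the paper's exact parameterization. The guidance function shows the standard classifier-free guidance combination, assumed here for illustration.

```python
import math
import random


def shift_for_autoregression(concepts):
    """Split a document's sentence embeddings into (inputs, targets)
    so that position n is trained to predict concept n + 1."""
    return concepts[:-1], concepts[1:]


def next_concept_mse(predicted, targets):
    """Mean squared error over all dimensions of all predicted concepts
    (the regression objective of a Base-LCM-style model)."""
    total, count = 0.0, 0
    for p, t in zip(predicted, targets):
        for a, b in zip(p, t):
            total += (a - b) ** 2
            count += 1
    return total / count


def cosine_alpha_bar(t, s=0.008):
    """Cosine noise schedule: fraction of signal kept at time t in [0, 1].
    The offset s is the common default, assumed here."""
    def f(u):
        return math.cos((u + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0.0)


def add_noise(x0, t, rng):
    """Variance-preserving forward process:
    x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps, eps ~ N(0, I)."""
    a = cosine_alpha_bar(t)
    return [math.sqrt(a) * v + math.sqrt(1 - a) * rng.gauss(0, 1) for v in x0]


def classifier_free_guidance(cond, uncond, scale):
    """Standard guidance mix: move the unconditional prediction toward
    the conditional one; scale = 1 returns the conditional prediction."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


if __name__ == "__main__":
    # Three toy 2-dimensional "concepts" standing in for SONAR embeddings.
    doc = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
    inputs, targets = shift_for_autoregression(doc)
    # A trivial "model" that copies its input as the prediction.
    print("MSE of copy baseline:", next_concept_mse(inputs, targets))
    rng = random.Random(0)
    for t in (0.0, 0.5, 1.0):
        print(t, round(cosine_alpha_bar(t), 4), add_noise(doc[1], t, rng))
```

In a diffusion LCM the denoiser would be conditioned on the preceding concepts and trained to recover the clean embedding from `add_noise` output; the sketch only shows the noising side and the loss, not the Transformer itself.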

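Experimental Results

Research Questions

RQ1: Can an autoregressive model effectively generate coherent long-form content in a fixed sentence embedding space?
RQ2: How much do diffusion-based and quantized approaches improve the quality and diversity of generation in the embedding space?
RQ3: How do LCMs compare with token-based LLMs of similar size on zero-shot multilingual generation?
RQ4: What benefits does a hierarchical, concept-centric architecture offer for long-context reasoning and for output in multiple modalities?
RQ5: What practical challenges and trade-offs arise in multilingual sentence segmentation and embedding-based generation?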

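Key Findings

LCMs can generate zero-shot in the languages and modalities supported by the SONAR embeddings.
Diffusion-based and quantized variants are explored as ways to model the conditional distribution of continuous sentence embeddings.
A 7B-parameter diffusion LCM, trained on about 2.7T tokens, shows capabilities comparable to those of similarly sized models; the abstract reports that it outperforms existing LLMs of the same size in zero-shot generalization to many languages.
By operating on higher-level concepts rather than tokens, the architecture supports hierarchical reasoning over long-form content.
The SONAR encoders and decoders provide broad language coverage (200 languages for text, 76 for speech) as well as additional modalities.
The authors release the training code and SONAR components for community use.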

This review was generated by AI and reviewed by a human editor.