QUICK REVIEW

[Paper Review] Large Concept Models: Language Modeling in a Sentence Representation Space

the KSS Cave Studies Team, Loïc Barrault|arXiv (Cornell University)|Dec 11, 2024

Topic Modeling | Citations: 10

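One-line summary

This paper introduces Large Concept Models (LCMs), which generate autoregressively in a fixed sentence-embedding space (SONAR), enabling language- and modality-agnostic reasoning and strong zero-shot multilingual generalization. It compares base, diffusion-based, and quantized variants, scales to 7B parameters, runs a multilingual evaluation, and releases the training code and the encoders/decoders.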

ABSTRACT

LLMs have revolutionized the field of artificial intelligence and have emerged as the de-facto tool for many tasks. The current established technology of LLMs is to process input and generate output at the token level. This is in sharp contrast to humans who operate at multiple levels of abstraction, well beyond single words, to analyze information and to generate creative content. In this paper, we present an attempt at an architecture which operates on an explicit higher-level semantic representation, which we name a concept. Concepts are language- and modality-agnostic and represent a higher level idea or action in a flow. Hence, we build a "Large Concept Model". In this study, as proof of feasibility, we assume that a concept corresponds to a sentence, and use an existing sentence embedding space, SONAR, which supports up to 200 languages in both text and speech modalities. The Large Concept Model is trained to perform autoregressive sentence prediction in an embedding space. We explore multiple approaches, namely MSE regression, variants of diffusion-based generation, and models operating in a quantized SONAR space. These explorations are performed using 1.6B parameter models and training data in the order of 1.3T tokens. We then scale one architecture to a model size of 7B parameters and training data of about 2.7T tokens. We perform an experimental evaluation on several generative tasks, namely summarization and a new task of summary expansion. Finally, we show that our model exhibits impressive zero-shot generalization performance to many languages, outperforming existing LLMs of the same size. The training code of our models is freely available.

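Motivation and objectives

Operate in a language- and modality-agnostic embedding space to promote reasoning at a level of abstraction above tokens.
Evaluate whether sentence representations support coherent long-form generation and cross-lingual transfer.
Demonstrate the feasibility of autoregressive generation over SONAR embeddings across multiple architectures.
Evaluate zero-shot generalization across the languages and modalities supported by the SONAR encoders/decoders.
Provide open-source training code and components to encourage further research on concept-based modeling.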

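Proposed method

Use SONAR as an existing language- and modality-agnostic sentence embedding space, and represent the input as a sequence of concepts (sentences).
Train LCMs to autoregressively predict the next concept in the embedding space, using an MSE or diffusion-based objective, and explore quantized variants.
Consider three LCM variants: Base-LCM, with a standard decoder-only Transformer architecture; One-Tower diffusion LCM, with a single backbone; and Two-Tower diffusion LCM, which pairs a contextualizer with a denoiser.
Examine several diffusion noise schedules, including cosine, quadratic, and a newly introduced sigmoid schedule, and apply classifier-free diffusion guidance and epsilon-scaling at inference time.
Evaluate stopping criteria and decoding through the SONAR decoders, producing output in different languages and modalities without retraining the LCM.
Release open-source code for LCM training and for the SONAR encoders/decoders.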
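
To make the MSE objective and the inference-time guidance more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: `next_concept_loss`, `mean_of_prefix`, and `guided_estimate` are hypothetical names, the "model" is a toy stand-in that averages the preceding embeddings, and the guidance rule is the commonly used classifier-free guidance combination, assumed here to match what the paper applies.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def next_concept_loss(
    concepts: Sequence[Vector],
    predict_next: Callable[[Sequence[Vector]], Vector],
) -> float:
    """Average MSE of predicting concept n from concepts 0..n-1.

    Each position is predicted from the ground-truth prefix
    (teacher forcing), mirroring next-token training but on
    sentence embeddings instead of token ids.
    """
    if len(concepts) < 2:
        raise ValueError("need at least two concepts")
    losses = [
        mse(predict_next(concepts[:n]), concepts[n])
        for n in range(1, len(concepts))
    ]
    return sum(losses) / len(losses)


def mean_of_prefix(prefix: Sequence[Vector]) -> Vector:
    """Hypothetical stand-in predictor: mean of the preceding embeddings."""
    dim = len(prefix[0])
    return [sum(v[i] for v in prefix) / len(prefix) for i in range(dim)]


def guided_estimate(uncond: Vector, cond: Vector, scale: float) -> Vector:
    """Classifier-free guidance: uncond + scale * (cond - uncond).

    scale = 0 gives the unconditional estimate, scale = 1 the
    conditional one, and scale > 1 extrapolates past it.
    """
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


if __name__ == "__main__":
    # Three toy 2-dimensional "sentence embeddings" for one document.
    document = [[0.0, 0.0], [2.0, 2.0], [1.0, 1.0]]
    print(next_concept_loss(document, mean_of_prefix))
    print(guided_estimate([0.0, 1.0], [1.0, 3.0], 2.0))
```

In a real system the stand-in predictor would be replaced by a trained Transformer and the vectors by SONAR embeddings; the sketch only shows the shape of the objective and of the guidance step.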

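Experimental results

Research questions

RQ1: Can an autoregressive model operate effectively in a fixed sentence-embedding space and generate coherent long-form content?
RQ2: To what extent can diffusion-based and quantized approaches improve quality and diversity in embedding-space generation?
RQ3: How do LCMs perform on zero-shot multilingual generation compared with token-based LLMs of similar size?
RQ4: What are the benefits of a hierarchical, concept-centric architecture for long-context reasoning and modality-rich output?
RQ5: What are the practical challenges and trade-offs of sentence segmentation and embedding-based generation across many languages?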

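Key findings

LCMs can perform zero-shot generation across the languages and modalities supported by SONAR embeddings.
Diffusion-based and quantized variants are explored to model the conditional distribution over continuous sentence embeddings.
A 7B-parameter diffusion LCM, trained on about 2.7T tokens according to the abstract, shows competitive performance compared with existing models of the same size.
The architecture enables hierarchical reasoning over long documents by operating on higher-level concepts rather than tokens.
The SONAR-based encoders/decoders provide broad language coverage (200 languages for text, 76 for speech) and allow additional modalities.
The authors release the training code and SONAR components for community use.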
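
As a rough illustration of why working on concepts shortens the sequence the model has to handle, the sketch below splits a text into sentences and compares the number of sentences with the number of whitespace-separated words. Both helpers are hypothetical: the regex splitter is a naive stand-in for a real multilingual sentence segmenter, and word count is only a crude proxy for the token count of an actual tokenizer.

```python
import re
from typing import List, Tuple


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def sequence_lengths(text: str) -> Tuple[int, int]:
    """Return (word count, sentence count) for the same text.

    The first number approximates the length of a token-level
    sequence, the second the length of a concept-level sequence.
    """
    words = text.split()
    sentences = split_sentences(text)
    return len(words), len(sentences)


if __name__ == "__main__":
    sample = (
        "LCMs predict sentences. Each sentence is one concept. "
        "Tokens are never modeled directly!"
    )
    for sentence in split_sentences(sample):
        print(sentence)
    print(sequence_lengths(sample))  # (13, 3)
```

A splitter this simple would fail on abbreviations and on languages without sentence-final punctuation or whitespace, which is the kind of practical difficulty RQ5 refers to.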


This review was generated by AI and checked by a human editor.