QUICK REVIEW

[Paper Review] Jina Embeddings 2: 8192-Token General-Purpose Text Embeddings for Long Documents

Michael Günther, Jackmin Ong|arXiv (Cornell University)|Oct 30, 2023

Topic Modeling|Citations: 10

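One-Line Summary

Jina Embeddings v2 introduces an open-source BERT-based encoder that handles inputs of up to 8192 tokens and improves long-document embeddings; it matches state-of-the-art retrieval performance on MTEB while keeping its GLUE results robust.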

ABSTRACT

Text embedding models have emerged as powerful tools for transforming sentences into fixed-sized feature vectors that encapsulate semantic information. While these models are essential for tasks like information retrieval, semantic clustering, and text re-ranking, most existing open-source models, especially those built on architectures like BERT, struggle to represent lengthy documents and often resort to truncation. One common approach to mitigate this challenge involves splitting documents into smaller paragraphs for embedding. However, this strategy results in a much larger set of vectors, consequently leading to increased memory consumption and computationally intensive vector searches with elevated latency. To address these challenges, we introduce Jina Embeddings 2, an open-source text embedding model capable of accommodating up to 8192 tokens. This model is designed to transcend the conventional 512-token limit and adeptly process long documents. Jina Embeddings 2 not only achieves state-of-the-art performance on a range of embedding-related tasks in the MTEB benchmark but also matches the performance of OpenAI's proprietary ada-002 model. Additionally, our experiments indicate that an extended context can enhance performance in tasks such as NarrativeQA.
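The abstract's point about chunking producing "a much larger set of vectors" can be made concrete with a little arithmetic. The sketch below is only an illustration: the helper `num_vectors` and the token counts in `docs` are hypothetical and do not come from the paper.

```python
import math


def num_vectors(doc_token_counts, max_tokens):
    """Count the embedding vectors needed when every document is split
    into consecutive chunks of at most max_tokens tokens."""
    return sum(math.ceil(n / max_tokens) for n in doc_token_counts)


# Hypothetical corpus: token counts per document (illustrative values only).
docs = [300, 2000, 8192]

chunked = num_vectors(docs, 512)   # 512-token model, documents split into chunks
single = num_vectors(docs, 8192)   # 8192-token model, one vector per document
print(chunked, single)             # prints "21 3"
```

Every extra vector has to be stored and searched, which is the memory and latency cost the abstract refers to.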

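Motivation and Objectives

Address the challenge of representing long documents without truncating them to a fixed input length or producing an excessive number of chunk vectors.
Develop and fine-tune a family of encoder models that support 8192 tokens, built on a modified BERT backbone.
Use bidirectional ALiBi attention so that long contexts can be encoded without conventional positional embeddings.
Demonstrate the effectiveness of the embeddings on retrieval, clustering, and long-document tasks across standard benchmarks.
Make the models and datasets broadly accessible through Hugging Face.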

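Proposed Method

Modify a BERT-like backbone by building bidirectional ALiBi attention into the encoder in place of standard positional embeddings, supporting up to 8192 tokens.
Pretrain on the English C4 corpus with MLM only (no NSP), using whole-word masking at a masking rate of 30%.
Fine-tune the embeddings in two stages: (a) contrastive learning on text pairs, with mean pooling to produce a single vector representation; (b) supervised fine-tuning with hard negatives to improve ranking and retrieval performance.
Use an InfoNCE-based loss over the pairs in a batch, applied in both directions (a bidirectional objective), with temperature τ = 0.05.
Train with large batches, mixed precision, DeepSpeed, and activation checkpointing to manage memory.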
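The pooling and loss steps listed above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, cosine similarity and the summing of the two loss directions are assumptions, and only the mean pooling, the InfoNCE form, and τ = 0.05 are taken from the summary.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask is 1 (padding is ignored)."""
    kept = [v for v, m in zip(token_vectors, attention_mask) if m]
    dim = len(token_vectors[0])
    return [sum(v[d] for v in kept) / len(kept) for d in range(dim)]


def cosine(a, b):
    """Cosine similarity between two vectors (assumed similarity function)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(anchors, targets, tau=0.05):
    """One-direction InfoNCE: anchors[i] should match targets[i];
    every other target in the batch acts as a negative."""
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [cosine(a, t) / tau for t in targets]
        top = max(logits)  # subtract the max for numerical stability
        log_denom = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denom - logits[i]
    return total / len(anchors)


def bidirectional_info_nce(queries, passages, tau=0.05):
    """Query-to-passage loss plus passage-to-query loss (sum is assumed)."""
    return info_nce(queries, passages, tau) + info_nce(passages, queries, tau)


# Toy batch of two already-pooled embeddings per side (illustrative values).
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = bidirectional_info_nce(queries, [[1.0, 0.0], [0.0, 1.0]])
shuffled = bidirectional_info_nce(queries, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

With a small temperature such as 0.05, similarity differences are amplified before the softmax, so a batch whose pairs line up correctly gets a loss near zero, while a mismatched batch is penalised heavily.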

Figure 1: With ALiBi attention, a linear bias is incorporated into each attention score preceding the softmax operation. Each attention head employs a distinct constant scalar, $m$ , which diversifies its computation. Our model adopts the encoder variant where all tokens mutually attend during calcu
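The bias described in the caption can be sketched as follows. This is a rough illustration under stated assumptions: the slopes follow the geometric sequence from the original ALiBi recipe for a power-of-two head count, the encoder variant is taken to penalise the absolute distance |i - j|, and all function names are hypothetical.

```python
import math


def alibi_slopes(num_heads):
    """Head-specific slopes m as a geometric sequence (original ALiBi
    recipe for a power-of-two number of heads; assumed here)."""
    return [2 ** (-8 * (h + 1) / num_heads) for h in range(num_heads)]


def symmetric_alibi_bias(seq_len, m):
    """Encoder variant: bias[i][j] = -m * |i - j|, so tokens attend in
    both directions and distant tokens receive a larger penalty."""
    return [[-m * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]


def softmax(row):
    top = max(row)
    exps = [math.exp(x - top) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def biased_attention_weights(scores, m):
    """Add the linear bias to the raw attention scores before the softmax."""
    n = len(scores)
    bias = symmetric_alibi_bias(n, m)
    return [softmax([scores[i][j] + bias[i][j] for j in range(n)])
            for i in range(n)]


# With identical raw scores, the weights depend only on token distance.
slopes = alibi_slopes(8)
weights = biased_attention_weights([[0.0] * 3 for _ in range(3)], slopes[0])
print(weights[0])
```

Because the bias depends only on relative distance, nothing in it is tied to a maximum position index, which is what lets such a model be applied to sequences longer than those seen in training.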

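Experimental Results

Research Questions

RQ1 Does ALiBi-based bidirectional attention enable bi-encoder embeddings of long documents without truncation to 512 tokens?
RQ2 Do 8192-token embeddings outperform earlier open-source models on the MTEB benchmark and approach the performance of OpenAI's ada-002?
RQ3 How does increasing the context length affect downstream tasks such as NarrativeQA and long-document clustering and retrieval?
RQ4 How does two-stage fine-tuning (text pairs, then hard negatives) affect retrieval and non-retrieval tasks?
RQ5 Are the Jina Embeddings v2 models open source, available via Hugging Face, and competitive across benchmarks?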

Key Findings

Model	Params	MNLI	QQP	QNLI	SST-2	CoLA	STS-B	MRPC	RTE	WNLI	Average
BERT Base	110M	84.6/83.4	71.2	90.5	93.5	52.1	85.8	88.9	66.4	-	-
BERT Large	340M	86.7/85.9	72.1	92.7	94.9	60.5	86.5	89.3	70.1	-	-
RoBERTa	355M	90.8/90.2	90.2	98.9	96.7	67.8	92.2	92.3	88.2	89.0	88.5
Jina BERT Small	33M	80.1/78.9	78.9	86.0	89.6	28.8	84.8	84.1	68.8	55.5	72.9
Jina BERT Base	137M	85.7/85.4	80.7	92.2	94.5	51.4	89.5	88.4	78.7	65.1
Jina BERT Large	435M	86.6/85.9	80.9	92.5	95.0	59.6	88.2	88.5	78.5	65.1

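The 8192-token Jina BERT-based encoder achieves state-of-the-art results on several MTEB tasks and matches ada-002 on the benchmark.
Bidirectional ALiBi attention makes it possible to encode long contexts without positional embeddings, and MLM accuracy is maintained up to 8192 tokens.
Long-context fine-tuning with hard negatives improves retrieval and ranking performance on retrieval-focused tasks.
Evaluation with longer contexts shows improved performance on narrative tasks such as NarrativeQA and on long-document clustering, although the effect is mixed depending on document structure.
The models and datasets are published on Hugging Face and openly accessible.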

Figure 2: Variation of model MLM accuracy w.r.t. the sequence length

This review was generated by AI and checked by a human editor.