QUICK REVIEW

[Paper Review] AnglE-optimized Text Embeddings

Xianming Li, Jing Li|arXiv (Cornell University)|Sep 22, 2023

Topic Modeling|Citations: 21

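One-line summary

AnglE introduces angle optimization in a complex space to mitigate cosine saturation when training text embeddings. Evaluated with a newly collected long-text STS dataset from GitHub Issues, it reports state-of-the-art results on both transfer and non-transfer STS tasks and improves downstream retrieval performance.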

ABSTRACT

High-quality text embedding is pivotal in improving semantic textual similarity (STS) tasks, which are crucial components in Large Language Model (LLM) applications. However, a common challenge existing text embedding models face is the problem of vanishing gradients, primarily due to their reliance on the cosine function in the optimization objective, which has saturation zones. To address this issue, this paper proposes a novel angle-optimized text embedding model called AnglE. The core idea of AnglE is to introduce angle optimization in a complex space. This novel approach effectively mitigates the adverse effects of the saturation zone in the cosine function, which can impede gradient and hinder optimization processes. To set up a comprehensive STS evaluation, we experimented on existing short-text STS datasets and a newly collected long-text STS dataset from GitHub Issues. Furthermore, we examine domain-specific STS scenarios with limited labeled data and explore how AnglE works with LLM-annotated data. Extensive experiments were conducted on various tasks including short-text STS, long-text STS, and domain-specific STS tasks. The results show that AnglE outperforms the state-of-the-art (SOTA) STS models that ignore the cosine saturation zone. These findings demonstrate the ability of AnglE to generate high-quality text embeddings and the usefulness of angle optimization in STS.

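Motivation and objectives

Motivates the need for high-quality text embeddings for STS in LLM applications.
Identifies cosine saturation as a major source of vanishing gradients in supervised STS training.
Proposes AnglE, which optimizes angles in a complex space to mitigate saturation effects.
Extends evaluation with a new long-text STS dataset collected from GitHub Issues.
Examines LLM-supervised learning to address the scarcity of domain data, and evaluates the effect on retrieval.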

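Proposed method

Input sentences are padded to a fixed length and encoded with a backbone (BERT, RoBERTa, or LLaMA) to obtain contextual representations.
A standard cosine objective is optimized with a cross-entropy-style ranking loss with temperature τ, so that pairs labeled as more similar receive higher cosine similarity.
An in-batch negative objective uses the other samples in the batch as negatives, and identifies duplicates within the batch as positives to reduce noise.
An angle objective is introduced by computing the absolute normalized angle difference between paired representations in a complex space, which mitigates cosine saturation.
The three objectives are combined into the final loss: L = w1*L_cos + w2*L_ibn + w3*L_angle, where the weights and the temperature are tunable.
Evaluation uses established STS benchmarks (short and long text) and the newly proposed GitHub Issues Similarity Dataset.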
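The way the objectives combine can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the default values of `tau`, `w1`, `w2`, and `w3`, and the specific angle measure (mean absolute phase difference between corresponding complex components, divided by π) are all hypothetical stand-ins. The in-batch negative term is passed in as a precomputed number (`l_ibn`) rather than implemented.

```python
import cmath
import math


def cosine(u, v):
    """Cosine similarity between two equal-length real vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def angle_difference(u, v):
    """Illustrative absolute angle difference, normalized to [0, 1].

    Assumption: the first half of each vector holds real parts and the
    second half holds imaginary parts. This is a stand-in measure, not
    the formula from the paper.
    """
    half = len(u) // 2
    total = 0.0
    for k in range(half):
        z = complex(u[k], u[half + k])
        w = complex(v[k], v[half + k])
        # phase(z * conj(w)) equals the phase of z / w, without dividing by zero
        total += abs(cmath.phase(z * w.conjugate()))
    return total / (half * math.pi)


def rank_loss(scores, labels, tau):
    """log(1 + sum exp((s_j - s_i) / tau)) over pairs with labels[i] > labels[j]."""
    terms = [
        math.exp((scores[j] - scores[i]) / tau)
        for i in range(len(scores))
        for j in range(len(scores))
        if labels[i] > labels[j]
    ]
    return math.log1p(sum(terms))


def combined_loss(pairs, labels, tau=0.05, w1=1.0, w2=1.0, w3=1.0, l_ibn=0.0):
    """Weighted sum L = w1*L_cos + w2*L_ibn + w3*L_angle (hypothetical defaults)."""
    cos_scores = [cosine(u, v) for u, v in pairs]
    # A smaller angle difference means more similar, so negate it to get a score.
    angle_scores = [-angle_difference(u, v) for u, v in pairs]
    l_cos = rank_loss(cos_scores, labels, tau)
    l_angle = rank_loss(angle_scores, labels, tau)
    return w1 * l_cos + w2 * l_ibn + w3 * l_angle
```

Calling `combined_loss(pairs, labels)` with a list of `(u, v)` embedding pairs and their similarity labels returns a small value when the cosine and angle rankings agree with the labels, and a large value when they are reversed.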

Figure 1: The saturation zones of the cosine function. The gradient at saturation zones is close to zero. During backpropagation, if the gradient is very small, it could kill the gradient and make the network difficult to learn.
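The saturation effect described in the caption can be seen numerically. The derivative of cos(θ) with respect to θ is -sin(θ), which approaches zero as θ approaches 0 or π. The short sketch below evaluates it at a few angles; the function name and the sample angles are illustrative choices.

```python
import math


def cos_gradient(theta):
    """Derivative of cos(theta) with respect to theta, i.e. -sin(theta)."""
    return -math.sin(theta)


# Near 0 and near pi the gradient magnitude is tiny (saturation zones);
# near pi/2 it is close to 1.
for theta in (0.01, math.pi / 2, math.pi - 0.01):
    print(f"theta={theta:.2f}  cos={math.cos(theta):+.4f}  grad={cos_gradient(theta):+.4f}")
```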

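Experimental results

Research questions

RQ1: Can angle optimization in a complex space improve STS embeddings over a cosine-only objective?
RQ2: Does AnglE provide robust improvements in both transfer and non-transfer STS settings, and is it especially effective on long-text data?
RQ3: How does AnglE perform with limited labeled data and when augmented with LLM-supervised labeling?
RQ4: What effect does angle optimization have on retrieval and downstream tasks?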

Key findings

Model	STS12	STS13	STS14	STS15	STS16	STS-B	SICK-R	Avg.
AnglE-LLaMA2-7B	79.00	90.56	85.79	89.43	87.00	88.97	80.94	85.96
AnglE-BERT	75.09	85.56	80.66	86.44	82.47	85.16	81.23	82.37
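The final column of the table can be checked by recomputing the unweighted mean of the seven benchmark scores in each row. The sketch below does this with the values copied from the table; the dictionary and function names are illustrative.

```python
# Per-benchmark scores copied from the table rows above (seven benchmarks each).
ROWS = {
    "AnglE-LLaMA2-7B": [79.00, 90.56, 85.79, 89.43, 87.00, 88.97, 80.94],
    "AnglE-BERT": [75.09, 85.56, 80.66, 86.44, 82.47, 85.16, 81.23],
}


def mean_score(scores):
    """Unweighted mean of the benchmark scores, rounded to two decimals."""
    return round(sum(scores) / len(scores), 2)


for model, scores in ROWS.items():
    print(model, mean_score(scores))
```

Both recomputed means agree with the reported averages of 85.96 and 82.37.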

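AnglE outperforms state-of-the-art STS models that ignore cosine saturation, in both transfer and non-transfer settings.
In transfer STS, AnglE-BERT and AnglE-LLaMA2-7B improve on their SimCSE baselines by 0.80% and 0.72% on average, respectively.
In non-transfer STS, AnglE achieves average Spearman correlations of 73.55% (AnglE-BERT) and 85.96% (AnglE-LLaMA2-7B) on STS tasks, outperforming the SBERT and C/S baselines.
The newly collected GitHub Issues dataset is shown to be suitable for evaluating STS on long texts (over 512 tokens).
Ablation results show that the angle objective contributes substantially to performance, often more than the in-batch negative component.
AnglE trained with LLM supervision (e.g., labels from ChatGPT, LLaMA, or ChatGLM) further improves STS performance, and the ensemble of these results outperforms the unsupervised baselines.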
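The findings above are reported as Spearman correlations between predicted similarity scores and gold labels. As a rough illustration of how that metric is computed, the sketch below ranks both lists (averaging ranks for ties) and takes the Pearson correlation of the ranks. The function names are illustrative; an evaluation pipeline would more likely call an existing library routine.

```python
import math


def average_ranks(values):
    """1-based ranks of the values, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(predicted, gold):
    """Spearman correlation: Pearson correlation of the rank-transformed lists."""
    return pearson(average_ranks(predicted), average_ranks(gold))
```

Because only ranks matter, any monotonic rescaling of the predicted similarities leaves the score unchanged.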

This review was generated by AI and checked by a human editor.