QUICK REVIEW

[论文解读] Improving Text Embeddings with Large Language Models

Liang Wang, Nan Yang|arXiv (Cornell University)|Dec 31, 2023

Topic Modeling被引用 13

一句话总结

作者使用专有的大语言模型生成多样的合成数据来训练开源解码器模型以获得文本嵌入，在训练步数不足1k且无需标注数据的情况下取得强劲结果；并在使用合成数据加上少量标注数据时，在 BEIR 和 MTEB 上达到最先进水平。

ABSTRACT

In this paper, we introduce a novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps. Unlike existing methods that often depend on multi-stage intermediate pre-training with billions of weakly-supervised text pairs, followed by fine-tuning with a few labeled datasets, our method does not require building complex training pipelines or relying on manually collected datasets that are often constrained by task diversity and language coverage. We leverage proprietary LLMs to generate diverse synthetic data for hundreds of thousands of text embedding tasks across 93 languages. We then fine-tune open-source decoder-only LLMs on the synthetic data using standard contrastive loss. Experiments demonstrate that our method achieves strong performance on highly competitive text embedding benchmarks without using any labeled data. Furthermore, when fine-tuned with a mixture of synthetic and labeled data, our model sets new state-of-the-art results on the BEIR and MTEB benchmarks.

研究动机与目标

在没有多阶段管线或大量标注数据的前提下，推动改进文本嵌入的研究。
提出一种简单的合成数据生成管线，使用多语言和多任务的 LLM。
证明在合成数据上对开源 LLM 进行微调可产生具有竞争力的嵌入。
在结合标注数据时，在 BEIR 和 MTEB 上达到最先进的结果。
讨论该方法的多语言性与长上下文能力。

提出的方法

使用专有 LLM 在 93 种语言中头脑风暴并生成特定任务的合成（三元组：查询、正例、难负样本）。
采用两步提示策略：先头脑风暴任务池，再在任务定义条件下生成数据。
在合成数据上（若可用还包括 MS MARCO）对开源解码器型 LLM（Mistral-7B）使用 InfoNCE 对比损失进行微调。
使用预训练 LLM 的最后一个令牌嵌入作为查询/文档表示，余弦相似度的温度 tau=0.02。
采用 LoRA，秩为 16，并使用梯度检查点、混合精度、DeepSpeed ZeRO-3 等训练技巧来实现 <1k 步。
通过 RoPE 旋转基底调整和合成的长上下文任务探索长上下文能力。

Figure 1: An example two-step prompt template for generating synthetic data with GPT-4. We first prompt GPT-4 to brainstorm a list of potential retrieval tasks, and then generate (query, positive, hard negative) triplets for each task. “ { … } ” denotes a placeholder that will be replaced by samplin

实验结果

研究问题

RQ1是否可以用 LLM 生成的合成数据在单阶段训练下学习出高质量的文本嵌入？
RQ2单独的合成数据与合成数据加标注数据对基准性能（BEIR、MTEB）有何影响？
RQ3多语言覆盖如何影响高资源和低资源语言的嵌入质量？
RQ4对比学习的预训练对高度预训练的 LLM 是否必要以获得良好嵌入？
RQ5嵌入是否能扩展到长上下文任务和个性化检索场景？

主要发现

数据集数量	# 的数据集	类别	聚类	配对类别	再次排序	检索	STS	摘要	平均
56	12	11	3	4	15	10	1	66.6
56	Unsupervised Models	Glove	57.3	27.7	70.9	43.3	21.6	61.9	28.9	42.0
56	SimCSE bert-unsup	62.5	29.0	70.3	46.5	20.3	74.3	31.2	45.5
56	Supervised Models	SimCSE bert-sup	67.3	33.4	73.7	47.5	21.8	79.1	23.3	48.7
56	Contriever	66.7	41.1	82.5	53.1	41.9	76.5	30.4	56.0
56	GTR xxl	67.4	42.4	86.1	56.7	48.5	78.4	30.6	59.0
56	Sentence-T5 xxl	73.4	43.7	85.1	56.4	42.2	82.6	30.1	59.5
56	E5 large-v2	75.2	44.5	86.0	56.6	50.6	82.1	30.2	62.3
56	GTE large	73.3	46.8	85.0	59.1	52.2	83.4	31.7	63.1
56	BGE large-en-v1.5	76.0	46.1	87.1	60.0	54.3	83.1	31.6	64.2
56	Ours E5 mistral-7b full data	78.5	50.3	88.3	60.2	56.9	84.6	31.4	66.6
56	Ours w/ synthetic data only	78.2	50.5	86.0	59.0	46.9	81.2	31.9	63.1
56	Ours w/ synthetic + msmarco	78.3	49.9	87.1	59.5	52.2	81.2	32.7	64.5

仅用合成数据进行训练就可在 MTEB 上取得具有竞争力的性能，且无需任何标注数据。
合成数据与标注数据混合微调在 BEIR 和 MTEB 基准上达到最先进的结果。
仅合成数据的模型平均 MTEB 得分为 63.1，而合成数据加 MS-MARCO 达到 64.5，全部数据达到 66.6。
E5_mistral-7b 使用全部数据在 MTEB 平均分上超越现有最优结果 2.4 点。
相较于其他架构，基于 Mistral-7B 的对比学习预训练对性能的提升较小，表明自回归预训练提供了强大的表征能力。
该方法可扩展到 93 种语言，并通过上下文长度扩展展示了长上下文能力，在高资源语言上获得最佳结果。

Figure 2: Task type and language statistics of the generated synthetic data (see Section 3.1 for task type definitions). The “Others” category contains the remaining languages from the XLM-R language list.

更好的研究，从现在开始

从论文设计到论文写作，大幅缩短您的研究时间。

无需绑定信用卡

本解读由 AI 生成，并经人工编辑审核。