QUICK REVIEW

[Paper Review] MTEB: Massive Text Embedding Benchmark

Niklas Muennighoff, Nouamane Tazi|arXiv (Cornell University)|Oct 13, 2022

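Topic Modeling | Cited by 62

One-Sentence Summary

MTEB benchmarks 33 models on 58 datasets spanning 8 embedding tasks and 112 languages, mapping each model's strengths and weaknesses and showing that no single model is best across all tasks.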

ABSTRACT

Text embeddings are commonly evaluated on a small set of datasets from a single task not covering their possible applications to other tasks. It is unclear whether state-of-the-art embeddings on semantic textual similarity (STS) can be equally well applied to other tasks like clustering or reranking. This makes progress in the field difficult to track, as various models are constantly being proposed without proper evaluation. To solve this problem, we introduce the Massive Text Embedding Benchmark (MTEB). MTEB spans 8 embedding tasks covering a total of 58 datasets and 112 languages. Through the benchmarking of 33 models on MTEB, we establish the most comprehensive benchmark of text embeddings to date. We find that no particular text embedding method dominates across all tasks. This suggests that the field has yet to converge on a universal text embedding method and scale it up sufficiently to provide state-of-the-art results on all embedding tasks. MTEB comes with open-source code and a public leaderboard at https://github.com/embeddings-benchmark/mteb.

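Motivation and Goals

Provide a broad, standardized evaluation framework for text embeddings across diverse tasks and languages.
Assess the transferability and general applicability of self-supervised and supervised embedding models.
Quantify performance, efficiency, and multilinguality to guide model selection for different embedding use cases.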

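Proposed Method

Defines 8 embedding task types (bitext mining, classification, clustering, pair classification, reranking, retrieval, STS, summarization).
Aggregates 58 datasets covering 112 languages into a unified evaluation pipeline that compares fixed embeddings by cosine similarity.
Benchmarks 33 models (open-source and API-based) with consistent preprocessing and evaluators; compares accuracy, correlation, MRR, MAP, nDCG, and other metrics.
Provides open-source tooling and a public leaderboard, so that new models or datasets can be added with minimal code (fewer than 10 lines).
Analyzes scale (model size), efficiency (latency/throughput), and multilingual performance across tasks.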
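The cosine-similarity evaluation described above can be illustrated with a minimal sketch. It assumes toy 3-dimensional vectors in place of real model outputs. The vectors, the relevance labels, and the helper names (`cosine`, `rank_documents`, `reciprocal_rank`, `ndcg_at_k`) are all hypothetical and are not taken from the MTEB codebase. The sketch only shows how fixed embeddings become a ranking and then MRR- and nDCG-style scores.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_documents(query_vec, doc_vecs):
    """Return document ids sorted by descending cosine similarity to the query."""
    return sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)


def reciprocal_rank(ranking, relevant):
    """1 / rank of the first relevant document, or 0.0 if none appears."""
    for position, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0


def ndcg_at_k(ranking, relevant, k):
    """nDCG@k with binary relevance labels."""
    dcg = sum(
        1.0 / math.log2(i + 1)
        for i, doc_id in enumerate(ranking[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 1) for i in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical toy embeddings; a real run would use a model's output vectors.
query = [1.0, 0.0, 0.0]
docs = {
    "d1": [0.0, 1.0, 0.0],  # orthogonal to the query
    "d2": [1.0, 1.0, 0.0],  # 45 degrees from the query
    "d3": [1.0, 0.1, 0.0],  # almost parallel to the query
}
relevant = {"d2"}  # hypothetical gold label

ranking = rank_documents(query, docs)
rr = reciprocal_rank(ranking, relevant)
ndcg = ndcg_at_k(ranking, relevant, k=3)
print(ranking, rr, round(ndcg, 4))
```

Averaging `reciprocal_rank` over many queries gives MRR. The embeddings are never updated during scoring, which is what lets one pipeline compare very different models.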

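Experimental Results

Research Questions

RQ1: Which embedding models perform best on which tasks in MTEB?
RQ2: Do self-supervised models close the gap with supervised models on all tasks?
RQ3: How does model size affect performance and efficiency across tasks?
RQ4: How does multilingual pre-training affect cross-lingual and multilingual tasks?
RQ5: Is there a universal embedding model that dominates on most embedding tasks?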

Key Findings

Model	Classification	Clustering	Pair Classification	Reranking	Retrieval	STS	Summarization	Average
ST5-Base	69.81	40.21	85.17	53.09	33.63	81.14	31.39	55.27
ST5-Large	72.31	41.65	84.97	54.00	36.71	81.83	29.64	57.06
ST5-XL	72.84	42.34	86.06	54.71	38.47	81.66	29.91	57.87
ST5-XXL	73.42	43.71	85.06	56.43	42.24	82.63	30.08	59.51
GTR-XXL	67.41	42.42	86.12	56.65	48.48	78.38	30.64	58.97
GTR-Large	67.14	41.60	85.33	55.36	47.42	77.80	29.50	58.28
GTR-XL	67.11	41.51	86.13	55.96	47.96	77.80	30.21	58.42
MPNet	65.07	43.69	83.04	59.36	43.81	80.28	27.49	57.78
MPNet-multilingual	67.91	38.40	80.81	53.80	35.34	80.73	31.57	54.71
OpenAI Ada Similarity	70.44	37.52	76.86	49.02	18.36	78.60	26.94	49.52
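
The final column is not the plain mean of the seven task columns. The sketch below assumes the average is taken over individual English datasets, so each task is weighted by its dataset count. The counts (12, 11, 3, 4, 15, 10, 1) are not stated in this summary; they are an assumption that should be checked against the paper. Under that assumption the weighted mean matches the reported figure for the three rows tested.

```python
# ASSUMPTION: number of English datasets per task; not given in this summary.
DATASETS_PER_TASK = [12, 11, 3, 4, 15, 10, 1]

# Task scores copied from the table, in column order:
# classification, clustering, pair classification, reranking,
# retrieval, STS, summarization.
SCORES = {
    "ST5-Base": [69.81, 40.21, 85.17, 53.09, 33.63, 81.14, 31.39],
    "ST5-XXL": [73.42, 43.71, 85.06, 56.43, 42.24, 82.63, 30.08],
    "GTR-XXL": [67.41, 42.42, 86.12, 56.65, 48.48, 78.38, 30.64],
}

# Last column of the table for the same rows.
REPORTED = {"ST5-Base": 55.27, "ST5-XXL": 59.51, "GTR-XXL": 58.97}


def simple_mean(scores):
    """Unweighted mean over the task columns."""
    return sum(scores) / len(scores)


def dataset_weighted_mean(scores, counts):
    """Mean over datasets, approximated from per-task averages and counts."""
    total = sum(score * count for score, count in zip(scores, counts))
    return total / sum(counts)


for model, scores in SCORES.items():
    weighted = dataset_weighted_mean(scores, DATASETS_PER_TASK)
    print(f"{model}: simple={simple_mean(scores):.2f} "
          f"weighted={weighted:.2f} reported={REPORTED[model]:.2f}")
```

Because retrieval and classification contribute the most datasets under this assumption, a model's retrieval score moves the final column far more than its summarization score does.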

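No single embedding method dominates across all tasks; performance varies by task and dataset.
Model size generally correlates with performance; multi-billion-parameter models lead on many English tasks, but at higher cost.
Retrieval tasks favor models trained or fine-tuned on asymmetric text pairs (query and document), whereas STS-like tasks favor symmetric embeddings; a model optimized for one task is not guaranteed to perform well on others.
ST5-XXL has the highest English average score, but GTR-XXL and the MPNet variants also excel on specific tasks; efficiency and task fit remain key to model selection.
Bitext mining is dominated by LaBSE; in clustering, smaller models such as MPNet are also competitive; multilingual performance varies by language and dataset.
Multilingual MPNet often delivers strong multilingual classification and STS results, while SGPT-BLOOM-7B1-msmarco performs well on languages seen during pre-training.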
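
The first finding can be checked directly against the ten rows of the results table. The sketch below covers only those rows, not all 33 benchmarked models, and the helper name `best_per_task` is hypothetical.

```python
TASKS = [
    "Classification", "Clustering", "Pair Classification",
    "Reranking", "Retrieval", "STS", "Summarization",
]

# Per-task scores copied from the results table (Average column omitted).
ROWS = {
    "ST5-Base": [69.81, 40.21, 85.17, 53.09, 33.63, 81.14, 31.39],
    "ST5-Large": [72.31, 41.65, 84.97, 54.00, 36.71, 81.83, 29.64],
    "ST5-XL": [72.84, 42.34, 86.06, 54.71, 38.47, 81.66, 29.91],
    "ST5-XXL": [73.42, 43.71, 85.06, 56.43, 42.24, 82.63, 30.08],
    "GTR-XXL": [67.41, 42.42, 86.12, 56.65, 48.48, 78.38, 30.64],
    "GTR-Large": [67.14, 41.60, 85.33, 55.36, 47.42, 77.80, 29.50],
    "GTR-XL": [67.11, 41.51, 86.13, 55.96, 47.96, 77.80, 30.21],
    "MPNet": [65.07, 43.69, 83.04, 59.36, 43.81, 80.28, 27.49],
    "MPNet-multilingual": [67.91, 38.40, 80.81, 53.80, 35.34, 80.73, 31.57],
    "OpenAI Ada Similarity": [70.44, 37.52, 76.86, 49.02, 18.36, 78.60, 26.94],
}


def best_per_task(rows, tasks):
    """Map each task name to the (model, score) pair with the highest score."""
    winners = {}
    for column, task in enumerate(tasks):
        model = max(rows, key=lambda name: rows[name][column])
        winners[task] = (model, rows[model][column])
    return winners


winners = best_per_task(ROWS, TASKS)
for task, (model, score) in winners.items():
    print(f"{task:<20} {model:<22} {score:.2f}")

distinct_winners = {model for model, _ in winners.values()}
print(len(distinct_winners), "different models lead at least one task")
```

Among these rows, five different models each lead at least one task column, and several margins are only a few hundredths of a point, so the top model changes with the task being scored.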

This review was generated by AI and reviewed by human editors.