QUICK REVIEW

[Paper Review] M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation

J.B. Chen, Shitao Xiao | arXiv (Cornell University) | 2024. 02. 05.

Natural Language Processing Techniques | Citations: 46

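One-Line Summary

M3-Embedding is a general-purpose text embedding model that supports more than 100 languages, provides three retrieval functions (dense, sparse, and multi-vector), and handles inputs of up to 8,192 tokens. The model is trained with self-knowledge distillation and an efficient batching strategy.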

ABSTRACT

In this paper, we introduce a new embedding model called M3-Embedding, which is distinguished for its versatility in Multi-Linguality, Multi-Functionality, and Multi-Granularity. It provides a uniform support for the semantic retrieval of more than 100 working languages. It can simultaneously accomplish the three common retrieval functionalities: dense retrieval, multi-vector retrieval, and sparse retrieval. Besides, it is also capable of processing inputs of different granularities, spanning from short sentences to long documents of up to 8,192 tokens. The effective training of M3-Embedding presents a series of technical contributions. Notably, we propose a novel self-knowledge distillation approach, where the relevance scores from different retrieval functionalities can be integrated as the teacher signal to enhance the training quality. We also optimize the batching strategy, which enables a large batch size and high training throughput to improve the discriminativeness of embeddings. M3-Embedding exhibits a superior performance in our experiment, leading to new state-of-the-art results on multilingual, cross-lingual, and long-document retrieval benchmarks.

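Research Motivation and Goals

Address the need for a single general-purpose text embedding model that works across many languages.
Enable multiple retrieval functions (dense, sparse, and multi-vector) within one model.
Handle inputs ranging from short sentences to long documents (up to 8,192 tokens).
Propose a training framework that uses self-knowledge distillation to integrate heterogeneous retrieval signals.
Improve training efficiency through optimized batching and data curation that sustain high throughput.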

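Proposed Method

Introduce a single embedding model that supports dense, sparse, and multi-vector retrieval in one unified framework.
Use the [CLS] token embedding for dense retrieval and the other token embeddings for sparse and multi-vector retrieval.
Propose self-knowledge distillation to fuse the predictions from the heterogeneous retrieval signals into a single teacher signal.
Use large-scale, multi-source multilingual datasets (unsupervised, supervised, and synthetic) for training and fine-tuning.
Optimize batching and data handling to enable large batches and long inputs, and implement the MCLS inference strategy for long documents.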
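
To make the three retrieval functions listed above more concrete, here is a minimal sketch of how each relevance score could be computed from one encoder's outputs. This summary gives no formulas, so the definitions are assumptions based on common formulations: cosine similarity of [CLS] embeddings for dense retrieval, products of term weights over shared tokens for sparse retrieval, and ColBERT-style late interaction for multi-vector retrieval. All function names and the equal default weights are hypothetical, not taken from the paper or any library.

```python
import math


def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """L2-normalize a vector; a zero vector is returned unchanged."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dense_score(q_cls, p_cls):
    """Dense retrieval (assumed form): cosine similarity of [CLS] embeddings."""
    return dot(normalize(q_cls), normalize(p_cls))


def sparse_score(q_weights, p_weights):
    """Sparse retrieval (assumed form): sum of weight products over the
    terms that appear in both the query and the passage.
    Each argument maps a token to a non-negative term weight."""
    return sum(w * p_weights[t] for t, w in q_weights.items() if t in p_weights)


def multi_vector_score(q_tokens, p_tokens):
    """Multi-vector retrieval (assumed form): for every query token vector,
    take its best-matching passage token vector, then average the maxima."""
    q = [normalize(v) for v in q_tokens]
    p = [normalize(v) for v in p_tokens]
    return sum(max(dot(qv, pv) for pv in p) for qv in q) / len(q)


def combined_score(s_dense, s_sparse, s_multi, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three scores; the weights are illustrative."""
    w1, w2, w3 = weights
    return w1 * s_dense + w2 * s_sparse + w3 * s_multi


if __name__ == "__main__":
    s_d = dense_score([1.0, 0.0], [1.0, 0.0])
    s_s = sparse_score({"a": 0.5, "b": 2.0}, {"b": 0.5, "c": 1.0})
    s_m = multi_vector_score([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0]])
    print(s_d, s_s, s_m, combined_score(s_d, s_s, s_m))
```

In a real system the [CLS] vector, the per-token vectors, and the term weights would all come from the same forward pass of the encoder; here they are passed in as plain lists and dictionaries so the scoring logic can be read in isolation.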

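Experimental Results

Research Questions

RQ1: Can a single embedding model achieve state-of-the-art performance across multiple languages and retrieval paradigms?
RQ2: How can dense, sparse, and multi-vector retrieval signals be trained jointly using self-knowledge distillation?
RQ3: What data and training strategies are needed to support long-document retrieval and a wide range of input granularities?
RQ4: Does efficient batching enable high-throughput training without sacrificing the discriminativeness of the embeddings?
RQ5: How does M3-Embedding compare with baselines on the multilingual and cross-lingual benchmarks MIRACL and MKQA?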

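Key Results

M3-Embedding performs strongly on multilingual and cross-lingual retrieval, achieving state-of-the-art results on the MIRACL and MKQA benchmarks.
The model learns all three retrieval functions (dense, sparse, and multi-vector), and combining them yields better retrieval quality.
It remains robust across input granularities of up to 8,192 tokens and outperforms many baselines on long-document retrieval benchmarks.
Self-knowledge distillation, which integrates the scores of all retrieval signals, improves training efficiency and embedding quality.
An efficient batching strategy and high-quality data curation contribute to high training throughput and discriminative embeddings.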
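
The score integration behind self-knowledge distillation can be sketched as follows. This is an illustrative reading, not the authors' implementation: it assumes the teacher signal is a weighted sum of the three relevance scores over a list of candidate passages, and that each retrieval function is trained with a cross-entropy against the softmax of that teacher. The helper names and the equal weights are hypothetical.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of candidate scores."""
    m = max(scores)
    lse = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - lse for s in scores]


def distillation_loss(student_scores, teacher_scores):
    """Cross-entropy between the teacher's soft labels and the student's
    predicted distribution over the same candidates."""
    teacher_probs = [math.exp(x) for x in log_softmax(teacher_scores)]
    student_logp = log_softmax(student_scores)
    return -sum(t * lp for t, lp in zip(teacher_probs, student_logp))


def self_distillation_losses(dense, sparse, multi, weights=(1.0, 1.0, 1.0)):
    """Build the teacher as a weighted sum of the three score lists
    (assumed form), then score each retrieval function against it.
    In gradient-based training the teacher would be treated as a constant."""
    w1, w2, w3 = weights
    teacher = [w1 * d + w2 * s + w3 * m for d, s, m in zip(dense, sparse, multi)]
    return {
        "dense": distillation_loss(dense, teacher),
        "sparse": distillation_loss(sparse, teacher),
        "multi_vector": distillation_loss(multi, teacher),
    }


if __name__ == "__main__":
    # Scores of one query against two candidate passages (made-up numbers).
    losses = self_distillation_losses(
        dense=[2.0, 0.5], sparse=[1.0, 0.0], multi=[1.5, 0.2]
    )
    for name, value in losses.items():
        print(name, round(value, 4))
```

Because the teacher is built from the model's own outputs, no separate teacher model is needed; a function whose ranking disagrees with the combined ranking receives a larger loss than one that already agrees with it.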

This review was generated by AI and checked by a human editor.