QUICK REVIEW

[Paper Review] Jina Embeddings 2: 8192-Token General-Purpose Text Embeddings for Long Documents

Michael Günther, Jackmin Ong|arXiv (Cornell University)|2023. 10. 30.

Topic Modeling | Citations: 10

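One-Line Summary

Jina Embeddings v2 introduces an open-source, BERT-based encoder that can encode up to 8192 tokens, improving long-document embeddings and approaching state-of-the-art retrieval performance on MTEB while retaining strong GLUE results.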

ABSTRACT

Text embedding models have emerged as powerful tools for transforming sentences into fixed-sized feature vectors that encapsulate semantic information. While these models are essential for tasks like information retrieval, semantic clustering, and text re-ranking, most existing open-source models, especially those built on architectures like BERT, struggle to represent lengthy documents and often resort to truncation. One common approach to mitigate this challenge involves splitting documents into smaller paragraphs for embedding. However, this strategy results in a much larger set of vectors, consequently leading to increased memory consumption and computationally intensive vector searches with elevated latency. To address these challenges, we introduce Jina Embeddings 2, an open-source text embedding model capable of accommodating up to 8192 tokens. This model is designed to transcend the conventional 512-token limit and adeptly process long documents. Jina Embeddings 2 not only achieves state-of-the-art performance on a range of embedding-related tasks in the MTEB benchmark but also matches the performance of OpenAI's proprietary ada-002 model. Additionally, our experiments indicate that an extended context can enhance performance in tasks such as NarrativeQA.

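Research Motivation and Objectives

Represent long documents with fixed-size embeddings, without truncation and without an excessive increase in the number of vectors.
Develop and fine-tune a family of encoders, built on a modified BERT backbone, that can process 8192 tokens.
Use bidirectional ALiBi attention to enable long-context encoding without conventional positional embeddings.
Demonstrate the effectiveness of the embeddings on general benchmarks spanning retrieval, clustering, and long-document tasks.
Release the models and datasets through Hugging Face for broad accessibility.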
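
The vector-growth concern in the first objective can be made concrete with a little arithmetic. The sketch below is illustrative only: the helper name `vectors_needed` and the document lengths are hypothetical, and it assumes simple non-overlapping chunking.

```python
import math

def vectors_needed(doc_token_counts, max_tokens):
    """Count the embeddings stored when each document is split into
    non-overlapping chunks of at most max_tokens tokens (an assumed scheme)."""
    return sum(max(1, math.ceil(n / max_tokens)) for n in doc_token_counts)

# Hypothetical corpus: three documents with these token counts.
docs = [300, 2000, 8000]

print(vectors_needed(docs, 512))   # 21 vectors with a 512-token limit
print(vectors_needed(docs, 8192))  # 3 vectors with an 8192-token limit
```

With the longer context window, each of these documents fits in a single vector, so the index holds one entry per document instead of one per chunk.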

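Proposed Method

Modify a BERT-like backbone to support up to 8192 tokens by replacing the standard positional embeddings with bidirectional ALiBi attention in the encoder.
Pre-train on the English C4 corpus with MLM and no NSP, using whole-word masking at a 30% masking rate.
Fine-tune in two stages: (a) contrastive learning on text pairs, with mean pooling producing a single vector representation; (b) supervised fine-tuning with hard negatives to improve ranking and retrieval performance.
Use an InfoNCE-based loss for pair matching, applied in both directions across each pair (a bidirectional objective), with temperature τ = 0.05.
Manage memory by combining large-batch training with mixed precision and DeepSpeed, and by using activation checkpointing.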
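
As a rough illustration of the pair-training step listed above, the following is a minimal pure-Python sketch of mean pooling followed by a bidirectional InfoNCE loss with τ = 0.05. It is not the authors' implementation. The function names, the use of cosine similarity, the in-batch negatives, and the choice to sum the two directions are all assumptions made for this example.

```python
import math

def mean_pool(token_vectors):
    """Average token vectors into one fixed-size embedding."""
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[d] for vec in token_vectors) / count for d in range(dim)]

def cosine(a, b):
    """Cosine similarity (assumed scoring function for this sketch)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def log_softmax_at(logits, index):
    """Numerically stable log-softmax value for one position."""
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return logits[index] - log_total

def bidirectional_info_nce(queries, targets, temperature=0.05):
    """queries[i] and targets[i] are a positive pair; the other items in the
    batch serve as negatives. The loss is the sum of both directions."""
    n = len(queries)
    sims = [[cosine(q, t) / temperature for t in targets] for q in queries]
    # Query -> target direction: each row of the similarity matrix.
    q_to_t = -sum(log_softmax_at(sims[i], i) for i in range(n)) / n
    # Target -> query direction: each column of the similarity matrix.
    t_to_q = -sum(
        log_softmax_at([sims[j][i] for j in range(n)], i) for i in range(n)
    ) / n
    return q_to_t + t_to_q

# Toy usage with hypothetical 2-dimensional token vectors.
query_emb = mean_pool([[1.0, 0.0], [3.0, 2.0]])
target_emb = mean_pool([[2.0, 1.0], [2.0, 1.0]])
loss = bidirectional_info_nce(
    [query_emb, [0.0, 1.0]], [target_emb, [0.0, 1.0]]
)
```

A low temperature such as 0.05 sharpens the softmax, so a positive pair that is only slightly more similar than the in-batch negatives still dominates the loss.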

Figure 1: With ALiBi attention, a linear bias is incorporated into each attention score preceding the softmax operation. Each attention head employs a distinct constant scalar, $m$ , which diversifies its computation. Our model adopts the encoder variant where all tokens mutually attend during calcu
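
The mechanism in the Figure 1 caption can be sketched in a few lines of pure Python. This is a simplified illustration, not the model's code: the function names are hypothetical, and the per-head slope schedule is the geometric sequence from the original ALiBi work, assumed here for head counts that are powers of two.

```python
import math

def alibi_slopes(num_heads):
    """One scalar m per head, following the geometric schedule of the
    original ALiBi work (assumed; valid as written for power-of-two heads)."""
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]

def symmetric_alibi_bias(seq_len, slope):
    """Encoder-style bias: -m * |i - j| for every query/key pair."""
    return [[-slope * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]

def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def biased_attention_weights(scores, slope):
    """Add the linear bias to the raw attention scores, then softmax each row."""
    n = len(scores)
    bias = symmetric_alibi_bias(n, slope)
    return [softmax([scores[i][j] + bias[i][j] for j in range(n)]) for i in range(n)]

# With identical raw scores, attention decays with distance in both directions.
slopes = alibi_slopes(8)
weights = biased_attention_weights([[0.0] * 4 for _ in range(4)], slopes[0])
```

Because the bias depends only on the distance |i - j| and not on learned position vectors, the same rule applies at any sequence length, which is what allows the encoder to run on inputs longer than those seen during training.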

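Experimental Results

Research Questions

RQ1: Can ALiBi-based bidirectional attention enable bi-encoder embeddings of long documents without truncation at 512 tokens?
RQ2: Do the 8192-token embeddings outperform previous open-source models on the MTEB benchmark and approach the performance of OpenAI ada-002?
RQ3: How does increasing the context length affect downstream tasks such as NarrativeQA, as well as long-document clustering and retrieval?
RQ4: How does the two-stage fine-tuning (text pairs, then hard negatives) affect retrieval and non-retrieval tasks?
RQ5: Are the Jina Embeddings v2 models open source and available through Hugging Face, with competitive performance across benchmarks?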

Key Results

Model	Parameters	MNLI	QQP	QNLI	SST-2	CoLA	STS-B	MRPC	RTE	WNLI	Average
BERT Base	110M	84.6/83.4	71.2	90.5	93.5	52.1	85.8	88.9	66.4	-	-
BERT Large	340M	86.7/85.9	72.1	92.7	94.9	60.5	86.5	89.3	70.1	-	-
RoBERTa	355M	90.8/90.2	90.2	98.9	96.7	67.8	92.2	92.3	88.2	89.0	88.5
Jina BERT Small	33M	80.1/78.9	78.9	86.0	89.6	28.8	84.8	84.1	68.8	55.5	72.9
Jina BERT Base	137M	85.7/85.4	80.7	92.2	94.5	51.4	89.5	88.4	78.7	65.1
Jina BERT Large	435M	86.6/85.9	80.9	92.5	95.0	59.6	88.2	88.5	78.5	65.1

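The 8192-token, Jina BERT-based encoder achieves state-of-the-art performance on many MTEB tasks and is comparable to ada-002 on the benchmark.
Bidirectional ALiBi attention enables long-context encoding without positional embeddings and maintains MLM accuracy up to 8192 tokens.
Long-context fine-tuning with hard negatives improves retrieval and ranking performance on retrieval-focused tasks.
Long-context evaluation showed gains on narrative tasks such as NarrativeQA and on long-document clustering, with mixed effects on some tasks depending on document structure.
The models and datasets are released on Hugging Face and are freely accessible.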

Figure 2: Variation of model MLM accuracy w.r.t. the sequence length


This review was generated by AI and reviewed by a human editor.