QUICK REVIEW

[Paper Review] ESimCSE: Enhanced Sample Building Method for Contrastive Learning of Unsupervised Sentence Embedding

Xing Wu, Chaochen Gao|arXiv (Cornell University)|2021. 09. 09.

Topic Modeling | References: 22 | Citations: 69

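One-Line Summary

ESimCSE improves unsupervised sentence embeddings with two changes to SimCSE: (1) word/subword repetition, a safer way to vary the length of one member of each positive pair, and (2) momentum contrast, which enlarges the set of negative pairs. Together they yield stronger STS benchmark performance than SimCSE.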

ABSTRACT

Contrastive learning has been attracting much attention for learning unsupervised sentence embeddings. The current state-of-the-art unsupervised method is the unsupervised SimCSE (unsup-SimCSE). Unsup-SimCSE takes dropout as a minimal data augmentation method, and passes the same input sentence to a pre-trained Transformer encoder (with dropout turned on) twice to obtain the two corresponding embeddings to build a positive pair. As the length information of a sentence will generally be encoded into the sentence embeddings due to the usage of position embedding in Transformer, each positive pair in unsup-SimCSE actually contains the same length information. And thus unsup-SimCSE trained with these positive pairs is probably biased, which would tend to consider that sentences of the same or similar length are more similar in semantics. Through statistical observations, we find that unsup-SimCSE does have such a problem. To alleviate it, we apply a simple repetition operation to modify the input sentence, and then pass the input sentence and its modified counterpart to the pre-trained Transformer encoder, respectively, to get the positive pair. Additionally, we draw inspiration from the community of computer vision and introduce a momentum contrast, enlarging the number of negative pairs without additional calculations. The proposed two modifications are applied on positive and negative pairs separately, and build a new sentence embedding method, termed Enhanced Unsup-SimCSE (ESimCSE). We evaluate the proposed ESimCSE on several benchmark datasets w.r.t the semantic text similarity (STS) task. Experimental results show that ESimCSE outperforms the state-of-the-art unsup-SimCSE by an average Spearman correlation of 2.02% on BERT-base.

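Research Motivation and Goals

Identify the bias in SimCSE that arises because both members of a positive pair have the same length when encoded by a Transformer.
Develop a safe sentence-length extension method that preserves the meaning of the positive pair.
Use momentum contrast to increase the number of informative negative pairs without excessive computational cost.
Evaluate the proposed enhancements on standard semantic textual similarity (STS) benchmarks.
Show that the combined approach (ESimCSE) improves unsupervised sentence embeddings across models.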

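Proposed Method

Introduce word/subword repetition to modify one member of each positive pair while preserving its meaning.
Apply a momentum contrast queue, fed by a momentum-updated encoder, to expand the set of negative pairs.
Combine the two enhancements with the standard SimCSE objective to form ESimCSE.
Train on 1M English Wikipedia sentences with BERT/RoBERTa backbones, taking the sentence embedding from an MLP on top of the [CLS] representation.
Keep SimCSE's dropout-based positive-pair construction, but build each pair from the original sentence and its repetition-modified counterpart, encoded separately.
Evaluate on STS benchmarks using Spearman correlation and report the average gain over SimCSE.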
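
The repetition step can be illustrated with a small sketch. The review does not give the sampling rule, so everything below is an assumption made for illustration: the function name `repeat_tokens`, the `dup_rate` parameter and its default, and the rule that caps the number of repeated tokens at `max(2, int(dup_rate * n))`. The idea it tries to show is only that a few randomly chosen tokens are duplicated in place, which changes the sentence length while leaving the wording intact.

```python
import random


def repeat_tokens(tokens, dup_rate=0.3, rng=None):
    """Return a copy of `tokens` with a random subset of tokens duplicated in place.

    Hypothetical helper: the name, `dup_rate`, and the sampling rule are
    illustrative assumptions, not details taken from the review.
    """
    rng = rng or random.Random()
    n = len(tokens)
    if n == 0:
        return []
    # Assumed cap on the number of duplicated tokens; never more than n.
    max_dup = min(max(2, int(dup_rate * n)), n)
    dup_len = rng.randint(0, max_dup)
    dup_positions = set(rng.sample(range(n), dup_len))
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        if i in dup_positions:
            out.append(tok)  # repeat the chosen token right after itself
    return out


if __name__ == "__main__":
    sentence = "the cat sat on a mat".split()
    print(repeat_tokens(sentence, rng=random.Random(0)))
```

In a training loop built on this sketch, the original token sequence and the output of `repeat_tokens` would each be passed to the encoder once, and the two embeddings would form the positive pair.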

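Experimental Results

Research Questions

RQ1: Does the same-length bias harm the learning of semantic similarity, and can a safe length-extension method mitigate it?
RQ2: Can word/subword repetition provide safer and more effective positive-pair augmentation without distorting meaning?
RQ3: Does enlarging the set of negative pairs through momentum contrast improve unsupervised sentence embeddings without excessive computational cost?
RQ4: When these enhancements are combined, how large is the overall gain over baseline SimCSE across the standard STS datasets?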
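
RQ3 turns on why momentum contrast is cheap. A minimal sketch of the two mechanisms usually involved, assuming a MoCo-style design: a fixed-size FIFO queue that keeps embeddings from earlier mini-batches as extra negatives, and an exponential-moving-average (EMA) update that lets a second encoder trail the trained one. The names `NegativeQueue` and `ema_update`, the coefficient `m`, and the queue size are hypothetical, and plain Python lists of floats stand in for real model weights and embeddings.

```python
from collections import deque


def ema_update(momentum_params, online_params, m=0.995):
    """Move each momentum-encoder weight a small step toward the online encoder.

    `m` close to 1 means the momentum encoder changes slowly. The default is an
    illustrative value, not one quoted from the review.
    """
    return [m * pm + (1.0 - m) * po
            for pm, po in zip(momentum_params, online_params)]


class NegativeQueue:
    """FIFO store of sentence embeddings from earlier mini-batches (hypothetical)."""

    def __init__(self, max_size):
        # deque(maxlen=...) silently drops the oldest entries when full.
        self._items = deque(maxlen=max_size)

    def enqueue(self, batch_embeddings):
        self._items.extend(batch_embeddings)

    def negatives(self):
        return list(self._items)


if __name__ == "__main__":
    queue = NegativeQueue(max_size=3)
    queue.enqueue([[0.1, 0.2], [0.3, 0.4]])   # embeddings from batch 1
    queue.enqueue([[0.5, 0.6], [0.7, 0.8]])   # batch 2 pushes out the oldest entry
    print(len(queue.negatives()))             # 3 stored negatives
    print(ema_update([1.0, 1.0], [0.0, 2.0], m=0.9))
```

Under these assumptions the extra negatives are reused from the queue rather than recomputed, and the momentum encoder is updated by averaging rather than by backpropagation, which is where the low additional cost would come from.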

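Key Results

ESimCSE achieves an average Spearman correlation gain of 2.02% over SimCSE on BERT-base.
Each component yields measurable STS gains: on the STS-B development set, word repetition adds up to 1.64 points, momentum contrast 1.53 points, and the full ESimCSE configuration 2.40 points.
Momentum contrast expands the negative pairs through a queue and a momentum-updated encoder, improving training without excessive memory cost.
Across model variants (BERT base/large, RoBERTa base/large), ESimCSE consistently outperforms SimCSE on the STS benchmarks.
On transfer tasks, ESimCSE shows a small increase in average performance over SimCSE (e.g., 86.06 vs. 85.81).
In ablation studies, both the positive-pair enhancement and the negative-pair expansion contribute meaningfully, while bucketing sentences by length has a limited or negative effect.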
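
The Spearman figures above are rank correlations between model similarity scores and human similarity ratings. A minimal sketch of that computation, assuming cosine similarity between the two sentence embeddings as the model score; the function names and the toy vectors are illustrative, and a library routine such as `scipy.stats.spearmanr` would normally replace the hand-written ranking.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def sts_spearman(embedding_pairs, gold_scores):
    """Score each sentence pair by cosine similarity, then rank-correlate with gold."""
    predicted = [cosine(u, v) for u, v in embedding_pairs]
    return spearman(predicted, gold_scores)


if __name__ == "__main__":
    # Toy embeddings and ratings, invented purely for illustration.
    pairs = [([1.0, 0.0], [1.0, 0.1]), ([1.0, 0.0], [0.5, 0.5]), ([1.0, 0.0], [0.0, 1.0])]
    gold = [4.8, 2.5, 0.3]
    print(sts_spearman(pairs, gold))
```

Because only the ordering of scores matters, a model can gain on this metric by ranking sentence pairs more consistently with human judgments even if its raw cosine values are not calibrated.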


This review was generated by AI and reviewed by a human editor.