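[Paper Review] SimCSE: Simple Contrastive Learning of Sentence Embeddings
TL;DR: SimCSE introduces a simple contrastive learning framework for universal sentence embeddings. It uses dropout as noise for unsupervised training and natural language inference (NLI) pairs for supervised training, and achieves state-of-the-art results on STS tasks.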
This paper presents SimCSE, a simple contrastive learning framework that greatly advances state-of-the-art sentence embeddings. We first describe an unsupervised approach, which takes an input sentence and predicts itself in a contrastive objective, with only standard dropout used as noise. This simple method works surprisingly well, performing on par with previous supervised counterparts. We find that dropout acts as minimal data augmentation, and removing it leads to a representation collapse. Then, we propose a supervised approach, which incorporates annotated pairs from natural language inference datasets into our contrastive learning framework by using "entailment" pairs as positives and "contradiction" pairs as hard negatives. We evaluate SimCSE on standard semantic textual similarity (STS) tasks, and our unsupervised and supervised models using BERT base achieve an average of 76.3% and 81.6% Spearman's correlation respectively, a 4.2% and 2.2% improvement compared to the previous best results. We also show -- both theoretically and empirically -- that the contrastive learning objective regularizes pre-trained embeddings' anisotropic space to be more uniform, and it better aligns positive pairs when supervised signals are available.
Research Motivation and Objectives
- Motivate and improve universal sentence embeddings with a simple contrastive objective.
- Show that dropout provides minimal augmentation and prevents representation collapse in unsupervised training.
- Demonstrate that NLI-derived positives and hard negatives improve supervised sentence embeddings.
- Quantitatively evaluate on STS tasks and transfer tasks to establish state-of-the-art performance.
Proposed Method
- Use a contrastive loss with in-batch negatives and a temperature parameter to learn sentence embeddings from pre-trained encoders.
- Unsupervised: generate positive pairs by applying two different dropout masks to the same sentence and treating other in-batch sentences as negatives (no extra data augmentation).
- Supervised: form positives from entailment pairs in NLI datasets and use contradiction pairs as hard negatives within the contrastive objective; evaluate both with and without additional neutral examples as negatives.
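The objective described in the bullets above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: the function names (`cosine`, `contrastive_loss`, `dropout`) are hypothetical, the temperature default of 0.05 is an illustrative choice, and dropout is applied directly to fixed feature vectors as a stand-in for the dropout inside a pre-trained encoder.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, 1e-12)  # guard against zero vectors


def contrastive_loss(anchors, positives, hard_negatives=None, temperature=0.05):
    """Mean cross-entropy loss with in-batch negatives.

    For anchor i, positives[i] is the target. Every other positives[j] and,
    if supplied, every hard_negatives[j] acts as a negative.
    """
    candidates = list(positives)
    if hard_negatives is not None:
        candidates += list(hard_negatives)
    losses = []
    for i, h in enumerate(anchors):
        logits = [cosine(h, c) / temperature for c in candidates]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


def dropout(vec, p, rng):
    """Inverted dropout: zero each entry with probability p, rescale the rest."""
    return [0.0 if rng.random() < p else x / (1.0 - p) for x in vec]


# Unsupervised setting (toy): two dropout masks over the same inputs
# give two views of each sentence, which form the positive pairs.
rng = random.Random(0)
toy_features = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(4)]
view_a = [dropout(f, 0.1, rng) for f in toy_features]
view_b = [dropout(f, 0.1, rng) for f in toy_features]
unsup_loss = contrastive_loss(view_a, view_b)
```

In the supervised setting, the same function would take premise embeddings as `anchors`, entailment hypotheses as `positives`, and contradiction hypotheses as `hard_negatives`; the extra candidates enlarge the denominator and so make the task harder.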
Experimental Results
Research Questions
- RQ1: Can a simple contrastive objective with minimal data augmentation produce strong unsupervised sentence embeddings?
- RQ2: Does incorporating NLI-derived positive pairs and hard negatives improve supervised sentence embeddings over prior methods?
- RQ3: How do alignment and uniformity of embeddings relate to the performance gains observed with SimCSE?
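RQ1 and RQ2 are scored with Spearman's correlation on STS tasks: the cosine similarity of each sentence pair's embeddings is ranked against the human similarity rating. A minimal standard-library sketch of that statistic follows; the helper names (`ranks`, `pearson`, `spearman`) are hypothetical, and a real evaluation would more likely rely on an established statistics library.

```python
import math


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            result[order[k]] = mean_rank
        start = end + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(predicted, gold):
    """Spearman's rho: Pearson correlation computed on ranks."""
    return pearson(ranks(predicted), ranks(gold))
```

Because only the ranks matter, the metric rewards embeddings whose cosine similarities order sentence pairs the way human annotators do, regardless of the absolute similarity values.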
Key Findings
- Unsupervised SimCSE with dropout achieves strong STS performance, surpassing many prior supervised methods when using BERT-base.
- Unsupervised SimCSE yields an average Spearman’s correlation of 76.3% on STS tasks with BERT-base, representing a 4.2% improvement over previous best results.
- Supervised SimCSE, using entailment pairs from SNLI+MNLI as positives and contradiction pairs as hard negatives, raises average STS performance to 81.6% with BERT-base, a 2.2% improvement over the previous best results (and 5.3 points above the unsupervised version).
- Using NLI datasets as supervision is particularly effective for learning sentence embeddings, and adding hard negatives further improves results (e.g., 84.9% to 86.2% on STS-B with BERT-base).
- The contrastive objective regularizes the embedding space, flattening the singular value spectrum and improving uniformity, which complements alignment gains from supervised signals.
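Alignment and uniformity can be made concrete with a short sketch. It assumes the commonly used definitions from Wang and Isola (2020), computed on L2-normalized embeddings: alignment is the mean squared distance between positive pairs, and uniformity is the log of the mean Gaussian potential over all pairs of embeddings. Lower is better for both. The function names and the default `t=2.0` are illustrative choices, not taken from this review.

```python
import math


def normalize(v):
    """Scale a non-zero vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def sq_dist(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def alignment(pairs):
    """Mean squared distance between normalized positive pairs."""
    return sum(sq_dist(normalize(a), normalize(b)) for a, b in pairs) / len(pairs)


def uniformity(embeddings, t=2.0):
    """Log of the mean Gaussian potential over all distinct pairs."""
    points = [normalize(e) for e in embeddings]
    potentials = [
        math.exp(-t * sq_dist(points[i], points[j]))
        for i in range(len(points))
        for j in range(i + 1, len(points))
    ]
    return math.log(sum(potentials) / len(potentials))


# A collapsed space (all embeddings identical) versus a spread-out one.
collapsed = [[1.0, 0.0]] * 4
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
print(uniformity(collapsed), uniformity(spread))
```

The collapsed set has the worst possible uniformity under this definition, while the spread-out set scores much lower. This is the sense in which a more uniform space counteracts anisotropy.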
This review was generated by AI and checked by a human editor.