QUICK REVIEW

[Paper Review] DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations

John Giorgi, Osvald Nitski|arXiv (Cornell University)|Jun 5, 2020
Topic Modeling|78 references|97 citations
TL;DR

DeCLUTR introduces a self-supervised, contrastive objective to learn universal sentence embeddings by contrasting anchor–positive spans drawn from nearby text, extending MLM pretraining to produce strong unsupervised sentence representations.

ABSTRACT

Sentence embeddings are an important component of many natural language processing (NLP) systems. Like word embeddings, sentence embeddings are typically learned on large text corpora and then transferred to various downstream tasks, such as clustering and retrieval. Unlike word embeddings, the highest performing solutions for learning sentence embeddings require labelled data, limiting their usefulness to languages and domains where labelled data is abundant. In this paper, we present DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations. Inspired by recent advances in deep metric learning (DML), we carefully design a self-supervised objective for learning universal sentence embeddings that does not require labelled training data. When used to extend the pretraining of transformer-based language models, our approach closes the performance gap between unsupervised and supervised pretraining for universal sentence encoders. Importantly, our experiments suggest that the quality of the learned embeddings scales with both the number of trainable parameters and the amount of unlabelled training data. Our code and pretrained models are publicly available and can be easily adapted to new domains or used to embed unseen text.

Motivation & Objective

  • Motivate learning universal sentence embeddings without labeled data.
  • Design a self-supervised objective inspired by deep metric learning (DML) to train sentence encoders.
  • Show that combining contrastive learning with MLM pretraining improves downstream sentence tasks.
  • Demonstrate scaling behavior with model size and data.
  • Provide open-source code and pretrained models for domain transfer.

Proposed method

  • Use a transformer encoder f(·) and a mean-pooling pooler g(·) to obtain fixed-length embeddings.
  • Train with a contrastive NT-Xent loss that pulls anchor and positive spans together while treating other spans in the minibatch as negatives.
  • Sample anchor and positive spans from nearby text within the same document; anchor spans are longer than positive spans, so a positive can be subsumed by its anchor, which encourages the encoder to relate local and global views of the text.
  • Continue pretraining an existing MLM model (DistilRoBERTa or RoBERTa-base) with the proposed contrastive objective alongside MLM loss.
  • Span lengths are sampled from beta distributions so that spans range from sentence length to paragraph length.
  • Evaluate using SentEval on 18 downstream and 10 probing tasks to assess both performance and linguistic properties.
  • Open-source code and pretrained models are released at the project repository.
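
The pooling, span-length sampling, and contrastive loss described in the list above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' released implementation: the function names, the temperature of 0.05, the Beta parameters, and the length bounds (32 and 512 tokens) are placeholders chosen for the example. Capping the positive length at the anchor length is a simplification that guarantees anchors are at least as long as positives.

```python
import numpy as np


def mean_pool(token_embeddings, attention_mask):
    """Average token vectors over non-padding positions.

    token_embeddings: (batch, tokens, dim); attention_mask: (batch, tokens) of 0/1.
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # avoid division by zero
    return summed / counts


def sample_span_lengths(rng, min_len=32, max_len=512,
                        anchor_beta=(4.0, 2.0), positive_beta=(2.0, 4.0)):
    """Draw (anchor_len, positive_len) from Beta distributions.

    All defaults are illustrative assumptions. The anchor distribution is
    skewed toward long spans and the positive distribution toward short ones.
    """
    p_anchor = rng.beta(*anchor_beta)
    anchor_len = int(p_anchor * (max_len - min_len) + min_len)
    p_positive = rng.beta(*positive_beta)
    # Assumption: the positive is capped at the anchor length.
    positive_len = int(p_positive * (anchor_len - min_len) + min_len)
    return anchor_len, positive_len


def nt_xent_loss(anchors, positives, temperature=0.05):
    """NT-Xent loss for N (anchor, positive) embedding pairs of shape (N, dim).

    Each embedding's target is its partner; the other 2N - 2 embeddings in
    the minibatch act as negatives.
    """
    n = anchors.shape[0]
    z = np.concatenate([anchors, positives], axis=0)
    z = z / (np.linalg.norm(z, axis=1, keepdims=True) + 1e-12)
    sim = (z @ z.T) / temperature        # scaled cosine similarities, (2N, 2N)
    np.fill_diagonal(sim, -np.inf)       # an embedding is never its own negative
    targets = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    # Numerically stable log-softmax of each row, evaluated at the target.
    row_max = sim.max(axis=1, keepdims=True)
    log_norm = row_max[:, 0] + np.log(np.exp(sim - row_max).sum(axis=1))
    log_prob = sim[np.arange(2 * n), targets] - log_norm
    return float(-log_prob.mean())
```

In a training loop, `mean_pool` would be applied to the encoder's token outputs for the anchor and positive spans, and the result of `nt_xent_loss` would be added to the MLM loss. How the two losses are weighted is not stated in this review, so it is left out of the sketch.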

Experimental results

Research questions

  • RQ1: Can a self-supervised, contrastive objective produce universal sentence embeddings without labeled data?
  • RQ2: How does extending MLM pretraining with the contrastive objective affect downstream sentence tasks compared to baseline pretrained models?
  • RQ3: What architectural choices and data scales optimize the quality of learned embeddings?
  • RQ4: Do the learned embeddings retain linguistic information as measured by probing tasks?

Key findings

  • The pretrained DeCLUTR-base and DeCLUTR-small models substantially improve average downstream SentEval performance over their underlying pretrained transformers (e.g., an average score of 79.10 for DeCLUTR-base vs. 72.19 for Transformer-base).
  • DeCLUTR-base matches or exceeds supervised/semi-supervised baselines on many downstream tasks without labeled data.
  • On probing tasks, DeCLUTR models retain linguistic information comparable to the underlying pretrained models, unlike some supervised finetuned alternatives.
  • Performance scales with model size and amount of unlabelled training data, suggesting further gains with larger models or more data.
  • The method remains competitive with, and in some cases surpasses, existing unsupervised baselines (e.g., QuickThoughts) across SentEval tasks.

This review was created by AI and reviewed by human editors.