QUICK REVIEW

[Paper Review] Negative Sampling for Contrastive Representation Learning: A Review

Lanling Xu, Jianxun Lian|arXiv (Cornell University)|Jun 1, 2022

Domain Adaptation and Few-Shot Learning | Cited by 20

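One-Sentence Summary

This paper surveys negative sampling techniques for contrastive representation learning across natural language processing, computer vision, information retrieval, and graph learning, groups the methods into four categories, and outlines their trade-offs and open problems.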

ABSTRACT

The learn-to-compare paradigm of contrastive representation learning (CRL), which compares positive samples with negative ones for representation learning, has achieved great success in a wide range of domains, including natural language processing, computer vision, information retrieval and graph learning. While many research works focus on data augmentations, nonlinear transformations or other certain parts of CRL, the importance of negative sample selection is usually overlooked in literature. In this paper, we provide a systematic review of negative sampling (NS) techniques and discuss how they contribute to the success of CRL. As the core part of this paper, we summarize the existing NS methods into four categories with pros and cons in each genre, and further conclude with several open research questions as future directions. By generalizing and aligning the fundamental NS ideas across multiple domains, we hope this survey can accelerate cross-domain knowledge sharing and motivate future researches for better CRL.

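Motivation and Objectives

Define the role of negative sampling in contrastive representation learning (CRL) and its effect on representation quality.
Systematically categorize existing negative sampling methods across domains.
Analyze the pros, cons, and trade-offs of each NS category to guide future research.
Identify open problems and future directions for negative sampling in CRL.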

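Proposed Method

Formulate CRL through a general negative sampling objective based on InfoNCE and a sampling distribution p(y^{-}).
Divide NS methods into four categories: Static NS, Dynamic NS, Adversarial NS, and Efficient NS, and detail representative methods in each.
Survey domain applications (NLP, CV, IR, GRL) and how NS is implemented in each task.
Analyze four key properties of a negative sampler: efficiency, effectiveness, stability, and data independence, and discuss the trade-offs among them.
Provide a unified framework and algorithm that covers static, hard negative sampling, and adversarial strategies as special cases.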
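To make the role of the sampling distribution p(y^{-}) concrete, the following is a minimal sketch of an InfoNCE-style loss in which the negatives come from a pluggable sampler. It is not the survey's algorithm: the function names (`info_nce`, `sample_static`, `sample_hard`), the cosine scoring, the temperature of 0.1, and the toy Gaussian embeddings are all assumptions made for illustration.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE for one anchor: negative log-softmax of the positive's score
    among the positive and the sampled negatives."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    top = max(logits)  # subtract the max for numerical stability
    log_denominator = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_denominator - logits[0]

def sample_static(pool, k, rng):
    """Static NS (assumed uniform): ignores the anchor and the model state."""
    return rng.sample(pool, k)

def sample_hard(anchor, pool, k):
    """Hard NS (illustrative): the k candidates currently most similar to the anchor."""
    return sorted(pool, key=lambda cand: cosine(anchor, cand), reverse=True)[:k]

# Toy data: hypothetical 16-dimensional embeddings.
rng = random.Random(0)
dim = 16
anchor = [rng.gauss(0, 1) for _ in range(dim)]
positive = [a + 0.1 * rng.gauss(0, 1) for a in anchor]
pool = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(200)]

loss_static = info_nce(anchor, positive, sample_static(pool, 8, rng))
loss_hard = info_nce(anchor, positive, sample_hard(anchor, pool, 8))
print(f"static: {loss_static:.4f}  hard: {loss_hard:.4f}")
```

Because the hard sampler picks the highest-scoring candidates from the same pool, its loss is never lower than that of a uniform draw of the same size, which is one way to see why harder negatives give a stronger training signal. The same selection rule is also what makes false negatives more likely, since the most similar candidates may in fact share the anchor's semantics.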

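Experimental Results

Research Questions

RQ1: What are the main categories of negative sampling methods for CRL, and what trade-offs does each involve?
RQ2: How do negative sampling strategies affect CRL performance across NLP, CV, IR, and graph learning?
RQ3: What unresolved research challenges and future directions exist in designing efficient, effective, and stable NS methods?
RQ4: How can easy and hard negatives be balanced in CRL while reducing false negatives?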

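Key Findings

Static NS is simple, stable, fast, and data-independent, but may yield suboptimal negatives during training.
Dynamic NS (DNS) mines hard negatives and speeds up convergence, but risks false negatives and adds computational cost.
Adversarial NS generates hard negatives with a GAN-like minimax framework, but adds complexity and potential instability.
Efficient NS uses in-batch sampling and caching to scale CRL to large datasets, at the cost of trade-offs in memory usage and stale negatives.
Across domains, informed negative sampling generally improves CRL performance, but no single method excels on all four desired properties (efficiency, effectiveness, stability, data independence).
Open directions identified in the paper include mitigating false negatives, combining easy and hard negatives, balancing quality against quantity, and exploring methods that need no negative sampling.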
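The memory and staleness trade-off noted for Efficient NS can be sketched with two toy components: in-batch negatives, which reuse the other examples in the current batch, and a fixed-size FIFO cache of embeddings from earlier steps. This is an illustrative sketch only. The names `in_batch_negatives` and `NegativeQueue`, the capacity of 8, and the two-dimensional toy embeddings are hypothetical and not taken from the paper.

```python
from collections import deque

def in_batch_negatives(batch, index):
    """In-batch NS: every other example in the batch serves as a negative
    for the example at position `index`."""
    return [vec for j, vec in enumerate(batch) if j != index]

class NegativeQueue:
    """Fixed-size FIFO cache of past embeddings, reused as negatives."""

    def __init__(self, capacity):
        # deque(maxlen=...) evicts the oldest entry once capacity is reached.
        self.items = deque(maxlen=capacity)

    def enqueue(self, embeddings, step):
        """Store each embedding with the training step that produced it."""
        for vec in embeddings:
            self.items.append((step, vec))

    def negatives(self):
        return [vec for _, vec in self.items]

    def max_staleness(self, current_step):
        """Age, in steps, of the oldest embedding still in the cache."""
        if not self.items:
            return 0
        return current_step - self.items[0][0]

batch_size = 4
queue = NegativeQueue(capacity=8)
for step in range(5):
    # Toy embeddings: [step, position]; a real encoder would produce these.
    batch = [[float(step), float(i)] for i in range(batch_size)]
    queue.enqueue(batch, step)

local = in_batch_negatives(batch, index=0)
print(len(local), len(queue.negatives()), queue.max_staleness(current_step=4))
# prints: 3 8 1
```

A larger capacity yields more negatives per update without enlarging the batch, but the oldest cached embeddings were computed by an earlier version of the encoder, so staleness grows with capacity.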


This review was generated by AI and reviewed by a human editor.