QUICK REVIEW

[Paper Review] SOLAR: Sparse Orthogonal Learned and Random Embeddings

Tharun Medini, Beidi Chen|arXiv (Cornell University)|May 3, 2021

Text and Document Classification Technologies | References: 34 | Cited by: 6

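One-Sentence Summary

This paper proposes SOLAR, a method for training ultra-sparse, high-dimensional embeddings (up to 500K dimensions) that replaces expensive near-neighbor search with fast lookups. By pairing random, sparse, near-orthogonal label vectors with learned sparse query vectors, SOLAR achieves higher precision and recall than the respective state-of-the-art (SOTA) baselines on book search and multi-label classification, with inference up to 10 times faster, while a novel partitioning scheme enables communication-free multi-GPU training.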

ABSTRACT

Dense embedding models are commonly deployed in commercial search engines, wherein all the document vectors are pre-computed, and near-neighbor search (NNS) is performed with the query vector to find relevant documents. However, the bottleneck of indexing a large number of dense vectors and performing an NNS hurts the query time and accuracy of these models. In this paper, we argue that high-dimensional and ultra-sparse embedding is a significantly superior alternative to dense low-dimensional embedding for both query efficiency and accuracy. Extreme sparsity eliminates the need for NNS by replacing them with simple lookups, while its high dimensionality ensures that the embeddings are informative even when sparse. However, learning extremely high dimensional embeddings leads to blow up in the model size. To make the training feasible, we propose a partitioning algorithm that learns such high dimensional embeddings across multiple GPUs without any communication. This is facilitated by our novel asymmetric mixture of Sparse, Orthogonal, Learned and Random (SOLAR) Embeddings. The label vectors are random, sparse, and near-orthogonal by design, while the query vectors are learned and sparse. We theoretically prove that our way of one-sided learning is equivalent to learning both query and label embeddings. With these unique properties, we can successfully train 500K dimensional SOLAR embeddings for the tasks of searching through 1.6M books and multi-label classification on the three largest public datasets. We achieve superior precision and recall compared to the respective state-of-the-art baselines for each task with up to 10 times faster speed.

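Motivation and Objectives

Address the query-time and accuracy bottlenecks of the dense embedding models used in commercial search engines.
Investigate whether extreme sparsity in high-dimensional embeddings can replace expensive near-neighbor search with efficient lookups while keeping the embeddings informative.
Design a training method that scales to 500K-dimensional embeddings with no communication across GPUs.
Surpass state-of-the-art (SOTA) dense embedding models in retrieval and classification performance.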

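Proposed Method

Propose the SOLAR framework: an asymmetric mixture of Sparse, Orthogonal, Learned and Random embeddings, in which the label vectors are random and sparse and the query vectors are learned and sparse.
Adopt an asymmetric, one-sided learning strategy: only the query vectors are trained, while the label vectors stay fixed as random, sparse, and near-orthogonal.
Prove theoretically that, under this design, one-sided learning is equivalent to learning both query and label embeddings.
Design a partitioning algorithm that splits the embedding space across multiple GPUs, so that high-dimensional embeddings are trained without any inter-GPU communication.
Train with a loss function that optimizes the similarity between the learned query vectors and the fixed random label vectors, keeping training efficient and scalable.
Exploit the near-orthogonality of the random sparse label vectors so that the embeddings remain expressive even under extreme sparsity.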
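To make the "random, sparse, near-orthogonal" property concrete, here is a minimal sketch. It assumes one particular construction that this review does not spell out: the D-dimensional space is cut into `num_chunks` chunks of `chunk_size` coordinates, and each label gets exactly one nonzero coordinate per chunk, picked uniformly at random. The function names and the toy sizes are hypothetical. Under this assumption a label's inner product with itself is `num_chunks`, while two different labels collide in any given chunk with probability `1 / chunk_size`, so their expected inner product is only `num_chunks / chunk_size`.

```python
import numpy as np

def make_label_vectors(num_labels, num_chunks, chunk_size, seed=0):
    """Return active[l, k]: the one nonzero coordinate of label l inside chunk k."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, chunk_size, size=(num_labels, num_chunks))

def gram_matrix(active):
    """Inner products of the implied binary vectors.

    Two labels contribute 1 for every chunk in which they share a bucket.
    """
    same_bucket = active[:, None, :] == active[None, :, :]
    return same_bucket.sum(axis=2)

# Toy sizes (assumed for illustration): 16 chunks x 1000 buckets = 16K dimensions.
num_labels, num_chunks, chunk_size = 200, 16, 1000
active = make_label_vectors(num_labels, num_chunks, chunk_size)
gram = gram_matrix(active)
off_diagonal = gram[~np.eye(num_labels, dtype=bool)]

print("dimension:", num_chunks * chunk_size)
print("self inner product:", gram[0, 0])
print("mean inner product between different labels:", off_diagonal.mean())
```

Storing only the active index per chunk, rather than the full binary vector, is what keeps such a representation cheap even when the nominal dimension is very large.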

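Experimental Results

Research Questions

RQ1: Can ultra-sparse, high-dimensional embeddings replace dense embeddings in retrieval and classification tasks while delivering higher efficiency and accuracy?
RQ2: Is one-sided learning (training only the query vectors while keeping the label vectors fixed as random, sparse, and near-orthogonal) equivalent to jointly training query and label embeddings?
RQ3: Can 500K-dimensional embeddings be trained efficiently without any inter-GPU communication?
RQ4: Does the proposed method outperform state-of-the-art (SOTA) dense embedding models on real-world retrieval and classification tasks in precision, recall, and inference speed?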
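One way to picture the communication-free training asked about in RQ3 is sketched below. This is an illustration under stated assumptions, not the paper's implementation: it assumes each label has one fixed random bucket per chunk, uses a plain softmax regression as a stand-in for the real model and loss (which this review does not describe), and runs the chunks sequentially on a CPU. The names `train_chunk`, `active`, and the toy data are hypothetical. The point is only that each chunk's update reads its own weights and its own targets, so nothing needs to be exchanged between workers.

```python
import numpy as np

def train_chunk(features, bucket_targets, chunk_size, steps=200, lr=0.5, seed=0):
    """Fit one chunk's softmax classifier using only this chunk's weights and targets."""
    rng = np.random.default_rng(seed)
    num_samples, num_features = features.shape
    weights = 0.01 * rng.standard_normal((num_features, chunk_size))
    one_hot = np.eye(chunk_size)[bucket_targets]
    for _ in range(steps):
        logits = features @ weights
        logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        # Gradient step on the mean cross-entropy loss.
        weights -= lr * features.T @ (probs - one_hot) / num_samples
    return weights

# Toy data (assumed): 4 queries with one-hot features, query i is relevant to label i.
features = np.eye(4)
# Fixed random label codes: active[l, k] is the bucket of label l in chunk k.
active = np.array([[0, 5, 2], [3, 3, 7], [6, 1, 1], [2, 4, 0]])
chunk_size = 8

# Each call could run on its own GPU: no weights or gradients are shared.
chunk_weights = [
    train_chunk(features, active[:, k], chunk_size, seed=k)
    for k in range(active.shape[1])
]

# Stacking the per-chunk predictions gives each query's sparse code.
predicted = np.stack([(features @ w).argmax(axis=1) for w in chunk_weights], axis=1)
print(predicted)
```

Because the label codes are fixed in advance, the targets for every chunk are known before training starts, which is what lets the chunks be fitted independently in this sketch.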

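Key Findings

SOLAR achieves higher precision and recall than existing dense embedding baselines on the task of searching through 1.6M books.
On the three largest public multi-label classification datasets, SOLAR outperforms the respective state-of-the-art (SOTA) baselines.
Because near-neighbor search is replaced by simple lookups, inference is up to 10 times faster than the baselines.
500K-dimensional embeddings were trained successfully across multiple GPUs without inter-GPU communication, which makes training scalable.
The theoretical analysis shows that, within the proposed framework, one-sided learning with fixed, sparse, near-orthogonal label vectors is equivalent to full joint training.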
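The speedup attributed to lookups can be illustrated with an inverted index. The sketch below is an assumption-laden toy, not the paper's inference code: it assumes each label owns one bucket per chunk, that the query model emits a score per bucket, and that candidates are ranked by summing the scores of the buckets they share with the query's top buckets. The names `build_inverted_index`, `lookup`, `top_m`, and `top_k` are hypothetical.

```python
from collections import defaultdict
import numpy as np

def build_inverted_index(active):
    """Map (chunk, bucket) -> list of labels whose fixed code is nonzero there."""
    index = defaultdict(list)
    for label, buckets in enumerate(active):
        for chunk, bucket in enumerate(buckets):
            index[(chunk, int(bucket))].append(label)
    return index

def lookup(index, query_scores, top_m=1, top_k=1):
    """Rank labels by summed scores over the query's top_m buckets in each chunk."""
    votes = defaultdict(float)
    for chunk, scores in enumerate(query_scores):
        top_buckets = np.argsort(scores)[-top_m:]
        for bucket in top_buckets:
            for label in index.get((chunk, int(bucket)), []):
                votes[label] += float(scores[bucket])
    ranked = sorted(votes.items(), key=lambda item: -item[1])
    return [label for label, _ in ranked[:top_k]]

# Toy label codes (assumed): 3 labels, 3 chunks, 5 buckets per chunk.
active = np.array([[0, 1, 2], [3, 3, 3], [0, 4, 1]])
index = build_inverted_index(active)

# A query whose scores peak exactly on label 1's buckets.
query_scores = np.zeros((3, 5))
for chunk, bucket in enumerate(active[1]):
    query_scores[chunk, bucket] = 1.0

print(lookup(index, query_scores, top_m=1, top_k=1))
```

The work per query here depends on the number of chunks, `top_m`, and the length of the visited posting lists, not on a scan or graph traversal over all label vectors, which is the intuition behind replacing near-neighbor search with lookups.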

This review was generated by AI and reviewed by a human editor.