QUICK REVIEW

[Paper Review] Swivel: Improving Embeddings by Noticing What's Missing

Noam Shazeer, Ryan Doherty | arXiv (Cornell University) | Feb 6, 2016

Advanced Graph Neural Networks | 13 references | 56 citations

TL;DR

Swivel is a scalable, distributed method for learning low-dimensional feature embeddings by performing approximate factorization of the point-wise mutual information (PMI) matrix derived from a co-occurrence matrix. It uses a piecewise loss function that explicitly models unobserved co-occurrences via a soft hinge loss, enabling superior performance on rare features while maintaining accuracy on common ones, and scales efficiently using vectorized computation and sharded matrix processing across distributed workers.

ABSTRACT

We present Submatrix-wise Vector Embedding Learner (Swivel), a method for generating low-dimensional feature embeddings from a feature co-occurrence matrix. Swivel performs approximate factorization of the point-wise mutual information matrix via stochastic gradient descent. It uses a piecewise loss with special handling for unobserved co-occurrences, and thus makes use of all the information in the matrix. While this requires computation proportional to the size of the entire matrix, we make use of vectorized multiplication to process thousands of rows and columns at once to compute millions of predicted values. Furthermore, we partition the matrix into shards in order to parallelize the computation across many nodes. This approach results in more accurate embeddings than can be achieved with methods that consider only observed co-occurrences, and can scale to much larger corpora than can be handled with sampling methods.

Motivation & Objective

To develop a scalable method for learning high-quality feature embeddings from large co-occurrence matrices that captures both observed and unobserved co-occurrences.
To address the limitations of existing methods: GloVe trains only on observed co-occurrences, while SGNS, as a sampling-based method, has a training cost that grows with corpus size.
To improve embedding quality for rare features without degrading performance on frequent features.
To enable efficient, distributed training on massive co-occurrence matrices using vectorized operations and sharding.

Proposed method

Swivel performs stochastic gradient descent to approximate the factorization of the point-wise mutual information (PMI) matrix derived from a feature co-occurrence matrix.
It uses a piecewise loss function that differentiates between observed co-occurrences (with frequency-weighted error) and unobserved co-occurrences (using a soft hinge loss to prevent over-estimation of PMI).
The algorithm models the dot product of word and context embeddings as an approximation of the true PMI value: $ w_i^\top \tilde{w}_j \approx \text{pmi}(i;j) = \log x_{ij} + \log|D| - \log x_{i*} - \log x_{*j} $.
To scale efficiently, Swivel partitions the co-occurrence matrix into sharded submatrices, enabling parallel processing across multiple worker nodes.
Vectorized matrix multiplication is used to compute millions of predicted PMI values simultaneously, leveraging GPU acceleration for high throughput.
The block structure amortizes parameter transfer costs and reduces contention in distributed training environments.
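
The PMI target and the piecewise loss described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation. The function names `pmi_matrix` and `piecewise_loss` are hypothetical, the square-root weighting is a placeholder for the frequency weighting (this review does not state its exact form), and substituting a count of 1 for unobserved cells is an assumption made so that the PMI of those cells is finite.

```python
import numpy as np

def pmi_matrix(counts):
    """PMI per cell: log x_ij + log|D| - log x_i* - log x_*j.

    Assumes every row and column has at least one observed count.
    Unobserved cells (x_ij == 0) use a stand-in count of 1 (assumption).
    """
    counts = np.asarray(counts, dtype=np.float64)
    row = counts.sum(axis=1, keepdims=True)   # x_i*
    col = counts.sum(axis=0, keepdims=True)   # x_*j
    total = counts.sum()                      # |D|
    smoothed = np.where(counts > 0, counts, 1.0)
    return np.log(smoothed) + np.log(total) - np.log(row) - np.log(col)

def piecewise_loss(W, W_tilde, counts, weight=np.sqrt):
    """Sum of per-cell losses over the whole matrix.

    Observed cells: weighted squared error between prediction and PMI.
    Unobserved cells: soft hinge log(1 + exp(err)), which only penalizes
    predictions that rise above the stand-in PMI value.
    """
    counts = np.asarray(counts, dtype=np.float64)
    pred = W @ W_tilde.T                      # all dot products w_i . w~_j at once
    err = pred - pmi_matrix(counts)
    squared = 0.5 * weight(counts) * err ** 2
    soft_hinge = np.logaddexp(0.0, err)       # numerically stable log(1 + exp(err))
    return np.where(counts > 0, squared, soft_hinge).sum()
```

The soft hinge is one-sided: pushing a prediction far below the stand-in PMI of an unobserved cell costs almost nothing, while pushing it above is penalized roughly linearly, which is what discourages over-estimation.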

Experimental results

Research questions

RQ1: Can a method that explicitly models unobserved co-occurrences produce better embeddings than those that ignore them?
RQ2: How does the inclusion of unobserved co-occurrence information affect performance on rare versus frequent features?
RQ3: Can a count-based method like Swivel scale to larger corpora than sampling-based methods such as SGNS?
RQ4: Does a piecewise loss function that treats observed and unobserved co-occurrences differently lead to more stable and accurate embeddings?
RQ5: How effectively can vectorized and sharded computation enable scalable training on massive co-occurrence matrices?

Key findings

Swivel outperforms SGNS and GloVe on analogical reasoning tasks across all frequency buckets, with the largest accuracy gains on rare words.
On the most frequent words, all models perform poorly, likely due to polysemy and high context diversity; Swivel's gains on rare words do not come at the cost of further degradation here.
GloVe underperforms SGNS on rare words, suggesting it prioritizes fitting common words at the expense of rare ones, while Swivel avoids this trade-off.
Swivel’s performance is robust across word frequencies, with consistent improvements over SGNS and GloVe, especially in the low-frequency regime.
The method scales efficiently: a single GPU can estimate approximately 200 million PMI values per second for 1024-dimensional embeddings using vectorized matrix multiplication.
Swivel successfully parallelizes across hundreds of worker machines, demonstrating strong scalability in distributed environments due to sharding and parameter transfer amortization.
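
The sharded, vectorized computation behind these scaling results can be illustrated with a small NumPy sketch. It is a simplification under stated assumptions: the helper names `iter_shards` and `predict_by_shard` are hypothetical, shards are taken as contiguous blocks (the review does not say how rows and columns are assigned to shards), and the loop runs serially where a distributed setup would hand each shard to a worker.

```python
import numpy as np

def iter_shards(n_rows, n_cols, block):
    """Yield (row_slice, col_slice) pairs that tile an n_rows x n_cols matrix."""
    for r in range(0, n_rows, block):
        for c in range(0, n_cols, block):
            yield (slice(r, min(r + block, n_rows)),
                   slice(c, min(c + block, n_cols)))

def predict_by_shard(W, W_tilde, block):
    """Fill the full prediction matrix one shard at a time."""
    out = np.empty((W.shape[0], W_tilde.shape[0]))
    for rows, cols in iter_shards(W.shape[0], W_tilde.shape[0], block):
        # A single matrix multiplication yields every predicted value in the
        # shard; only the embeddings for these rows and columns are needed.
        out[rows, cols] = W[rows] @ W_tilde[cols].T
    return out
```

As a back-of-the-envelope check, assuming a hypothetical shard of 4096 rows by 4096 columns, one multiplication produces 4096 × 4096 ≈ 16.8 million predicted values, so the quoted rate of about 200 million values per second would correspond to roughly 12 such shards per second on one GPU.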


This review was created by AI and reviewed by human editors.