QUICK REVIEW

[Paper Review] Evaluating vector-space models of analogy

Dawn Chen, Joshua C. Peterson|arXiv (Cornell University)|May 12, 2017

Cognitive Science and Education Research|14 references|31 citations

TL;DR

This paper evaluates the parallelogram model of analogy in modern word embeddings like word2vec and GloVe by comparing their predictions of relational similarity to human judgments. It finds that while the model captures some semantic relations well, it fails to replicate human violations of geometric constraints like the triangle inequality, revealing fundamental limitations in modeling human-like analogy reasoning.

ABSTRACT

Vector-space representations provide geometric tools for reasoning about the similarity of a set of objects and their relationships. Recent machine learning methods for deriving vector-space embeddings of words (e.g., word2vec) have achieved considerable success in natural language processing. These vector spaces have also been shown to exhibit a surprising capacity to capture verbal analogies, with similar results for natural images, giving new life to a classic model of analogies as parallelograms that was first proposed by cognitive scientists. We evaluate the parallelogram model of analogy as applied to modern word embeddings, providing a detailed analysis of the extent to which this approach captures human relational similarity judgments in a large benchmark dataset. We find that some semantic relationships are better captured than others. We then provide evidence for deeper limitations of the parallelogram model based on the intrinsic geometric constraints of vector spaces, paralleling classic results for first-order similarity.

Motivation & Objective

To assess how well modern vector-space models (word2vec, GloVe) predict human judgments of relational similarity in verbal analogies.
To investigate whether the parallelogram model of analogy—where relations are represented as vector differences—accurately reflects human cognitive similarity judgments.
To examine whether human relational similarity judgments violate geometric constraints (e.g., triangle inequality) that constrain vector space models.
To determine whether the limitations of vector-space models stem from intrinsic geometric properties rather than suboptimal embedding methods.

Proposed method

Collected a new dataset of 5,000 word pair comparisons across 10 semantic relation types, including class-inclusion, contrast, and part-whole.
Administered a human rating task in which participants rated analogy quality on a 7-point scale. The stimuli were 12 triads of word pairs, each yielding three analogies (between pairs 1 and 2, pairs 2 and 3, and pairs 1 and 3), to test relational similarity.
Calculated predicted relational similarity using cosine similarity between difference vectors (e.g., v_queen - v_king) in word2vec and GloVe embeddings.
Conducted repeated-measures ANOVAs on human ratings and separate between-subjects ANOVAs on predicted similarities to test for effects of analogy type.
Used Tukey HSD post hoc tests to compare mean ratings and predicted similarities across analogy types (1-2, 2-3, 1-3).
Analyzed violations of geometric axioms (symmetry, triangle inequality) in human judgments and compared them to predictions from vector space models.
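
The relational similarity computation described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names and the 3-dimensional toy vectors are hypothetical stand-ins for real word2vec or GloVe embeddings, and the only assumption taken from the review is that relational similarity is the cosine between difference vectors.

```python
import math

def difference(a, b):
    """Relation vector for the word pair (a, b): b - a."""
    return [y - x for x, y in zip(a, b)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_u * norm_v)

def relational_similarity(pair1, pair2):
    """Cosine between the difference vectors of two word pairs."""
    return cosine(difference(*pair1), difference(*pair2))

# Hypothetical toy embeddings (NOT real word2vec/GloVe values).
toy = {
    "king":  [0.8, 0.3, 0.1],
    "queen": [0.8, 0.3, 0.9],
    "man":   [0.2, 0.7, 0.1],
    "woman": [0.2, 0.7, 0.9],
}

score = relational_similarity(
    (toy["king"], toy["queen"]),
    (toy["man"], toy["woman"]),
)
print(round(score, 6))
```

In the toy data the two difference vectors point in the same direction, so the score is at its maximum; with real embeddings the score would fall somewhere between -1 and 1.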

Experimental results

Research questions

RQ1: To what extent do word2vec and GloVe embeddings predict human judgments of relational similarity in verbal analogies?
RQ2: Do human judgments of relational similarity violate geometric constraints such as the triangle inequality, and if so, how does this affect vector-space models?
RQ3: Are there specific semantic relation types (e.g., similar, part-of) for which the parallelogram model performs better than others?
RQ4: Can the failure of vector-space models to predict human relational similarity be attributed to intrinsic geometric constraints of vector spaces?
RQ5: How do the predictions of word2vec and GloVe embeddings compare to human ratings in terms of relational similarity for different analogy structures?

Key findings

Human ratings showed a significant effect of analogy type on quality, with type 1-2 (M=5.44, SD=.99) and type 2-3 (M=5.43, SD=.63) rated significantly higher than type 1-3 (M=2.99, SD=.46), p<.001.
The ANOVA on human ratings revealed a significant effect of analogy type, F(2,33)=45.57, p<.001, indicating that participants perceived relational similarity differently based on structure.
Predicted relational similarities from word2vec and GloVe showed no significant effect of analogy type: F(2,33)=1.20, p=.31 for word2vec and F(2,33)=.24, p=.79 for GloVe.
In 7 out of 12 triads, the expected pattern (1-2 and 2-3 rated higher than 1-3) was statistically significant in human ratings, but this pattern was not consistently predicted by vector models.
Human judgments violated the triangle inequality: the 1-2 and 2-3 analogies were rated highly while the 1-3 analogy was rated low, whereas the triangle inequality requires pairs 1 and 3 to be reasonably similar whenever pairs 1 and 2 and pairs 2 and 3 are both similar.
The failure of vector-space models to replicate human relational similarity patterns stems from intrinsic geometric constraints—such as the triangle inequality—that cannot be overcome by better embedding methods.
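
One way to see why this constraint is intrinsic is to treat the angle between difference vectors as the dissimilarity measure. This is an assumption made for illustration; the paper's exact formulation may differ. Angles between vectors obey the triangle inequality, so the 1-3 angle can be no larger than the sum of the 1-2 and 2-3 angles, which puts a floor under the 1-3 cosine. The helper name `min_cosine_13` below is hypothetical.

```python
import math

def min_cosine_13(cos_12, cos_23):
    """Lowest cosine possible between relations 1 and 3, given the
    cosines for (1, 2) and (2, 3), under the angular triangle inequality:
    theta_13 <= theta_12 + theta_23."""
    # Clamp inputs so small floating-point overshoots do not break acos.
    theta_12 = math.acos(max(-1.0, min(1.0, cos_12)))
    theta_23 = math.acos(max(-1.0, min(1.0, cos_23)))
    total = theta_12 + theta_23
    # Angles between vectors never exceed pi, so beyond that the bound is vacuous.
    return math.cos(total) if total <= math.pi else -1.0

# If both the 1-2 and 2-3 analogies have cosine 0.9, the 1-3 cosine
# cannot drop below cos(2 * acos(0.9)) = 2 * 0.9**2 - 1.
bound = min_cosine_13(0.9, 0.9)
print(round(bound, 4))
```

Under this reading, any embedding that scores the 1-2 and 2-3 analogies as highly similar is forced to score the 1-3 analogy as at least moderately similar, regardless of how the vectors were trained.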


This review was created by AI and reviewed by human editors.