QUICK REVIEW

[Paper Review] word2vec Parameter Learning Explained

Xin Rong|arXiv (Cornell University)|Nov 11, 2014

Topic Modeling|References: 5|Cited by: 659

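One-Sentence Summary

This paper gives a thorough mathematical derivation and an intuitive explanation of parameter learning in word2vec. It covers the continuous bag-of-words (CBOW) and skip-gram architectures and derives the gradient updates for stochastic gradient descent with a full softmax, with hierarchical softmax, and with negative sampling. Its main contribution is a clear, step-by-step account of how word vectors are learned through backpropagation and these optimization techniques, which makes the inner workings of word2vec accessible to readers who are not neural network experts.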

ABSTRACT

The word2vec model and application by Mikolov et al. have attracted a great amount of attention in recent two years. The vector representations of words learned by word2vec models have been shown to carry semantic meanings and are useful in various NLP tasks. As an increasing number of researchers would like to experiment with word2vec or similar techniques, I notice that there lacks a material that comprehensively explains the parameter learning process of word embedding models in details, thus preventing researchers that are non-experts in neural networks from understanding the working mechanism of such models. This note provides detailed derivations and explanations of the parameter update equations of the word2vec models, including the original continuous bag-of-word (CBOW) and skip-gram (SG) models, as well as advanced optimization techniques, including hierarchical softmax and negative sampling. Intuitive interpretations of the gradient equations are also provided alongside mathematical derivations. In the appendix, a review on the basics of neuron networks and backpropagation is provided. I also created an interactive demo, wevi, to facilitate the intuitive understanding of the model.

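Motivation and Objectives

Give researchers without advanced neural network expertise a detailed, accessible explanation of parameter learning in word2vec models.
Derive and explain the gradient update equations of the CBOW and skip-gram models under stochastic gradient descent.
Clarify the mathematical basis of the advanced optimization techniques used in word2vec, namely hierarchical softmax and negative sampling.
Bridge the gap between an intuitive understanding of word embedding training and the formal backpropagation derivation.
Support learning with an interactive demo (wevi) and an appendix that reviews the basics of neural networks.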

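Proposed Method

Formulate the word2vec loss as the negative log-likelihood of predicting the correct target word given its context.
Derive the gradient of the loss with respect to the output-layer weights by backpropagation, which gives the update rule $ {\mathbf{v}'_{w_j}}^{\text{new}} = {\mathbf{v}'_{w_j}}^{\text{old}} - \eta (y_j - t_j) \mathbf{h} $, where $ y_j $ is the softmax output for word $ w_j $, $ t_j $ is 1 for the target word and 0 otherwise, $ \mathbf{h} $ is the hidden-layer output, and $ \eta $ is the learning rate.
Apply the same derivation to the input vectors by backpropagating the error through the hidden layer, which gives the update for $ \mathbf{v}_w $.
Introduce and derive negative sampling as a computationally efficient alternative to the full softmax, drawing the negative words from a noise distribution $ P_n(w) $.
Derive the negative sampling loss: $ E = -\log\sigma({\mathbf{v}'_{w_O}}^T \mathbf{h}) - \sum_{w_j \in \mathcal{W}_{\text{neg}}} \log\sigma(-{\mathbf{v}'_{w_j}}^T \mathbf{h}) $.
Derive the gradient updates for the output and input vectors under negative sampling. For the output vectors the rule is $ {\mathbf{v}'_{w_j}}^{\text{new}} = {\mathbf{v}'_{w_j}}^{\text{old}} - \eta (\sigma({\mathbf{v}'_{w_j}}^T \mathbf{h}) - t_j) \mathbf{h} $, and it is applied only to the positive word and the sampled negative words.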
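
The full-softmax update rules listed above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the paper's or word2vec's actual code: it assumes a single context word, stores the input vectors as rows of a hypothetical matrix `W_in` and the output vectors as rows of `W_out`, and the function name `softmax_sgd_step` is invented for this example.

```python
import numpy as np

def softmax(u):
    """Numerically stable softmax over a 1-D score vector."""
    z = np.exp(u - np.max(u))
    return z / z.sum()

def softmax_sgd_step(W_in, W_out, context_idx, target_idx, lr=0.1):
    """One SGD step for a one-word-context model with a full softmax.

    W_in  : (V, N) array, row w is the input vector v_w
    W_out : (V, N) array, row w is the output vector v'_w
    Both arrays are updated in place; the loss before the update is returned.
    """
    h = W_in[context_idx].copy()     # hidden layer = input vector of the context word
    y = softmax(W_out @ h)           # predicted distribution over the vocabulary
    loss = -np.log(y[target_idx])    # negative log-likelihood of the target word

    e = y.copy()
    e[target_idx] -= 1.0             # prediction error e_j = y_j - t_j
    EH = W_out.T @ e                 # error backpropagated to the hidden layer

    W_out -= lr * np.outer(e, h)     # v'_{w_j} <- v'_{w_j} - lr * e_j * h, for every j
    W_in[context_idx] -= lr * EH     # v_{w_I}  <- v_{w_I}  - lr * EH
    return loss
```

Note that every one of the V output vectors is updated on each step, which is the cost that hierarchical softmax and negative sampling are meant to avoid.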

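Experimental Results

Research Questions

RQ1: How are the word vector parameters updated during training in the CBOW and skip-gram models?
RQ2: What is the mathematical derivation of the gradient update rule for the output-layer weights when the standard word2vec model uses a softmax?
RQ3: How does negative sampling reduce computational cost while still learning effective word vectors?
RQ4: What role does the noise distribution $ P_n(w) $ play in negative sampling, and how does it affect the training objective?
RQ5: How does backpropagation carry the error from the output layer back to the input layer to update the input word vectors?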
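
To make the noise distribution in RQ4 concrete, here is a small sketch of how negative words might be drawn from $ P_n(w) $. The choice of $ P_n(w) $ is not specified in this review; the sketch assumes the unigram distribution raised to the 3/4 power, which is the choice reported by Mikolov et al. (2013b). The helper names are hypothetical, and sampling without replacement while excluding the target word is a simplification made here so that the sampled indices are distinct.

```python
import numpy as np

def noise_distribution(counts, power=0.75):
    """Unigram counts raised to `power`, renormalised to sum to 1 (assumed form of P_n)."""
    weights = np.asarray(counts, dtype=float) ** power
    return weights / weights.sum()

def sample_negatives(P_n, k, target_id, rng):
    """Draw k distinct negative word indices from P_n, never returning target_id."""
    p = P_n.copy()
    p[target_id] = 0.0               # exclude the positive word
    p /= p.sum()                     # renormalise over the remaining words
    return rng.choice(len(p), size=k, replace=False, p=p)
```

An exponent below 1 flattens the distribution, so frequent words are still sampled most often but rare words appear as negatives more often than their raw frequency would give.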

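Key Findings

Under negative sampling, the gradient of the loss with respect to an output vector $ \mathbf{v}'_{w_j} $ is $ (\sigma({\mathbf{v}'_{w_j}}^T \mathbf{h}) - t_j) \mathbf{h} $, where $ t_j = 1 $ if $ w_j $ is the correct output word and $ t_j = 0 $ otherwise.
The negative sampling update touches only the output vectors of the positive word and the K sampled negative words, so each step costs far less than a full softmax, which updates the output vector of every word in the vocabulary.
The gradient with respect to the hidden-layer output, $ \partial E / \partial \mathbf{h} $, is the sum of the output vectors weighted by their prediction errors; this is the quantity that is backpropagated to the input vectors.
In CBOW the hidden layer is the average of the input vectors of the C context words, so each context word's input vector receives an equal share of that gradient: $ \mathbf{v}_{w_c}^{\text{new}} = \mathbf{v}_{w_c}^{\text{old}} - \eta \cdot \text{EH} / C $, where EH denotes $ \partial E / \partial \mathbf{h} $.
The paper notes that negative sampling yields high-quality word embeddings with substantially shorter training time, a result established empirically in prior work (Mikolov et al., 2013b).
The derivation shows that minimizing the negative sampling objective amounts to minimizing a contrastive loss, one that pushes the similarity score of the correct word above the scores of the negative samples.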
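
The findings above can be combined into one CBOW training step with negative sampling. The following is a minimal NumPy sketch under stated assumptions, not the reference word2vec implementation: the matrices `W_in` and `W_out` (one row per word) and the function name are hypothetical, the negative indices are supplied by the caller, and all indices are assumed distinct because NumPy's in-place fancy-index update does not accumulate over repeated indices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cbow_neg_sampling_step(W_in, W_out, context_ids, target_id, negative_ids, lr=0.1):
    """One SGD step of CBOW with negative sampling; returns the loss before the update.

    context_ids  : indices of the C context words (assumed distinct)
    target_id    : index of the true output word w_O
    negative_ids : indices of the K negative words (assumed distinct, none equal to w_O)
    """
    C = len(context_ids)
    h = W_in[context_ids].mean(axis=0)                 # hidden layer = average of context input vectors

    ids = np.concatenate(([target_id], negative_ids))  # the only K + 1 output vectors involved
    t = np.zeros(len(ids))
    t[0] = 1.0                                         # t_j = 1 for w_O, 0 for the negatives
    scores = W_out[ids] @ h
    e = sigmoid(scores) - t                            # prediction error sigma(v'_j^T h) - t_j

    # E = -log sigma(v'_O^T h) - sum over negatives of log sigma(-v'_j^T h)
    loss = -np.log(sigmoid(scores[0])) - np.sum(np.log(sigmoid(-scores[1:])))

    EH = e @ W_out[ids]                                # dE/dh = sum_j e_j * v'_j (uses the old v'_j)
    W_out[ids] -= lr * np.outer(e, h)                  # update only the K + 1 selected output vectors
    W_in[context_ids] -= lr * EH / C                   # each context word gets an equal 1/C share
    return loss
```

Only K + 1 rows of `W_out` change per step, regardless of vocabulary size, which is where the saving over the full softmax comes from.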


This review was generated by AI and checked by a human editor.