QUICK REVIEW

[Paper Review] Fuzzy paraphrases in learning word representations with a corpus and a lexicon.

Yuanzhi Ke, Masafumi Hagiwara|arXiv (Cornell University)|Nov 2, 2016

Natural Language Processing Techniques|Cited by 1

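ONE-SENTENCE SUMMARY

This paper proposes a method that improves word representations by selectively incorporating fuzzy paraphrases from a lexicon and using reliability scores to randomly drop unreliable paraphrases during training. The method reduces polysemy-related noise, outperforms prior approaches, and keeps a single vector per word without multi-vector modeling.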

ABSTRACT

A synonym of a polysemous word is usually only the paraphrase of one sense among many. When lexicons are used to improve vector-space word representations, such paraphrases are unreliable and bring noise to the vector-space. The prior works use a coefficient to adjust the overall learning of the lexicons. They regard the paraphrases equally. In this paper, we propose a novel approach that regards the paraphrases diversely to alleviate the adverse effects of polysemy. We annotate each paraphrase with a degree of reliability. The paraphrases are randomly eliminated according to the degrees when our model learns word representations. In this way, our approach drops the unreliable paraphrases, keeping more reliable paraphrases at the same time. The experimental results show that the proposed method improves the word vectors. Our approach is an attempt to address the polysemy problem keeping one vector per word. It makes the approach easier to use than the conventional methods that estimate multiple vectors for a word. Our approach also outperforms the prior works in the experiments.

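MOTIVATION AND OBJECTIVES

Address the polysemy problem in word representation learning: a synonym usually matches only one sense of a word, so treating all synonyms uniformly introduces noise.
Make lexicon-based word vector training more reliable by assigning a degree of reliability to each paraphrase relation.
Develop a method that reduces the adverse effects of ambiguous or incorrect paraphrases while keeping a single vector per word.
Outperform existing methods, which either treat all paraphrases equally or rely on complex multi-vector modeling.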

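PROPOSED METHOD

Each paraphrase in the lexicon is annotated with a reliability score reflecting how accurately it matches a specific word sense.
During training, paraphrases are randomly eliminated with a probability determined by their reliability scores, so that more trustworthy relations are more likely to be kept.
The lexicon is integrated into word representation learning through a weighted loss function that reduces the influence of unreliable paraphrase signals.
The method keeps a single vector per word, avoiding the complexity of word sense disambiguation or multi-vector approaches.
The model learns word vectors by optimizing a combination of a corpus-based loss and a lexicon-based regularization term, together with dynamic paraphrase filtering.
Reliability scores can be learned or assigned in advance based on linguistic confidence, enabling selective suppression of noisy paraphrases.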
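
The random elimination step can be illustrated with a short sketch. This is not the authors' implementation: the paraphrase list for "bank", its reliability values, and the function names `sample_paraphrases` and `survival_rates` are hypothetical, and keeping each paraphrase with a probability equal to its reliability score is only one assumed reading of "randomly eliminated according to the degrees".

```python
import random
from collections import Counter

def sample_paraphrases(paraphrases, rng):
    """Return the paraphrases that survive one training step.

    Assumption: each (word, reliability) pair is kept independently
    with probability equal to its reliability score in [0, 1].
    """
    return [w for w, reliability in paraphrases if rng.random() < reliability]

def survival_rates(paraphrases, epochs, seed=0):
    """Fraction of steps in which each paraphrase was kept."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(epochs):
        counts.update(sample_paraphrases(paraphrases, rng))
    return {w: counts[w] / epochs for w, _ in paraphrases}

# Hypothetical paraphrases of "bank" with made-up reliability scores.
BANK = [("lender", 0.9), ("shore", 0.4), ("tilt", 0.1)]

rates = survival_rates(BANK, epochs=10_000)
for word, rate in rates.items():
    print(f"{word}: kept in {rate:.1%} of steps")
```

Over many steps the reliable paraphrase contributes to training far more often than the unreliable ones, yet no paraphrase is removed from the lexicon outright. A hard threshold would instead discard low-scoring paraphrases permanently.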

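EXPERIMENTAL RESULTS

Research questions

RQ1: Does reliability-based paraphrase filtering improve word vector quality in the presence of polysemy?
RQ2: Does dynamically eliminating unreliable paraphrases during training yield better word representations than treating all paraphrases uniformly?
RQ3: Can a single-vector word representation model achieve better performance by using lexicon information selectively?
RQ4: How does the proposed method compare with prior methods that use a fixed coefficient or multiple vectors to handle polysemous words?

Key findings

The proposed method improves word vector quality by reducing the noise from unreliable paraphrases in the lexicon.
Compared with prior methods that apply a single coefficient uniformly to all paraphrases, the method performs better on word similarity and analogy tasks.
By favoring reliable paraphrases, the model performs comparably to more complex multi-vector methods while keeping a single vector per word.
Dynamically eliminating paraphrases based on reliability scores yields more robust and accurate word representations.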
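
For readers unfamiliar with word similarity tasks, the sketch below shows the usual scoring recipe: compute the cosine similarity of each word pair, then take the Spearman rank correlation with human ratings. It is a generic illustration, not the paper's evaluation code. The vectors and ratings are made-up toy values, and the helper names `cosine`, `ranks`, and `spearman` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Toy word vectors and human similarity ratings (illustrative only).
VECTORS = {
    "car": [1.0, 0.0],
    "auto": [0.9, 0.1],
    "bus": [0.6, 0.8],
    "tree": [0.0, 1.0],
}
PAIRS = [("car", "auto", 9.0), ("car", "bus", 6.0), ("car", "tree", 1.0)]

model_scores = [cosine(VECTORS[a], VECTORS[b]) for a, b, _ in PAIRS]
human_scores = [rating for _, _, rating in PAIRS]
rho = spearman(model_scores, human_scores)
print(f"Spearman correlation: {rho:.3f}")
```

In this toy data the model orders the three pairs exactly as the human ratings do, so the correlation is at its maximum. Real benchmarks contain hundreds of pairs, and a higher correlation indicates vectors that better match human judgments.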


This review was generated by AI and checked by a human editor.