QUICK REVIEW

[Paper Review] Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy

Jay J. Jiang, David W. Conrath | ArXiv.org | Sep 20, 1997

Topic Modeling | References: 19 | Cited by: 2,224

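One-Sentence Summary

The paper proposes a combined semantic similarity measure that joins the WordNet taxonomy with corpus statistics through a link-strength and information-content framework, and it correlates better with human judgments than earlier models.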

ABSTRACT

This paper presents a new approach for measuring semantic similarity/distance between words and concepts. It combines a lexical taxonomy structure with corpus statistical information so that the semantic distance between nodes in the semantic space constructed by the taxonomy can be better quantified with the computational evidence derived from a distributional analysis of corpus data. Specifically, the proposed measure is a combined approach that inherits the edge-based approach of the edge counting scheme, which is then enhanced by the node-based approach of the information content calculation. When tested on a common data set of word pair similarity ratings, the proposed approach outperforms other computational models. It gives the highest correlation value (r = 0.828) with a benchmark based on human similarity judgements, whereas an upper bound (r = 0.885) is observed when human subjects replicate the same task.

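Research Motivation and Objectives

Motivate the challenge of measuring semantic similarity in the presence of polysemy and taxonomy structure.
Develop a model that combines the edge-based and node-based (information content) approaches.
Introduce corpus-derived probabilities into link strength to compute semantic distance.
Evaluate the model against human similarity judgments, using WordNet noun senses.
Assess parameter sensitivity and discuss taxonomy-related biases in similarity.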

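Proposed Method

Characterize concepts by their information content (IC) and use it to compute concept similarity via the lowest common superordinate (Equations 1–3).
Model link strength (LS) as the negative log of P(child|parent) and relate it to the IC difference between child and parent (Equation 12).
Compute an overall edge weight that incorporates depth, local density, and link type (Equation 13).
Derive semantic distance as the sum of edge weights along the shortest path between two concepts (Equation 14).
Convert distance to similarity for comparison with human judgments (Equation 10).
Extract concept frequencies from SemCor and apply Good-Turing smoothing to handle data sparsity in the IC calculation.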
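
The steps above can be illustrated with a minimal Python sketch. It assumes the simplest case, in which each edge weight is just the link strength IC(child) - IC(parent), so the path sum through the lowest common superordinate telescopes to IC(c1) + IC(c2) - 2·IC(lcs). The taxonomy, the counts, and all function names are hypothetical. Real WordNet nodes can have several parents and words can have several senses, and the Good-Turing smoothing step is omitted here.

```python
import math

# Hypothetical toy taxonomy: child -> parent (the root has parent None).
PARENT = {
    "entity": None,
    "vehicle": "entity",
    "car": "vehicle",
    "bicycle": "vehicle",
    "fruit": "entity",
}

# Hypothetical corpus counts for each concept's own occurrences.
OWN_COUNT = {"entity": 0, "vehicle": 2, "car": 10, "bicycle": 6, "fruit": 16}


def ancestors(concept):
    """Return the chain [concept, parent, ..., root]."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain


def concept_probabilities():
    """A concept's frequency includes the counts of everything it subsumes."""
    freq = {c: 0 for c in PARENT}
    for concept, count in OWN_COUNT.items():
        for node in ancestors(concept):
            freq[node] += count
    total = freq["entity"]
    return {c: f / total for c, f in freq.items()}


PROB = concept_probabilities()


def ic(concept):
    """Information content: IC(c) = -log P(c)."""
    return -math.log(PROB[concept])


def link_strength(child):
    """LS = -log P(child | parent) = IC(child) - IC(parent)."""
    return ic(child) - ic(PARENT[child])


def lowest_common_subsumer(c1, c2):
    """First ancestor of c2 (walking upward) that is also an ancestor of c1."""
    seen = set(ancestors(c1))
    return next(node for node in ancestors(c2) if node in seen)


def path_distance(c1, c2):
    """Sum link strengths along the path c1 -> LCS -> c2."""
    lcs = lowest_common_subsumer(c1, c2)
    total = 0.0
    for start in (c1, c2):
        node = start
        while node != lcs:
            total += link_strength(node)
            node = PARENT[node]
    return total


def closed_form_distance(c1, c2):
    """Telescoped form: IC(c1) + IC(c2) - 2 * IC(lcs)."""
    lcs = lowest_common_subsumer(c1, c2)
    return ic(c1) + ic(c2) - 2 * ic(lcs)
```

In this toy data, `path_distance("car", "bicycle")` is smaller than `path_distance("car", "fruit")`, because the first pair shares the more informative superordinate `vehicle`, while the second pair only meets at the root.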

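Experimental Results

Research Questions

RQ1: Does combining edge-based hierarchical distance with node-based information-content similarity improve agreement with human semantic judgments?
RQ2: How do the density, depth, and link-type factors affect the proposed combined similarity measure?
RQ3: On a standard noun-pair data set, does the combined model outperform the node-based method of Resnik (1995) and the edge-based method?
RQ4: How sensitive is the model to the parameter settings α (influence of depth) and β (influence of density)?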
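
To make the roles of α and β concrete, the following Python sketch shows one way an edge weight could scale the IC difference by a density factor and a depth factor. This review does not reproduce the formula, so the functional form is an assumption based on how the paper's edge weight is commonly quoted, and it should be checked against the paper itself. The function name, argument names, and the choice to reject a parent depth of zero are all hypothetical.

```python
def edge_weight(ic_child, ic_parent, parent_depth, parent_density,
                mean_density, alpha=0.5, beta=0.3, link_type=1.0):
    """Assumed form (verify against the paper):

    weight = (beta + (1 - beta) * mean_density / parent_density)
             * ((parent_depth + 1) / parent_depth) ** alpha
             * (ic_child - ic_parent)
             * link_type
    """
    if parent_depth <= 0 or parent_density <= 0:
        raise ValueError("parent_depth and parent_density must be positive")
    # Density factor: edges in dense regions of the taxonomy count for less.
    density_factor = beta + (1.0 - beta) * (mean_density / parent_density)
    # Depth factor: edges deeper in the taxonomy count for less when alpha > 0.
    depth_factor = ((parent_depth + 1) / parent_depth) ** alpha
    return density_factor * depth_factor * (ic_child - ic_parent) * link_type


# Hypothetical sweep: how one edge's weight responds to alpha and beta.
sweep = {
    (a, b): edge_weight(2.0, 0.5, parent_depth=3, parent_density=8,
                        mean_density=4.0, alpha=a, beta=b)
    for a in (0.0, 0.5, 1.0)
    for b in (0.0, 0.3, 1.0)
}
```

With α = 0, β = 1, and a link-type factor of 1, both scaling factors equal 1 and the weight reduces to the plain IC difference, so under this assumed form the two parameters control how far the measure departs from a purely information-content-based distance.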

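Key Findings

The combined distance model correlates with human judgments at r = 0.828, clearly above the node-based (r = 0.794) and edge-based (r = 0.600) baselines.
The best observed parameters are α = 0.5 and β = 0.3; the value of β suggests that density has a noticeable but not dominant influence.
Using sense-tagged frequencies from SemCor with Good-Turing smoothing yields more accurate concept probabilities than raw word frequencies alone.
Removing the problematic furnace-stove pair noticeably raises the correlation of every model; a value of 0.8654 is cited for the combined model (see Table 4 of the paper).
The study shows that applying information content as a deciding factor alongside link strength yields a measurable improvement over Resnik's IC method.
Under the proposed weighting scheme, the approach remains a valid distance measure and satisfies the metric properties.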
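
The correlation figures above compare model scores with human ratings. As a rough illustration of that kind of evaluation, the sketch below computes a Pearson correlation coefficient in plain Python. It assumes that r is the Pearson coefficient, and the ratings and distances are invented for illustration only; they are not data from the paper. The linear distance-to-similarity step is likewise a stand-in, not the paper's Equation 10.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical human similarity ratings and model distances for five word pairs.
human = [3.9, 3.5, 2.1, 0.8, 0.1]
distance = [0.4, 1.1, 3.0, 5.2, 6.5]

# Assumed linear conversion: larger distance means smaller similarity.
similarity = [max(distance) - d for d in distance]

r = pearson_r(human, similarity)
```

Because Pearson correlation is unchanged by a positive linear rescaling and only flips sign under a negating one, correlating against this `similarity` list gives the same magnitude as correlating against `distance` directly.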


This review was generated by AI and reviewed by a human editor.