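[Paper Summary] An improved semantic similarity measure for document clustering based on topic maps
This paper proposes a new semantic similarity measure for document clustering based on topic maps. Documents are represented as structured knowledge graphs so that semantic relationships beyond keyword matching can be captured. Similarity is computed as the correlation between common sub-tree patterns in the topic maps. On text mining datasets, the method outperforms traditional vector-based and WordNet-based approaches, showing higher clustering effectiveness.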
A major computational burden, while performing document clustering, is the calculation of similarity measure between a pair of documents. Similarity measure is a function that assigns a real number between 0 and 1 to a pair of documents, depending upon the degree of similarity between them. A value of zero means that the documents are completely dissimilar whereas a value of one indicates that the documents are practically identical. Traditionally, vector-based models have been used for computing the document similarity. The vector-based models represent several features present in documents. These approaches to similarity measures, in general, cannot account for the semantics of the document. Documents written in human languages contain contexts and the words used to describe these contexts are generally semantically related. Motivated by this fact, many researchers have proposed semantic-based similarity measures by utilizing text annotation through external thesauruses like WordNet (a lexical database). In this paper, we define a semantic similarity measure based on documents represented in topic maps. Topic maps are rapidly becoming an industrial standard for knowledge representation with a focus for later search and extraction. The documents are transformed into a topic map based coded knowledge and the similarity between a pair of documents is represented as a correlation between the common patterns (sub-trees). The experimental studies on the text mining datasets reveal that this new similarity measure is more effective as compared to commonly used similarity measures in text clustering.
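To make the limitation of vector-based measures concrete, the following is a minimal sketch of one common baseline, cosine similarity over raw term-frequency vectors. The function name `cosine_similarity` and the whitespace tokenization are illustrative assumptions; the abstract does not say which vector-based measures the paper compares against.

```python
import math
from collections import Counter


def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Cosine similarity between raw term-frequency vectors, in [0, 1]."""
    # Assumption: naive lowercase + whitespace tokenization, no stemming.
    tf_a = Counter(doc_a.lower().split())
    tf_b = Counter(doc_b.lower().split())
    # Dot product over the terms the two documents share.
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty document is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


# Identical wording versus a paraphrase with no shared words.
same = cosine_similarity("the car is fast", "the car is fast")
paraphrase = cosine_similarity("the car is fast", "an automobile moves quickly")
print(same, paraphrase)
```

The paraphrase pair shares no surface terms, so a purely lexical measure scores it as completely dissimilar even though the two sentences mean nearly the same thing. This is the gap that semantic measures aim to close.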
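Research Motivation and Objectives
- Address the limitations of vector-based similarity measures in capturing the semantic meaning of documents.
- Improve the effectiveness of document clustering by using topic maps for structured knowledge representation.
- Develop a semantic similarity measure that captures contextual and relational semantics, going beyond lexical matching.
- Evaluate the proposed method against established similarity measures on standard text mining datasets.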
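Proposed Method
- Documents are converted into topic maps, which represent entities, concepts, and their relationships as a structured knowledge graph.
- Semantic similarity is computed by identifying and correlating the common sub-tree patterns in the topic maps of two documents.
- Semantic relatedness is quantified through structural alignment of topic-map sub-trees, which emphasizes shared conceptual structure.
- The similarity score is derived from the degree of overlap and the structural consistency of sub-tree patterns between a pair of documents.
- The method avoids relying on external lexical databases such as WordNet and instead uses the intrinsic structure of the documents for semantic inference.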
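The steps above can be sketched as follows. This is an illustration under stated assumptions, not the paper's algorithm: the nested-dict topic tree, the choice of complete rooted sub-trees as "patterns", and the use of cosine over pattern counts as the "correlation" are all hypothetical, since the summary does not give the exact definitions.

```python
import math
from collections import Counter


def subtree_patterns(tree: dict) -> Counter:
    """Count every complete rooted sub-tree, encoded as a canonical string.

    Assumption: a topic map fragment is a nested dict {topic: {subtopic: ...}}.
    """
    patterns = Counter()

    def encode(name: str, children: dict) -> str:
        # Sort child encodings so that sibling order does not matter.
        parts = sorted(encode(n, c) for n, c in children.items())
        code = name + "(" + ",".join(parts) + ")"
        patterns[code] += 1
        return code

    for name, children in tree.items():
        encode(name, children)
    return patterns


def topic_map_similarity(tree_a: dict, tree_b: dict) -> float:
    """Cosine over sub-tree pattern counts (an assumed stand-in for the
    paper's correlation between common patterns); result lies in [0, 1]."""
    pa, pb = subtree_patterns(tree_a), subtree_patterns(tree_b)
    dot = sum(pa[p] * pb[p] for p in pa.keys() & pb.keys())
    norm = math.sqrt(sum(v * v for v in pa.values())) * math.sqrt(
        sum(v * v for v in pb.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical topic trees for three small documents.
doc_a = {"vehicle": {"car": {"engine": {}, "wheel": {}}}}
doc_b = {"vehicle": {"car": {"engine": {}, "wheel": {}}, "truck": {}}}
doc_c = {"fruit": {"apple": {}}}

sim_ab = topic_map_similarity(doc_a, doc_b)
sim_ac = topic_map_similarity(doc_a, doc_c)
print(round(sim_ab, 3), sim_ac)
```

Here `doc_a` and `doc_b` share the whole `car` sub-tree and its leaves, so they score well above zero, while `doc_a` and `doc_c` share no pattern and score zero. A real implementation would also need the step that builds topic maps from raw text, which this sketch leaves out.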
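Experimental Results
Research Questions
- RQ1: Compared with vector space models, does a topic-map-based representation improve semantic similarity measurement in document clustering?
- RQ2: Is there a correlation between the structural similarity of topic-map sub-trees and human-annotated document similarity?
- RQ3: Does the proposed method outperform WordNet-based and traditional vector-based similarity measures in clustering accuracy?
- RQ4: To what extent does the method preserve semantic context and relational information between document pairs?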
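Key Findings
- The proposed topic-map-based similarity measure achieves higher clustering accuracy than traditional vector space models on benchmark text mining datasets.
- The method outperforms WordNet-based semantic similarity measures, most notably in capturing contextual and relational semantics.
- The correlation between common sub-tree patterns in topic maps reflects semantic similarity effectively, even for documents whose lexical content is not identical.
- The experimental results are reported to confirm that the method improves clustering quality while reducing the computational burden of similarity calculation.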
This summary was generated by AI and reviewed by a human editor.