QUICK REVIEW

[Paper Review] An improved semantic similarity measure for document clustering based on topic maps

Muhammad Rafi, M. Shahid Shaikh|arXiv (Cornell University)|Mar 17, 2013

Advanced Text Analysis Techniques | References: 11 | Citations: 23

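One-line summary

This paper proposes a new semantic similarity measure for document clustering based on topic maps. Each document is represented as a structured knowledge graph that captures semantic relationships beyond keyword matching, and similarity is computed as the correlation between the subtree patterns shared by two topic maps. On text mining datasets, the measure is reported to outperform conventional vector-based and WordNet-based measures and to improve clustering effectiveness.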

ABSTRACT

A major computational burden, while performing document clustering, is the calculation of similarity measure between a pair of documents. Similarity measure is a function that assigns a real number between 0 and 1 to a pair of documents, depending upon the degree of similarity between them. A value of zero means that the documents are completely dissimilar whereas a value of one indicates that the documents are practically identical. Traditionally, vector-based models have been used for computing the document similarity. The vector-based models represent several features present in documents. These approaches to similarity measures, in general, cannot account for the semantics of the document. Documents written in human languages contain contexts and the words used to describe these contexts are generally semantically related. Motivated by this fact, many researchers have proposed semantic-based similarity measures by utilizing text annotation through external thesauruses like WordNet (a lexical database). In this paper, we define a semantic similarity measure based on documents represented in topic maps. Topic maps are rapidly becoming an industrial standard for knowledge representation with a focus for later search and extraction. The documents are transformed into a topic map based coded knowledge and the similarity between a pair of documents is represented as a correlation between the common patterns (sub-trees). The experimental studies on the text mining datasets reveal that this new similarity measure is more effective as compared to commonly used similarity measures in text clustering.
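
To make the vector-based baseline mentioned in the abstract concrete, here is a minimal sketch of cosine similarity over raw term-frequency vectors. Whitespace tokenization and the function name `cosine_similarity` are assumptions made for illustration; the paper's actual baselines may use different weighting (for example TF-IDF).

```python
import math
from collections import Counter


def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Cosine similarity between raw term-frequency vectors of two texts.

    Assumption: lowercase whitespace tokenization, no stemming or weighting.
    The result lies in [0, 1] because term counts are non-negative.
    """
    tf_a = Counter(doc_a.lower().split())
    tf_b = Counter(doc_b.lower().split())

    # Only terms present in both documents contribute to the dot product.
    dot = sum(tf_a[term] * tf_b[term] for term in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(count * count for count in tf_a.values()))
    norm_b = math.sqrt(sum(count * count for count in tf_b.values()))

    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


# "car"/"automobile" and "fast"/"quick" are synonyms, yet they add nothing
# to the score: only the shared words "the" and "is" are counted.
score = cosine_similarity("the car is fast", "the automobile is quick")
```

The example pair illustrates the limitation the abstract describes: a purely lexical vector model cannot see that the two sentences mean nearly the same thing.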

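Motivation and objectives

To overcome the limitations of conventional vector-based similarity measures in capturing the semantics of documents.
To improve the effectiveness of document clustering by exploiting the structured knowledge representation that topic maps provide.
To develop a semantic similarity measure that captures contextual and relational meaning beyond lexical matching.
To compare the proposed measure against existing similarity measures on standard text mining datasets.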

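Proposed method

Each document is converted into a topic map, which represents its entities, concepts, and the relationships among them as a structured knowledge graph.
Semantic similarity is computed by identifying the subtree patterns common to the topic maps of two documents and calculating their correlation.
The structural consistency of the subtree patterns is used to quantify semantic correlation, with the focus on shared conceptual structure.
The similarity score is derived from the degree of overlap and the structural agreement of subtree patterns between the two documents.
Semantic inference relies on the intrinsic structure of the documents rather than on an external lexical database such as WordNet.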
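
The idea of scoring two documents by their shared subtree patterns can be sketched as follows. This is not the authors' formula: the review does not give the exact correlation, so the sketch substitutes a plain Jaccard overlap of complete-subtree encodings. The nested-dict topic map encoding and the names `subtree_patterns` and `topic_map_similarity` are hypothetical.

```python
from typing import Dict, Set

# Hypothetical encoding: each topic maps to a dict of its sub-topics.
TopicTree = Dict[str, "TopicTree"]


def subtree_patterns(tree: TopicTree) -> Set[str]:
    """Return a canonical string for the complete subtree under every topic."""
    patterns: Set[str] = set()

    def encode(label: str, children: TopicTree) -> str:
        # Sort child encodings so that sibling order does not matter.
        parts = sorted(encode(name, sub) for name, sub in children.items())
        code = f"{label}({','.join(parts)})" if parts else label
        patterns.add(code)
        return code

    for label, children in tree.items():
        encode(label, children)
    return patterns


def topic_map_similarity(a: TopicTree, b: TopicTree) -> float:
    """Jaccard overlap of subtree patterns, used here as a stand-in score."""
    pa, pb = subtree_patterns(a), subtree_patterns(b)
    if not pa and not pb:
        return 0.0
    return len(pa & pb) / len(pa | pb)


doc1 = {"vehicle": {"car": {"engine": {}}, "road": {}}}
doc2 = {"transport": {"car": {"engine": {}}, "rail": {}}}

# Shared patterns: "engine" and "car(engine)"; 6 distinct patterns overall.
similarity = topic_map_similarity(doc1, doc2)
```

Matching only complete subtrees keeps the sketch short; a fuller treatment would also match partial subtrees and weight larger shared structures more heavily.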

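Experimental results

Research questions

RQ1: Compared with the vector space model, can a topic-map-based representation improve semantic similarity measurement for document clustering?
RQ2: How well does the structural similarity of topic map subtrees correlate with document similarity as annotated by humans?
RQ3: Does the proposed measure achieve better clustering accuracy than WordNet-based and conventional vector-based similarity measures?
RQ4: To what extent does the measure preserve the semantic context and relational information in a document pair?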

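Key findings

The proposed topic-map-based similarity measure achieved higher clustering accuracy than the conventional vector space model on the text mining datasets used as benchmarks.
It also outperformed WordNet-based semantic similarity measures, particularly in its ability to capture contextual and relational meaning.
The correlation between common subtree patterns in topic maps effectively reflects semantic similarity even for documents whose lexical content is not identical.
According to the experimental results, the approach reduces the computational cost of similarity calculation while improving clustering quality.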
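
This review does not state which metric underlies "clustering accuracy". As one common possibility, the sketch below computes purity, the fraction of documents that fall in the majority class of their assigned cluster. The choice of purity and the function name `purity` are assumptions, not details taken from the paper.

```python
from collections import Counter, defaultdict
from typing import DefaultDict, Hashable, Sequence


def purity(cluster_ids: Sequence[Hashable], true_labels: Sequence[Hashable]) -> float:
    """Fraction of items that belong to the majority class of their cluster."""
    if len(cluster_ids) != len(true_labels):
        raise ValueError("cluster_ids and true_labels must have equal length")
    if not cluster_ids:
        return 0.0

    # Count how often each true label occurs inside each cluster.
    members: DefaultDict[Hashable, Counter] = defaultdict(Counter)
    for cluster_id, label in zip(cluster_ids, true_labels):
        members[cluster_id][label] += 1

    # Each cluster contributes the size of its largest class.
    majority_total = sum(max(counts.values()) for counts in members.values())
    return majority_total / len(cluster_ids)


# Two clusters of three documents, each with one misplaced document.
clusters = [0, 0, 0, 1, 1, 1]
labels = ["a", "a", "b", "b", "b", "a"]
example_purity = purity(clusters, labels)  # 4 of 6 documents match their majority
```

Purity rewards many small clusters, so comparisons between similarity measures are only meaningful when the number of clusters is held fixed.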


This review was generated by AI and checked by a human editor.