QUICK REVIEW

[Paper Review] A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques

Mehdi Allahyari, Seyed Amin Pouriyeh|arXiv (Cornell University)|Jul 10, 2017

Advanced Text Analysis Techniques | References: 123 | Citations: 512

One-Line Summary

This survey reviews the fundamental tasks and techniques of text mining, including preprocessing, representation, classification, and clustering, along with domain applications such as biomedical text mining.

ABSTRACT

The amount of text that is generated every day is increasing dramatically. This tremendous volume of mostly unstructured text cannot be simply processed and perceived by computers. Therefore, efficient and effective techniques and algorithms are required to discover useful patterns. Text mining is the task of extracting meaningful information from text, which has gained significant attentions in recent years. In this paper, we describe several of the most fundamental text mining tasks and techniques including text pre-processing, classification and clustering. Additionally, we briefly explain text mining in biomedical and health care domains.

Motivation and Objectives

Explain the key concepts, tasks, and relationships in text mining and knowledge discovery from text (KDT).
Describe the core preprocessing, representation, and learning approaches used in text mining.
Outline the supervised and unsupervised methods for classifying and clustering text data.
Discuss domain-specific applications such as biomedical text mining and sentiment analysis.

Proposed Method

Introduces the concept of text mining and distinguishes KDD from data mining.
Describes text representation with bag-of-words and vector space models (including TF-IDF).
Describes preprocessing steps such as tokenization, filtering, lemmatization, and stemming, and their effect on classification.
Reviews classification algorithms (Naive Bayes, nearest neighbor, decision trees, SVM) and evaluation metrics (precision, recall, F1).
Examines clustering approaches (hierarchical, k-means, probabilistic topics) and topic models (pLSA, LDA).
Highlights text mining in special domains such as information retrieval, NLP, information extraction, text summarization, and biomedical text mining.
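
The representation and preprocessing steps listed above can be sketched in a few lines of Python. This is a minimal illustration, not the survey's own code: the stop-word list, the sample documents, and the function names (`tokenize`, `tfidf_vectors`, `cosine`) are assumptions made for the example, and the weighting uses one common TF-IDF variant (raw term frequency times log(N / df)), which may differ from the formula given in the paper.

```python
import math
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not taken from the survey).
STOP_WORDS = {"the", "a", "of", "and", "is", "in", "to"}

def tokenize(text):
    """Split on whitespace, strip punctuation, lowercase, and drop stop words."""
    tokens = (tok.strip(".,;:!?()\"'").lower() for tok in text.split())
    return [tok for tok in tokens if tok and tok not in STOP_WORDS]

def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document, weight = tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Made-up sample documents for illustration only.
docs = [
    "Text mining extracts patterns from text.",
    "Clustering groups similar text documents.",
    "Gene expression in biomedical literature.",
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # positive: the documents share the term "text"
print(cosine(vecs[0], vecs[2]))  # zero: no shared terms
```

Sparse dictionaries are used here because most terms do not occur in most documents. Stemming and lemmatization are left out to keep the sketch short.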

Experimental Results

Research Questions

RQ1: What are the fundamental tasks and components of text mining?
RQ2: How do preprocessing, representation, and learning methods affect text mining performance?
RQ3: What are the main supervised and unsupervised techniques used for text classification and clustering?
RQ4: How are topic models and probabilistic methods applied to text data?
RQ5: What are the domain-specific considerations in biomedical text mining and sentiment analysis?

Key Findings

The paper consolidates the core text mining tasks: preprocessing, representation, classification, clustering, information retrieval, and information extraction.
Bag-of-words and vector space models with TF-IDF are central to document representation and similarity calculations.
It reviews a range of classification methods, including Naive Bayes, nearest neighbor, decision trees, and SVM, and discusses their relative strengths.
Clustering is presented through hierarchical, k-means, and probabilistic topic-model-based approaches (pLSA, LDA).
Topic models (LDA, pLSA) are identified as powerful unsupervised methods for discovering themes in text collections.
Domain-specific discussions cover information extraction, text summarization, opinion mining, and biomedical text mining.
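
As a concrete instance of one of the reviewed classifiers, the sketch below trains a multinomial Naive Bayes model with add-one (Laplace) smoothing on bag-of-words tokens and computes precision, recall, and F1 for one class. It is a hedged illustration rather than the survey's implementation: the toy sentiment data and the names `train_nb`, `predict_nb`, and `precision_recall_f1` are hypothetical, and the choice to ignore words unseen in training is an assumption of this example.

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_docs):
    """labeled_docs: list of (tokens, label). Returns (class counts, word counts, vocab)."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_nb(model, tokens):
    """Return the label with the highest log-posterior under add-one smoothing."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokens:
            if tok in vocab:  # assumption: skip words never seen in training
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1 for the class named by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy training data, invented for illustration only.
train = [
    ("great plot great acting".split(), "pos"),
    ("wonderful moving film".split(), "pos"),
    ("boring plot bad acting".split(), "neg"),
    ("dull bad film".split(), "neg"),
]
model = train_nb(train)
print(predict_nb(model, "great film".split()))
print(precision_recall_f1(["pos", "pos", "neg", "neg"], ["pos", "neg", "pos", "neg"], "pos"))
```

Working in log space avoids numeric underflow when documents are long, and the smoothing keeps a single unseen word-class pair from driving a class probability to zero.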

This review was generated by AI and checked by a human editor.