QUICK REVIEW

[Paper Review] A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques

Mehdi Allahyari, Seyed Amin Pouriyeh|arXiv (Cornell University)|Jul 10, 2017

Advanced Text Analysis Techniques|123 references|512 citations

TL;DR

This survey reviews fundamental text mining tasks and techniques, including preprocessing, representation, classification, clustering, and domain applications such as biomedical text mining.

ABSTRACT

The amount of text that is generated every day is increasing dramatically. This tremendous volume of mostly unstructured text cannot be simply processed and perceived by computers. Therefore, efficient and effective techniques and algorithms are required to discover useful patterns. Text mining is the task of extracting meaningful information from text, which has gained significant attention in recent years. In this paper, we describe several of the most fundamental text mining tasks and techniques including text pre-processing, classification and clustering. Additionally, we briefly explain text mining in biomedical and health care domains.

Motivation & Objective

Explain the key concepts, tasks, and relationships in text mining and knowledge discovery from text (KDT).
Describe core preprocessing, representation, and learning approaches used in text mining.
Outline supervised and unsupervised methods for classification and clustering in text data.
Discuss domain-specific applications such as biomedical text mining and sentiment analysis.

Proposed method

Introduce text mining concepts and distinguish KDD from data mining.
Describe text representation via bag-of-words and vector space models (including TF-IDF).
Present preprocessing steps (tokenization, filtering, lemmatization, stemming) and their impact on classification.
Review classification algorithms (Naive Bayes, nearest neighbor, decision trees, SVM) and evaluation metrics (precision, recall, F1).
Discuss clustering approaches (hierarchical, k-means, probabilistic topics) and topic models (pLSA, LDA).
Highlight text mining in special domains (information retrieval, NLP, information extraction, text summarization, and biomedical text mining).
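
To make the representation step concrete, here is a minimal sketch of bag-of-words TF-IDF weighting with cosine similarity, using only the Python standard library. It is an illustration under stated assumptions, not code from the surveyed paper: the stop-word list, the function names (`tokenize`, `tfidf_vectors`, `cosine`), the toy documents, and the particular weighting `tf * log(N / df)` are all choices made for this example, and other TF-IDF variants exist.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "of", "and", "is", "in", "to"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document.

    Weight = raw term frequency * log(N / document frequency).
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus (made up for this example).
docs = [
    "Text mining extracts patterns from text.",
    "Clustering groups similar text documents.",
    "Gene expression in biomedical literature.",
]
vecs = tfidf_vectors(docs)
sim_01 = cosine(vecs[0], vecs[1])  # the two documents share the term "text"
sim_02 = cosine(vecs[0], vecs[2])  # no shared terms
```

In this toy corpus the first two documents share one term, so their similarity is positive, while the first and third share none. Stemming or lemmatization would slot into `tokenize` before counting.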

Experimental results

Research questions

RQ1: What are the fundamental tasks and components of text mining?
RQ2: How do preprocessing, representation, and learning methods influence text mining performance?
RQ3: What are the main supervised and unsupervised techniques used for text classification and clustering?
RQ4: How are topic models and probabilistic methods applied to text data?
RQ5: What are domain-specific considerations in biomedical text mining and sentiment analysis?

Key findings

The paper consolidates core text mining tasks: preprocessing, representation, classification, clustering, information retrieval, and information extraction.
Bag-of-words with vector space models and TF-IDF are central to document representation and similarity calculations.
A range of classification methods are reviewed, including Naive Bayes, nearest neighbor, decision trees, and SVM, with discussion of their relative strengths.
Clustering is presented with hierarchical, k-means, and probabilistic/topic-model-based approaches (pLSA, LDA).
Topic models (LDA, pLSA) are identified as powerful unsupervised methods for discovering themes in text collections.
Domain-specific discussions include information extraction, text summarization, opinion mining, and biomedical text mining.
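
As a rough illustration of the classification and evaluation ideas listed above, the following sketch implements a small multinomial Naive Bayes classifier with add-one (Laplace) smoothing and computes precision, recall, and F1 for one class. Everything here is an assumption made for the example rather than something taken from the paper: the function names (`train_nb`, `predict_nb`, `precision_recall_f1`), the smoothing choice, the pre-tokenized toy documents, and the labels `"bio"` and `"finance"`.

```python
import math
from collections import Counter, defaultdict


def train_nb(labeled_docs):
    """Collect class and term counts from (tokens, label) pairs."""
    class_counts = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, term_counts, vocab


def predict_nb(model, tokens):
    """Return the label with the highest log posterior (add-one smoothing)."""
    class_counts, term_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)  # log prior
        denom = sum(term_counts[label].values()) + len(vocab)
        for t in tokens:
            if t in vocab:  # ignore terms never seen in training
                score += math.log((term_counts[label][t] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1 for a single class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy training set (made up for this example), already tokenized.
train = [
    (["gene", "protein", "cell"], "bio"),
    (["protein", "disease", "drug"], "bio"),
    (["stock", "market", "price"], "finance"),
    (["market", "trade", "bank"], "finance"),
]
model = train_nb(train)
label_a = predict_nb(model, ["protein", "drug"])
label_b = predict_nb(model, ["market", "bank"])

# Metrics on a hand-written set of true and predicted labels.
metrics = precision_recall_f1(
    ["bio", "bio", "finance", "finance"],
    ["bio", "finance", "finance", "finance"],
    positive="bio",
)
```

Working in log space avoids numeric underflow when many term probabilities are multiplied, and smoothing keeps a single unseen term from zeroing out a class.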

This review was created by AI and reviewed by human editors.