QUICK REVIEW

[Paper Review] A Comparative Study on TF-IDF feature Weighting Method and its Analysis using Unstructured Dataset

Mamata Das, Selvakumar Kamalanathan|arXiv (Cornell University)|Aug 8, 2023

Text and Document Classification Technologies | References: 14 | Citations: 28

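One-Sentence Summary

The paper compares TF-IDF and N-Gram feature representations for sentiment classification of unstructured reviews and shows that TF-IDF achieves the best performance with certain classifiers, most notably Random Forest.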

ABSTRACT

Text Classification is the process of categorizing text into the relevant categories and its algorithms are at the core of many Natural Language Processing (NLP). Term Frequency-Inverse Document Frequency (TF-IDF) and NLP are the most highly used information retrieval methods in text classification. We have investigated and analyzed the feature weighting method for text classification on unstructured data. The proposed model considered two features N-Grams and TF-IDF on the IMDB movie reviews and Amazon Alexa reviews dataset for sentiment analysis. Then we have used the state-of-the-art classifier to validate the method i.e., Support Vector Machine (SVM), Logistic Regression, Multinomial Naive Bayes (Multinomial NB), Random Forest, Decision Tree, and k-nearest neighbors (KNN). From those two feature extractions, a significant increase in feature extraction with TF-IDF features rather than based on N-Gram. TF-IDF got the maximum accuracy (93.81%), precision (94.20%), recall (93.81%), and F1-score (91.99%) value in Random Forest classifier.

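Motivation and Objectives

Show how feature weighting affects text classification on unstructured data.
Evaluate TF-IDF and N-Gram features on sentiment analysis tasks.
Evaluate multiple classifiers to validate the feature representations across datasets.
Demonstrate TF-IDF's performance gain over N-Gram features in practical sentiment analysis.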

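Proposed Method

Apply two feature extraction methods, N-Grams and TF-IDF, to IMDB and Amazon Alexa reviews for sentiment analysis.
Apply multiple classifiers: SVM, Logistic Regression, Multinomial Naive Bayes, Random Forest, Decision Tree, and KNN.
Compare accuracy, precision, recall, and F1-score for each feature method and classifier.
Report performance metrics to identify the best feature method and classifier combination.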
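
To make the two feature representations concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: this review does not say which TF-IDF variant, tokenizer, or N-Gram order the paper used, so the unsmoothed tf * log(N / df) weighting, the whitespace tokenizer, the bigram setting, and the toy reviews are all assumptions made for illustration. Production libraries typically add smoothing and normalization on top of this.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def ngram_counts(tokens, n=2):
    """Count contiguous n-grams in a single document."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def tfidf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    weights = []
    for toks in tokenized:
        tf = Counter(toks)
        weights.append({
            term: (count / len(toks)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights


# Hypothetical toy reviews, not taken from IMDB or Amazon Alexa.
reviews = [
    "great movie great cast",
    "boring movie",
    "great speaker",
]

bigrams = ngram_counts(tokenize(reviews[0]), n=2)
weights = tfidf(reviews)
```

In the first toy review, "great" occurs twice but also appears in another review, while "cast" occurs once and nowhere else, so "cast" ends up with the larger TF-IDF weight. N-Gram counts carry no such corpus-level discount, which is the contrast the comparison above turns on.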

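Experimental Results

Research Questions

RQ1: Does TF-IDF feature weighting provide a significant performance advantage over N-Gram features for sentiment analysis on unstructured datasets?
RQ2: Which classifier makes the best use of TF-IDF features on the given datasets?
RQ3: How do TF-IDF and N-Gram features compare in accuracy, precision, recall, and F1 on IMDB and Amazon Alexa reviews?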

Key Findings

TF-IDF features outperformed N-Gram features across the evaluated classifiers.
Random Forest achieved the best metrics with TF-IDF: accuracy 93.81%, precision 94.20%, recall 93.81%, F1-score 91.99%.
TF-IDF-based feature extraction showed a marked performance improvement over N-Gram-based features on unstructured datasets.
The study validates TF-IDF as an effective feature weighting method for sentiment analysis of movie and assistant reviews.
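
The reported recall (93.81%) is identical to the reported accuracy, which is what support-weighted averaging across classes always produces. The review does not state which averaging mode the paper used, so treat this as one plausible reading rather than a confirmed detail. The sketch below computes the four metrics under that assumption; the function name `weighted_metrics` and the labels are hypothetical and are not data from the paper.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1."""
    total = len(y_true)
    support = Counter(y_true)  # number of true samples per class
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / total
    precision = recall = f1 = 0.0
    for label, count in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        p_c = tp / predicted if predicted else 0.0
        r_c = tp / count
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = count / total  # class share of the true labels
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels (1 = positive, 0 = negative), imbalanced on purpose.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 0, 1]
scores = weighted_metrics(y_true, y_pred)
```

On these toy labels, weighted recall equals accuracy, precision is slightly higher, and weighted F1 falls below both. That ordering matches the reported figures, and it shows that under weighted averaging an F1 lower than both precision and recall is not by itself a sign of an error.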


This review was generated by AI and checked by a human editor.