QUICK REVIEW

[Paper Review] Yelp Dataset Challenge: Review Rating Prediction

Nabiha Asghar|arXiv (Cornell University)|May 17, 2016

Sentiment Analysis and Opinion Mining | References: 10 | Cited by: 35

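One-Sentence Summary

This paper proposes a multi-class classification approach that combines four feature extraction methods (unigrams, bigrams, trigrams, LSI) with four machine learning algorithms (logistic regression, Naive Bayes, perceptrons, linear SVM), yielding 16 models that predict Yelp review ratings (1 to 5 stars) from free-form review text. The best-performing model is logistic regression trained on the top 10,000 unigram and bigram features; it reaches an F1-score of 0.92 on the test set and outperforms all other models in cross-validation.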

ABSTRACT

Review websites, such as TripAdvisor and Yelp, allow users to post online reviews for various businesses, products and services, and have been recently shown to have a significant influence on consumer shopping behaviour. An online review typically consists of free-form text and a star rating out of 5. The problem of predicting a user's star rating for a product, given the user's text review for that product, is called Review Rating Prediction and has lately become a popular, albeit hard, problem in machine learning. In this paper, we treat Review Rating Prediction as a multi-class classification problem, and build sixteen different prediction models by combining four feature extraction methods, (i) unigrams, (ii) bigrams, (iii) trigrams and (iv) Latent Semantic Indexing, with four machine learning algorithms, (i) logistic regression, (ii) Naive Bayes classification, (iii) perceptrons, and (iv) linear Support Vector Classification. We analyse the performance of each of these sixteen models to come up with the best model for predicting the ratings from reviews. We use the dataset provided by Yelp for training and testing the models.
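
The sixteen models described in the abstract are simply every pairing of a feature extraction method with a learning algorithm. A minimal sketch of that grid follows; the string labels and the function name `model_grid` are illustrative choices, not identifiers from the paper.

```python
from itertools import product

# The four feature methods and four algorithms named in the abstract.
# The label strings themselves are illustrative.
FEATURE_METHODS = ["unigrams", "bigrams", "trigrams", "lsi"]
ALGORITHMS = ["logistic_regression", "naive_bayes", "perceptron", "linear_svc"]


def model_grid():
    """Return every (feature method, algorithm) pairing as a list of tuples."""
    return list(product(FEATURE_METHODS, ALGORITHMS))


if __name__ == "__main__":
    for features, algorithm in model_grid():
        print(f"{features} + {algorithm}")
```

Iterating over this grid and fitting one model per pair is one straightforward way to organize such a comparison.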

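Motivation and Objectives

To address the challenge of predicting star ratings from free-form review text, a key problem in sentiment analysis and recommender systems.
To evaluate how effective different combinations of feature extraction methods and machine learning algorithms are for rating prediction.
To identify the model configuration that gives accurate, generalizable review rating prediction on real-world Yelp data.
To provide a baseline and framework for future research on automated rating prediction in systems that lack explicit ratings.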

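Proposed Method

Treats review rating prediction as a five-class classification problem, with the star ratings as class labels.
Applies four feature extraction techniques to the review text: unigrams, bigrams, trigrams, and Latent Semantic Indexing (LSI).
Pairs each feature method with four supervised learning algorithms: logistic regression, Naive Bayes, perceptrons, and linear Support Vector Classification.
Uses k-fold cross-validation (k = 3) for model evaluation and hyperparameter tuning.
Keeps the top 10,000 features for each method to reduce dimensionality and computational cost.
Evaluates models with F1-score, precision, recall, and confusion matrices, and compares test-set performance against the validation results.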
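
The feature and validation steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the summary does not say how the top 10,000 features are ranked, so the sketch assumes ranking by raw corpus frequency, uses naive whitespace tokenization, and splits folds contiguously without shuffling. All function names are hypothetical.

```python
from collections import Counter


def extract_ngrams(text, n_values=(1, 2)):
    """Lowercase, split on whitespace, and return all n-grams for each n."""
    tokens = text.lower().split()
    grams = []
    for n in n_values:
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def top_k_vocabulary(texts, k=10_000, n_values=(1, 2)):
    """Map the k most frequent n-grams in the corpus to column indices.

    Assumption: "top" means most frequent; the source does not specify.
    """
    counts = Counter()
    for text in texts:
        counts.update(extract_ngrams(text, n_values))
    # Descending count, then alphabetical, so ties break deterministically.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {gram: index for index, (gram, _) in enumerate(ranked[:k])}


def vectorize(text, vocabulary, n_values=(1, 2)):
    """Turn one review into a dense count vector over the vocabulary."""
    vector = [0] * len(vocabulary)
    for gram in extract_ngrams(text, n_values):
        index = vocabulary.get(gram)
        if index is not None:
            vector[index] += 1
    return vector


def k_fold_indices(n_samples, k=3):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, validation
        start = stop


if __name__ == "__main__":
    reviews = ["great food great service", "bad food", "great place"]
    vocabulary = top_k_vocabulary(reviews, k=3)
    print(vocabulary)
    print(vectorize(reviews[0], vocabulary))
    for train, validation in k_fold_indices(7, k=3):
        print(train, validation)
```

In a real experiment the vocabulary would be built from the training folds only, so that no information from the validation fold leaks into feature selection, and the data would normally be shuffled or stratified by rating before splitting.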

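Experimental Results

Research Questions

RQ1: Which combination of feature extraction method and machine learning algorithm gives the highest prediction accuracy for Yelp review ratings?
RQ2: How well do the different n-gram and LSI-based feature representations capture sentiment and rating-related signals in the text?
RQ3: How much does model performance drop on the test set relative to cross-validation, and does the drop indicate potential overfitting?
RQ4: Can ordinal logistic regression, which accounts for the inherent ordering of star ratings (1 to 5), improve performance?
RQ5: How do non-linear models or advanced feature engineering (e.g. POS tagging, topic modeling) compare with linear models on this task?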
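
To make RQ4 concrete, one common way to exploit the ordering of star ratings is to replace the single five-class target with four cumulative binary targets of the form "rating > k", train one binary classifier per threshold, and recover class probabilities from the differences. The sketch below shows only that encoding and decoding; it is an illustration of the general idea, not something the paper is stated to have implemented, and the function names are hypothetical.

```python
def ordinal_targets(stars, num_classes=5):
    """Encode a rating in 1..num_classes as binary 'rating > k' targets."""
    if not 1 <= stars <= num_classes:
        raise ValueError(f"rating must be in 1..{num_classes}, got {stars}")
    return [1 if stars > k else 0 for k in range(1, num_classes)]


def class_probabilities(exceed_probs):
    """Convert P(rating > k), k = 1..K-1, into P(rating = k), k = 1..K.

    Differences are clipped at zero because independently trained binary
    classifiers are not guaranteed to produce monotone probabilities.
    """
    bounds = [1.0] + list(exceed_probs) + [0.0]
    return [max(0.0, bounds[i] - bounds[i + 1])
            for i in range(len(bounds) - 1)]


if __name__ == "__main__":
    print(ordinal_targets(3))
    print(class_probabilities([0.9, 0.7, 0.4, 0.1]))
```

Under this encoding, misclassifying a 5-star review as 4 stars flips one binary target, while misclassifying it as 1 star flips all four, which is how the ordering enters the training signal.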

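Key Findings

Logistic regression trained on the top 10,000 unigram and bigram features achieved the highest test-set F1-score (0.92) and reached 0.95 in cross-validation.
The model's test performance (F1: 0.92) is slightly below its validation performance (F1: 0.95), indicating mild overfitting.
Linear models, especially logistic regression and linear SVM, outperformed Naive Bayes and perceptrons on every feature set.
LSI-based features did not outperform the n-gram methods; models using LSI features had lower F1-scores.
The best model is robust and generalizes well: its performance is consistent across folds, indicating strong predictive power.
Future work on regularization, non-linear models, and advanced feature engineering (e.g. POS tagging, topic modeling) may improve performance further.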
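
For reference, the precision, recall, and F1 figures discussed above can all be derived from a confusion matrix. The sketch below assumes macro averaging (an unweighted mean over classes); the summary does not state which averaging the reported F1-scores use, and the function names are hypothetical.

```python
def per_class_scores(confusion):
    """Return (precision, recall, f1) per class.

    confusion[i][j] is the number of examples whose true class is i
    and whose predicted class is j.
    """
    n = len(confusion)
    scores = []
    for c in range(n):
        true_positive = confusion[c][c]
        predicted = sum(confusion[r][c] for r in range(n))  # column sum
        actual = sum(confusion[c])                          # row sum
        precision = true_positive / predicted if predicted else 0.0
        recall = true_positive / actual if actual else 0.0
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        scores.append((precision, recall, f1))
    return scores


def macro_f1(confusion):
    """Unweighted mean of the per-class F1-scores."""
    scores = per_class_scores(confusion)
    return sum(f1 for _, _, f1 in scores) / len(scores)


if __name__ == "__main__":
    # Two-class toy matrix for readability; a 5x5 matrix works the same way.
    toy = [[2, 0],
           [1, 1]]
    print(per_class_scores(toy))
    print(macro_f1(toy))
```

Computing the same metric on the validation folds and on the held-out test set gives the kind of gap reported above (0.95 versus 0.92).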


This review was generated by AI and checked by human editors.