QUICK REVIEW

[Paper Review] Predicting a Business Star in Yelp from Its Reviews Text Alone

Mingming Fan, Maryam Khademi|arXiv (Cornell University)|Jan 5, 2014

Sentiment Analysis and Opinion Mining | References: 10 | Cited by: 27

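One-Sentence Summary

This paper proposes a method for predicting a Yelp business's star rating (1–5) from the text of its reviews alone, with the aim of cancelling out the subjectivity of user-assigned ratings. Using bag-of-words features built from either the most frequent words in the raw text or the most frequent adjectives found by part-of-speech tagging, linear regression reaches a root mean square error (RMSE) of 0.6.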

ABSTRACT

Yelp online reviews are invaluable source of information for users to choose where to visit or what to eat among numerous available options. But due to overwhelming number of reviews, it is almost impossible for users to go through all reviews and find the information they are looking for. To provide a business overview, one solution is to give the business a 1-5 star(s). This rating can be subjective and biased toward users personality. In this paper, we predict a business rating based on user-generated reviews texts alone. This not only provides an overview of plentiful long review texts but also cancels out subjectivity. Selecting the restaurant category from Yelp Dataset Challenge, we use a combination of three feature generation methods as well as four machine learning models to find the best prediction result. Our approach is to create bag of words from the top frequent words in all raw text reviews, or top frequent words/adjectives from results of Part-of-Speech analysis. Our results show Root Mean Square Error (RMSE) of 0.6 for the combination of Linear Regression with either of the top frequent words from raw data or top frequent adjectives after Part-of-Speech (POS).

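Research Motivation and Objectives

Develop a method for predicting a Yelp business's rating without relying on user-supplied star ratings.
Reduce subjectivity and bias in business ratings by using only the text of reviews.
Provide an automated, scalable summary of long review texts for assessing a business.
Evaluate how well different feature extraction techniques and machine learning models predict ratings.
Determine whether sentiment and lexical features drawn from text alone can accurately predict star ratings.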

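Proposed Method

Build bag-of-words features from the most frequent words across all raw review texts.
Use part-of-speech (POS) tagging to extract the most frequent words and the most frequent adjectives from the review corpus as alternative feature sets.
Compare three feature generation methods: frequent words from raw text, frequent words after POS analysis, and frequent adjectives after POS analysis.
Train and evaluate four machine learning models, including linear regression, on these feature sets.
Use root mean square error (RMSE) as the primary evaluation metric.
Use the restaurant category of the Yelp Dataset Challenge as the data for training and testing.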
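
The feature-generation steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the names `tokenize`, `top_k_vocab`, and `bag_of_words`, the regex tokenizer, the toy reviews, and the tiny adjective lexicon are all assumptions made for the example. In particular, `TOY_ADJECTIVES` only stands in for a real POS tagger (for example NLTK's `pos_tag`); the summary does not say which tagger the paper used.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z']+")

def tokenize(text):
    """Lowercase a review and split it into word tokens."""
    return TOKEN_RE.findall(text.lower())

def top_k_vocab(reviews, k, keep=None):
    """Return the k most frequent tokens across all reviews.

    `keep` is a hypothetical hook: a predicate that returns True for tokens
    to count (e.g. adjectives), mimicking the POS-filtered variant.
    """
    counts = Counter()
    for review in reviews:
        for tok in tokenize(review):
            if keep is None or keep(tok):
                counts[tok] += 1
    return [word for word, _ in counts.most_common(k)]

def bag_of_words(text, vocab):
    """Count how often each vocabulary word occurs in one review."""
    counts = Counter(tokenize(text))
    return [counts[w] for w in vocab]

# Toy data, invented for illustration only.
reviews = [
    "Great food and great service",
    "Terrible service, bland food",
    "Great place, friendly staff",
]

# Variant 1: most frequent words from the raw text.
vocab = top_k_vocab(reviews, k=3)
features = [bag_of_words(r, vocab) for r in reviews]

# Variant 2: most frequent adjectives. The lexicon below is an assumption
# that replaces a real POS tagger so the sketch stays dependency-free.
TOY_ADJECTIVES = {"great", "terrible", "bland", "friendly"}
adj_vocab = top_k_vocab(reviews, k=2, keep=lambda tok: tok in TOY_ADJECTIVES)
adj_features = [bag_of_words(r, adj_vocab) for r in reviews]
```

Each review becomes a fixed-length count vector, so any of the regression models can consume it directly. The vocabulary size `k` is a free parameter here; the summary does not state the value used in the paper.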

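Experimental Results

Research Questions

RQ1: Can a business's rating be predicted accurately from the text of user reviews alone, without access to explicit star ratings?
RQ2: How do different feature extraction methods, raw frequent words versus POS-tagged adjectives, affect prediction performance?
RQ3: Which machine learning model performs best at predicting Yelp ratings from review text?
RQ4: To what extent can the sentiment and lexical content of reviews predict the overall business rating?
RQ5: What is the lowest error achievable when predicting a 1–5 star rating from text features alone?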

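Key Findings

Linear regression on the most frequent words from the raw text achieved an RMSE of 0.6.
The same model on the most frequent adjectives from POS analysis also achieved an RMSE of 0.6.
An RMSE of 0.6 on the 1–5 star scale was the best result among the feature and model combinations tested.
The feature set built from POS-tagged adjectives performed on par with the one built from raw frequent words.
The results suggest that business ratings can be predicted from textual sentiment and lexical content alone.
By relying entirely on review text, the approach aims to cancel out the subjectivity inherent in user-assigned ratings.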
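
To make the reported metric concrete, the sketch below fits a linear regression to count features and scores it with RMSE, the square root of the mean squared difference between predicted and actual star ratings. It rests on several assumptions: the data is invented, the model is fitted by plain batch gradient descent (the summary does not say how the paper fitted its regression), and the error is measured on the training data rather than a held-out set. It does not reproduce the paper's 0.6.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def predict(weights, bias, x):
    """Linear model: bias plus the weighted sum of the features."""
    return bias + sum(w * xi for w, xi in zip(weights, x))

def fit_linear_regression(X, y, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on mean squared error.

    Gradient descent is a stand-in chosen to keep the sketch dependency-free.
    """
    n, d = len(X), len(X[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        errors = [predict(weights, bias, x) - t for x, t in zip(X, y)]
        grad_w = [2.0 / n * sum(e * x[j] for e, x in zip(errors, X))
                  for j in range(d)]
        grad_b = 2.0 / n * sum(errors)
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias

# Toy data: counts of one positive and one negative word per business,
# with invented star ratings.
X = [[2, 0], [1, 0], [0, 1], [0, 2], [1, 1]]
y = [5, 4, 2, 1, 3]

weights, bias = fit_linear_regression(X, y)
model_rmse = rmse(y, [predict(weights, bias, x) for x in X])

# Baseline: always predict the mean rating.
mean_rating = sum(y) / len(y)
baseline_rmse = rmse(y, [mean_rating] * len(y))
```

Comparing `model_rmse` with `baseline_rmse` shows how much the text features add over always guessing the average rating. An RMSE of 0.6 means predictions are typically off by a bit more than half a star, which is easier to judge alongside such a baseline.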

This review was generated by AI and checked by a human editor.