QUICK REVIEW

[Paper Review] Is preprocessing of text really worth your time for online comment classification?

Fahim Mohammad|arXiv (Cornell University)|Jun 7, 2018

Hate Speech and Cyberbullying Detection | References: 18 | Cited by: 18

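One-Sentence Summary

This paper investigates whether extensive text preprocessing is needed when classifying online comments as toxic or constructive. Using four state-of-the-art models on the Jigsaw dataset, the study finds that minimal or no preprocessing often performs better than aggressive text transformation, challenging the common belief in the field that preprocessing significantly improves model accuracy.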

ABSTRACT

A large proportion of online comments on public domains are constructive; however, a significant proportion are toxic in nature. The comments contain a lot of typos, which increases the number of features manifold and makes the ML model difficult to train. Considering that data scientists spend approximately 80% of their time collecting, cleaning, and organizing their data [1], we explored how much effort we should invest in the preprocessing (transformation) of raw comments before feeding them to state-of-the-art classification models. With the help of four models on the Jigsaw toxic comment classification data, we demonstrated that training a model without any transformation produces a relatively decent model. Applying even basic transformations in some cases leads to worse performance, so they should be applied with caution.

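Research Motivation and Objectives

Evaluate how text preprocessing affects the performance of machine learning models in online comment classification.
Determine whether the time and effort spent on text preprocessing is justified in the context of toxic comment detection.
Compare model performance across preprocessing levels, from raw text to heavily transformed input.
Assess whether state-of-the-art models can achieve strong results without extensive data cleaning.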

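Proposed Method

The study trains four models, spanning deep learning and traditional machine learning, on the Jigsaw toxic comment classification dataset.
Preprocessing levels range from raw text (no transformation) to multiple stages, including lowercasing, special-character removal, and lemmatization.
Model performance is evaluated under each preprocessing configuration using standard metrics such as AUC-ROC and F1 score.
Experiments use a controlled setup to isolate the effect of preprocessing on model performance.
The analysis includes ablation studies to assess the contribution of each preprocessing step.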
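
The preprocessing levels and the AUC-ROC metric described above can be illustrated with a minimal sketch. This is not the author's code: the toy comments, the function names (`raw`, `lowercase`, `lowercase_strip_special`, `vocabulary`, `auc_roc`), and the exact regular expression are assumptions made for illustration. Lemmatization is left out because it needs an external library. The sketch shows how each transformation level shrinks the whitespace-token vocabulary, and how AUC-ROC can be computed from labels and scores as the fraction of positive/negative pairs ranked correctly.

```python
import re

# Hypothetical toy comments; the paper itself uses the Jigsaw toxic comment data.
COMMENTS = [
    "You are an IDIOT!!!",
    "you are an idiot",
    "Thanks for the helpful edit :)",
    "thanks, for the HELPFUL edit",
]


def raw(text):
    """Level 0: no transformation."""
    return text


def lowercase(text):
    """Level 1: lowercase only."""
    return text.lower()


def lowercase_strip_special(text):
    """Level 2: lowercase, then replace every character that is not
    a-z, 0-9, or whitespace with a space."""
    return re.sub(r"[^a-z0-9\s]", " ", text.lower())


# Assumed names for the preprocessing configurations being compared.
LEVELS = {
    "raw": raw,
    "lowercase": lowercase,
    "lowercase_strip_special": lowercase_strip_special,
}


def vocabulary(comments, transform):
    """Set of whitespace-separated tokens after applying `transform`."""
    vocab = set()
    for comment in comments:
        vocab.update(transform(comment).split())
    return vocab


def auc_roc(labels, scores):
    """AUC-ROC as the probability that a randomly chosen positive example
    scores higher than a randomly chosen negative one (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Vocabulary size per preprocessing level on the toy comments.
vocab_sizes = {name: len(vocabulary(COMMENTS, fn)) for name, fn in LEVELS.items()}

if __name__ == "__main__":
    for name, size in vocab_sizes.items():
        print(f"{name}: {size} distinct tokens")
    # Toy labels and scores, purely illustrative.
    print("AUC-ROC:", auc_roc([1, 0, 1, 0], [0.9, 0.1, 0.4, 0.6]))
```

A smaller vocabulary means fewer features, which is the usual argument for preprocessing. The same transformations also discard casing and punctuation, which may carry signal in toxic comments, so in a real comparison each level would be fed to the same model and scored with the same metric.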

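Experimental Results

Research Questions

RQ1: Does applying extensive text preprocessing to online comment data improve the performance of classification models?
RQ2: How does model performance change between raw text and different levels of preprocessing?
RQ3: Is the time invested in preprocessing justified by a measurable gain in classification accuracy?
RQ4: Can state-of-the-art models achieve strong performance without any text preprocessing?

Key Findings

Models trained on raw text without any preprocessing achieved competitive performance, often outperforming models trained on extensively preprocessed text.
Basic preprocessing steps such as lowercasing and punctuation removal sometimes reduced performance.
Lemmatization and advanced cleaning techniques did not consistently improve results and sometimes hurt performance.
The best-performing models were those trained with minimal preprocessing, suggesting that modern models can handle noisy raw text effectively.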

This review was generated by AI and reviewed by a human editor.