QUICK REVIEW

[Paper Review] Is preprocessing of text really worth your time for online comment classification?

Fahim Mohammad|arXiv (Cornell University)|Jun 7, 2018

Hate Speech and Cyberbullying Detection | 18 references | 18 citations

TL;DR

This paper investigates whether extensive text preprocessing is necessary for classifying online comments as toxic or constructive. Using four state-of-the-art models on the Jigsaw toxic comment classification dataset, it finds that training on raw, untransformed comments yields reasonably good models, and that even basic transformations sometimes hurt performance. This challenges the conventional wisdom that preprocessing significantly improves model accuracy in this domain.

ABSTRACT

A large proportion of online comments on public domains are constructive; however, a significant proportion are toxic in nature. The comments contain a lot of typos, which increases the number of features manifold and makes the ML model difficult to train. Considering that data scientists spend approximately 80% of their time collecting, cleaning, and organizing their data [1], we explored how much effort we should invest in the preprocessing (transformation) of raw comments before feeding them to state-of-the-art classification models. With the help of four models on the Jigsaw toxic comment classification data, we demonstrated that training a model without any transformation produces a relatively decent model. Applying even basic transformations leads, in some cases, to worse performance, so they should be applied with caution.

Motivation & Objective

To evaluate the impact of text preprocessing on the performance of machine learning models in classifying online comments.
To determine whether the time and effort spent on preprocessing text are justified in the context of toxic comment detection.
To compare model performance across various preprocessing levels, from raw text to heavily transformed inputs.
To assess whether state-of-the-art models can achieve strong results without extensive data cleaning.

Proposed method

The study uses four deep learning and traditional machine learning models trained on the Jigsaw toxic comment classification dataset.
Preprocessing levels range from raw text (no transformation) to multiple stages including lowercasing, removing special characters, and lemmatization.
Models are evaluated using standard metrics such as AUC-ROC and F1-score across different preprocessing configurations.
Experiments are conducted with controlled variable settings to isolate the effect of preprocessing on model performance.
The analysis includes ablation studies to assess the contribution of each preprocessing step.
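
The preprocessing levels and ablation setup described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's actual pipeline: the step functions, the names `cumulative_levels` and `leave_one_out`, and the tiny `TOY_LEMMAS` table are all hypothetical. In particular, `toy_lemmatize` is only a stand-in for a real lemmatizer, which would normally come from an NLP library.

```python
import re
from typing import Callable, Dict, List, Tuple

Transform = Callable[[str], str]


def lowercase(text: str) -> str:
    """Fold the comment to lower case."""
    return text.lower()


def strip_special_chars(text: str) -> str:
    """Replace every character that is not an ASCII letter, digit, or
    whitespace with a space, then collapse repeated whitespace."""
    cleaned = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", cleaned).strip()


# Hypothetical stand-in for lemmatization: a tiny lookup table.
# A real experiment would use a proper lemmatizer instead.
TOY_LEMMAS = {"comments": "comment", "idiots": "idiot", "are": "be", "was": "be"}


def toy_lemmatize(text: str) -> str:
    """Map each whitespace-separated token through the toy lemma table."""
    return " ".join(TOY_LEMMAS.get(tok, tok) for tok in text.split())


# Assumed ordering of steps, from lightest to heaviest transformation.
STEPS: List[Tuple[str, Transform]] = [
    ("lowercase", lowercase),
    ("strip_special_chars", strip_special_chars),
    ("toy_lemmatize", toy_lemmatize),
]


def cumulative_levels(steps: List[Tuple[str, Transform]]) -> Dict[str, List[Transform]]:
    """Build preprocessing levels: raw text first, then one more step per level."""
    levels: Dict[str, List[Transform]] = {"raw": []}
    names: List[str] = []
    applied: List[Transform] = []
    for name, fn in steps:
        names.append(name)
        applied.append(fn)
        levels["+".join(names)] = list(applied)
    return levels


def leave_one_out(steps: List[Tuple[str, Transform]]) -> Dict[str, List[Transform]]:
    """Build ablation variants: the full pipeline minus one step at a time."""
    return {
        f"without_{name}": [fn for other, fn in steps if other != name]
        for name, _ in steps
    }


def apply_all(text: str, transforms: List[Transform]) -> str:
    """Apply the transforms in order; an empty list leaves the text raw."""
    for fn in transforms:
        text = fn(text)
    return text


if __name__ == "__main__":
    comment = "You are IDIOTS!!!"
    for label, transforms in cumulative_levels(STEPS).items():
        print(f"{label}: {apply_all(comment, transforms)!r}")
```

Each variant of the corpus would then be fed to the same model with the same settings and scored with the same metric, so that any difference in AUC-ROC or F1 can be attributed to the preprocessing alone.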

Experimental results

Research questions

RQ1: Does applying extensive text preprocessing improve the performance of classification models on online comment data?
RQ2: How does model performance vary when using raw text versus various levels of preprocessing?
RQ3: Is the time investment in preprocessing justified by measurable improvements in classification accuracy?
RQ4: Can state-of-the-art models achieve strong performance without any text preprocessing?

Key findings

Models trained on raw text without any preprocessing achieved competitive performance, often outperforming those with extensive preprocessing.
Basic preprocessing steps such as lowercasing and removing punctuation sometimes led to performance degradation.
The use of lemmatization and advanced cleaning techniques did not consistently improve model results and occasionally hurt performance.
The study found that the most effective models were those trained with minimal preprocessing, suggesting that modern models can handle noisy, raw text effectively.

This review was created by AI and reviewed by human editors.