[Paper Review] Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems
This paper introduces the Equity Evaluation Corpus (EEC), a benchmark dataset of 8,640 English sentences designed to detect gender and race bias in sentiment analysis systems. Using the EEC, the authors evaluate 219 systems from SemEval-2018 Task 1 and find that over 75% exhibit statistically significant bias: they consistently assign higher sentiment intensity to sentences involving one gender or race than to otherwise identical sentences involving the other. Score differences reach as high as 0.34 on the 0 to 1 scale (34%).
Automatic machine learning systems can inadvertently accentuate and perpetuate inappropriate human biases. Past work on examining inappropriate biases has largely focused on just individual systems. Further, there is no benchmark dataset for examining inappropriate biases in systems. Here for the first time, we present the Equity Evaluation Corpus (EEC), which consists of 8,640 English sentences carefully chosen to tease out biases towards certain races and genders. We use the dataset to examine 219 automatic sentiment analysis systems that took part in a recent shared task, SemEval-2018 Task 1 'Affect in Tweets'. We find that several of the systems show statistically significant bias; that is, they consistently provide slightly higher sentiment intensity predictions for one race or one gender. We make the EEC freely available.
Motivation & Objective
- To identify and measure gender and race bias in automatic sentiment analysis systems.
- To develop a standardized benchmark dataset for evaluating fairness in NLP systems.
- To examine whether sentiment intensity predictions vary systematically based on the gender or race of individuals mentioned in text.
- To assess the extent to which different affect dimensions (e.g., anger, fear, valence) are affected by such biases.
- To provide a publicly available resource for developers and researchers to audit and improve the fairness of sentiment analysis systems.
Proposed method
- The Equity Evaluation Corpus (EEC) was constructed with 8,640 sentences carefully paired to differ only in a single word indicating gender or race, enabling controlled comparison.
- The EEC was used as a supplementary test set in SemEval-2018 Task 1, which evaluated sentiment and emotion intensity in tweets.
- Systems were evaluated by comparing their predicted sentiment intensity scores on sentence pairs differing only in gender or race.
- Statistical significance tests were applied to detect consistent score differences favoring one gender or race across multiple sentence pairs.
- A baseline SVM system trained on unigrams alone was evaluated to isolate bias originating from training data.
- The analysis compared bias across the affect dimensions of the shared task: anger, fear, joy, and sadness intensity, plus valence.
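The paired-comparison procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the templates, word lists, and the `toy_predict` scorer are hypothetical stand-ins, and because this review does not name the significance test that was used, the sketch substitutes a two-sided sign-flip permutation test on the mean paired difference.

```python
import random
from statistics import mean

# Illustrative templates and word lists (assumptions, not the EEC's actual contents).
TEMPLATES = ["{person} feels {emotion}.", "{person} made me feel {emotion}."]
EMOTION_WORDS = ["angry", "anxious", "happy"]
PERSON_PAIRS = [("my brother", "my sister"), ("this man", "this woman")]


def build_pairs(templates, emotion_words, person_pairs):
    """Return sentence pairs that differ only in the person phrase."""
    pairs = []
    for template in templates:
        for emotion in emotion_words:
            for first, second in person_pairs:
                s1 = template.format(person=first, emotion=emotion)
                s2 = template.format(person=second, emotion=emotion)
                # Capitalize the first letter so both sentences are well formed.
                pairs.append((s1[0].upper() + s1[1:], s2[0].upper() + s2[1:]))
    return pairs


def toy_predict(sentence):
    """Hypothetical scorer standing in for a real system's intensity output."""
    score = 0.5
    if "angry" in sentence:
        score += 0.2
    if "sister" in sentence or "woman" in sentence:
        score += 0.02  # deliberately injected bias, for demonstration only
    return score


def paired_differences(predict, pairs):
    """Score difference (first minus second) for every sentence pair."""
    return [predict(a) - predict(b) for a, b in pairs]


def sign_flip_p_value(diffs, n_resamples=5000, seed=0):
    """Two-sided sign-flip permutation test on the mean paired difference."""
    rng = random.Random(seed)
    observed = abs(mean(diffs))
    hits = 0
    for _ in range(n_resamples):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)


pairs = build_pairs(TEMPLATES, EMOTION_WORDS, PERSON_PAIRS)
diffs = paired_differences(toy_predict, pairs)
p_value = sign_flip_p_value(diffs)
print(f"{len(pairs)} pairs, mean difference {mean(diffs):+.3f}, p = {p_value:.4f}")
```

Because each pair differs in a single phrase, any consistent nonzero difference can be attributed to that phrase. To audit a real system, `toy_predict` would be replaced by a call to that system's scoring function.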
Experimental results
Research questions
- RQ1: Do sentiment analysis systems exhibit statistically significant bias in predicting sentiment intensity when the only difference between sentences is the gender of the person mentioned?
- RQ2: Do systems show similar bias when the only difference is the race of the person mentioned, particularly between European American and African American names?
- RQ3: How does the magnitude and direction of bias vary across different emotion intensity dimensions such as anger, fear, sadness, and valence?
- RQ4: To what extent is bias present in systems that do not use pre-trained word embeddings or external lexicons, suggesting data-level bias?
- RQ5: Can the same system show different bias patterns depending on the specific affect dimension it is predicting?
Key findings
- More than 75% of the 219 sentiment analysis systems evaluated showed statistically significant bias in sentiment intensity predictions based on gender or race.
- The average score difference was small, less than 0.03 (3% of the 0 to 1 score range), but some systems exhibited differences as high as 0.34 (34%).
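- Race-based bias was more prevalent than gender-based bias, and its direction depended on the affect dimension: most systems assigned higher anger, fear, and sadness intensity to sentences with African American names, and higher joy and valence scores to sentences with European American names.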
- Even a simple baseline SVM system trained only on unigrams showed small but significant bias, indicating that bias originates in the training data.
- The direction of gender bias varied by affect dimension: for example, more systems assigned higher fear scores to male-mentioning sentences, while more systems assigned higher anger, joy, and valence scores to female-mentioning sentences.
- Systems that showed no significant bias on the EEC tended to perform worse on the main SemEval-2018 test sets, suggesting a possible trade-off between fairness and accuracy.
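Headline figures of this kind (share of systems with significant bias, average and maximum score difference) come from aggregating per-system test results. The sketch below shows one way such a summary could be computed. The system names, numbers, and the 0.05 threshold are hypothetical and chosen only for illustration; they are not results from the paper.

```python
from collections import Counter
from statistics import mean

ALPHA = 0.05  # assumed significance threshold

# Hypothetical per-system results: (mean paired score difference, p-value).
results = {
    "system_a": (0.02, 0.001),
    "system_b": (-0.01, 0.030),
    "system_c": (0.05, 0.0001),
    "system_d": (0.003, 0.400),
}


def classify(mean_diff, p_value, alpha=ALPHA):
    """Label a system by whether, and in which direction, its scores differ."""
    if p_value >= alpha:
        return "no significant difference"
    return "group 1 higher" if mean_diff > 0 else "group 2 higher"


def summarize(system_results, alpha=ALPHA):
    """Aggregate per-system outcomes into summary statistics."""
    labels = Counter(classify(d, p, alpha) for d, p in system_results.values())
    significant = [abs(d) for d, p in system_results.values() if p < alpha]
    return {
        "share_significant": len(significant) / len(system_results),
        "mean_abs_diff": mean(significant) if significant else 0.0,
        "max_abs_diff": max(abs(d) for d, _ in system_results.values()),
        "counts": labels,
    }


summary = summarize(results)
print(summary)
```

Splitting systems by direction as well as significance matters here, because a single percentage would hide the fact that different systems, and different affect dimensions, can favor different groups.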
This review was created by AI and reviewed by human editors.