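[Paper Summary] Predicting Race and Ethnicity From the Sequence of Characters in a Name
This paper compares several models (KNN, RF, GB, LSTM, Transformer) that predict race/ethnicity from names, using both last-name and full-name data. It finds that the LSTM tends to perform best and that last-name and full-name models differ markedly in performance.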
To answer questions about racial inequality and fairness, we often need a way to infer race and ethnicity from names. One way to infer race and ethnicity from names is by relying on the Census Bureau's list of popular last names. The list, however, suffers from at least three limitations: 1. it only contains last names, 2. it only includes popular last names, and 3. it is updated once every 10 years. To provide better generalization, and higher accuracy when first names are available, we model the relationship between characters in a name and race and ethnicity using various techniques. A model using Long Short-Term Memory works best with out-of-sample accuracy of .85. The best-performing last-name model achieves out-of-sample accuracy of .81. To illustrate the utility of the models, we apply them to campaign finance data to estimate the share of donations made by people of various racial groups, and to news data to estimate the coverage of various races and ethnicities in the news.
Research Motivation and Objectives
- Motivation: studying inequality and fairness requires a way to infer race/ethnicity from names.
- Critique limitations of Census-based last-name lists (limited to last names, popularity bias, decennial updates).
- Develop and compare models that use character sequences from names to predict five ethno-racial categories.
- Assess generalization by testing on hold-out data and census-based datasets.
- Demonstrate practical applications in politics and media diversity.
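Proposed Method
- Convert names to title case, remove non-alphabetic characters, and concatenate last name + first name (or the full name).
- Explore several classifiers: KNN (edit distance), Random Forest, Gradient Boosting, LSTM, Transformer.
- Group the data by last name or by full name and compute the modal race/ethnicity category for each group.
- Split the data into training/validation/test sets in a 0.8/0.1/0.1 ratio.
- Evaluate out-of-sample accuracy, per category and overall, on each dataset (Florida voter data and Census data).
- Optionally augment the training data with synthetic data (no significant gain was found).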
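The preprocessing, grouping, and splitting steps listed above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names `normalize_name`, `modal_category`, and `split_names`, the label strings, and the choice to strip non-letters before title-casing are all assumptions made for the example.

```python
import random
import re
from collections import Counter, defaultdict


def normalize_name(last, first=None):
    """Strip non-letters, title-case each part, and join last + first.

    Assumption: non-letters are removed before title-casing, so
    "o'neil" becomes "Oneil". The summary does not specify the order.
    """
    parts = [last] if first is None else [last, first]
    cleaned = [re.sub(r"[^A-Za-z]", "", p).title() for p in parts]
    return " ".join(c for c in cleaned if c)


def modal_category(records):
    """Map each name to its most frequent label.

    `records` is an iterable of (name, label) pairs, one per person.
    """
    counts = defaultdict(Counter)
    for name, label in records:
        counts[name][label] += 1
    return {name: c.most_common(1)[0][0] for name, c in counts.items()}


def split_names(names, seed=0, train=0.8, val=0.1):
    """Shuffle unique names and split them into train/validation/test."""
    names = sorted(names)  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(names)
    n_train = int(len(names) * train)
    n_val = int(len(names) * val)
    return (
        names[:n_train],
        names[n_train:n_train + n_val],
        names[n_train + n_val:],
    )
```

Splitting on unique names rather than on individual records keeps the same name from appearing in both the training and test sets, which is one plausible reading of the grouping step; the paper itself may differ.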
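Experimental Results
Research Questions
- RQ1: How accurately can name character sequences predict race/ethnicity under different modeling approaches?
- RQ2: Do full-name models, which include first names, significantly improve predictive power over last-name-only models?
- RQ3: Which model type (KNN, RF/GB, LSTM, Transformer) achieves the best out-of-sample performance across the name datasets?
- RQ4: How do the models perform on the main ethno-racial categories (NH White, NH Black, Hispanic, Asian, Other) and overall?
- RQ5: What are the practical implications of name-based race inference in applications such as campaign finance and newsroom diversity?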
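Key Findings
- Last-name models: among the more complex models, the LSTM achieves the highest out-of-sample accuracy (0.81 overall; NH White 0.91; NH Black 0.50; Hispanic 0.84; Asian 0.40; Other 0.04).
- Full-name models: the LSTM outperforms the last-name model, with an overall accuracy of 0.85 (NH White 0.92; NH Black 0.76; Hispanic 0.86; Asian 0.63; Other 0.07).
- The KNN baselines are competitive: the last-name KNN (cosine distance) reaches about 0.78 on a 52k hold-out set, and the full-name KNN about 0.73.
- Among the full-name models, the LSTM again dominates the other architectures (RF, GB, Transformer), both overall and per category.
- Adding synthetic data did not significantly improve accuracy.
- Applied examples: campaign donations by race (Florida full-name LSTM) and newsroom diversity (Top News data) show racial skew both among authors and among the people mentioned.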
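To make the overall and per-category figures above concrete, here is a small sketch of how such numbers could be computed from predictions. It assumes that "per-category accuracy" means the share of true members of each category that the model labels correctly (per-class recall); the summary does not say which metric the paper reports. The function name `accuracy_report` and the label strings are hypothetical.

```python
from collections import Counter


def accuracy_report(y_true, y_pred):
    """Return (overall accuracy, {category: share of true members predicted correctly})."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    totals = Counter(y_true)
    # Count correct predictions, keyed by the true category.
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    per_category = {label: correct[label] / n for label, n in totals.items()}
    overall = sum(correct.values()) / len(y_true)
    return overall, per_category
```

Under this definition, a dominant category can pull overall accuracy well above the accuracy on smaller categories, which matches the pattern in the reported numbers (for example, 0.85 overall against 0.07 for Other).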
This summary was generated by AI and reviewed by human editors.