QUICK REVIEW

[Paper Review] Predicting Race and Ethnicity From the Sequence of Characters in a Name

Rajashekar Chintalapati, Suriyan Laohaprapanon | arXiv (Cornell University) | 2018. 05. 05.

Names, Identity, and Discrimination Research | Citations: 101

One-Line Summary

The paper predicts race and ethnicity from last-name and full-name data using several models (KNN, RF, GB, LSTM, Transformer). It finds that LSTM generally performs best and that full-name models clearly outperform last-name models.

ABSTRACT

To answer questions about racial inequality and fairness, we often need a way to infer race and ethnicity from names. One way to infer race and ethnicity from names is by relying on the Census Bureau's list of popular last names. The list, however, suffers from at least three limitations: 1. it only contains last names, 2. it only includes popular last names, and 3. it is updated once every 10 years. To provide better generalization, and higher accuracy when first names are available, we model the relationship between characters in a name and race and ethnicity using various techniques. A model using Long Short-Term Memory works best with out-of-sample accuracy of .85. The best-performing last-name model achieves out-of-sample accuracy of .81. To illustrate the utility of the models, we apply them to campaign finance data to estimate the share of donations made by people of various racial groups, and to news data to estimate the coverage of various races and ethnicities in the news.

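Research Motivation and Goals

Argues that studying inequality and fairness often requires inferring race and ethnicity from names.
Criticizes the limits of the Census last-name list: it covers last names only, is biased toward popular names, and is updated only once every 10 years.
Develops and compares models that predict five race-ethnicity categories from the sequence of characters in a name.
Evaluates generalization on holdout data and on Census-based datasets.
Demonstrates practical applications to diversity in politics and the media.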

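Proposed Method

Convert names to title case, remove non-alphabetic characters, and concatenate last name and first name to form the full-name input.
Explore several classifiers: KNN (edit distance), Random Forest, Gradient Boosted Trees, LSTM, and Transformer.
Group the data by last name or by full name and compute the modal race/ethnicity category for each group.
Split the data into training, validation, and test sets in a 0.8/0.1/0.1 ratio.
Evaluate out-of-sample accuracy, overall and per category, on the Florida voter registration data and the Census data.
Optionally augment the training data with synthetic data, which yielded no clear gain.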
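The preprocessing, grouping, and splitting steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names (`normalize`, `modal_labels`, `split_keys`), the `(last, first, label)` record layout, the space used to join last and first name, and the fixed shuffle seed are all assumptions made for the example.

```python
import random
import re
from collections import Counter, defaultdict


def normalize(name):
    """Title-case a name, then drop every non-alphabetic character."""
    return re.sub(r"[^A-Za-z]", "", name.title())


def modal_labels(records, use_first_name=True):
    """Map each distinct name to its most frequent race/ethnicity label.

    records: iterable of (last, first, label) tuples (assumed layout).
    Ties go to the label seen first for that name.
    """
    counts = defaultdict(Counter)
    for last, first, label in records:
        key = normalize(last)
        if use_first_name:
            # Assumption: last and first name are joined with a space.
            key = key + " " + normalize(first)
        counts[key][label] += 1
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}


def split_keys(keys, seed=0):
    """Shuffle the distinct names and split them 0.8/0.1/0.1."""
    keys = sorted(keys)  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(keys)
    n_train = len(keys) * 8 // 10
    n_val = len(keys) // 10
    return (
        keys[:n_train],
        keys[n_train:n_train + n_val],
        keys[n_train + n_val:],
    )
```

Splitting on distinct names rather than on raw records keeps the same name from appearing in both the training and test sets, which is one plausible reading of "out-of-sample" here.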

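Experimental Results

Research Questions

RQ1: How accurately can race/ethnicity be predicted from the character sequence of a name using different modeling approaches?
RQ2: Does a full-name model (one that includes the first name) substantially improve predictive power over a last-name-only model?
RQ3: Which model type (KNN, RF, GB, LSTM, Transformer) achieves the best out-of-sample performance on the name datasets?
RQ4: How do the models perform on each major race-ethnicity category (NH White, NH Black, Hispanic, Asian, Other) and overall?
RQ5: What do the results of name-based race inference imply for real applications such as campaign finance and media diversity?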

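Key Results

Among the more complex last-name models, LSTM has the highest out-of-sample accuracy (0.81 overall; NH White 0.91; NH Black 0.50; Hispanic 0.84; Asian 0.40; Other 0.04).
The full-name LSTM outperforms the last-name model, with an overall accuracy of 0.85 (NH White 0.92; NH Black 0.76; Hispanic 0.86; Asian 0.63; Other 0.07).
The KNN baseline is also competitive: last-name KNN (cosine distance) reaches about 0.78 accuracy on a 52k holdout, and full-name KNN about 0.73 overall.
Among the full-name models, LSTM also beats RF, GB, and Transformer both overall and per category.
Adding synthetic data does not meaningfully improve accuracy.
Concrete applications are presented: campaign contributions by race (Florida full-name LSTM) and newsroom diversity (Top News data), which show demographic skew among authors and the people mentioned.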
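To read the overall and per-category figures above side by side, it helps to see how the two kinds of number are computed. The sketch below assumes that a per-category figure is the share of names whose true label is that category and that were predicted correctly; the review does not state which metric the per-category values are, so treat this as an assumption. The function name `accuracy_report` is hypothetical.

```python
from collections import Counter


def accuracy_report(y_true, y_pred):
    """Return (overall accuracy, {category: within-category accuracy})."""
    # Number of names per true category.
    total = Counter(y_true)
    # Number of correctly predicted names per true category.
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    per_category = {cat: correct[cat] / n for cat, n in total.items()}
    overall = sum(correct.values()) / len(y_true)
    return overall, per_category
```

Because the overall figure weights every name equally, it is dominated by the largest categories. That is how an overall accuracy of 0.81 can coexist with a value as low as 0.04 for a small category such as Other.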


This review was generated by AI and checked by a human editor.