QUICK REVIEW

[Paper Review] Nationality and Region Prediction from Names: A Comparative Study of Neural Models and Large Language Models

Keito Inoshita|arXiv (Cornell University)|Jan 13, 2026

Names, Identity, and Discrimination Research | Cited by 1

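One-Sentence Summary

This paper systematically compares six neural models and six large language model (LLM) prompting strategies for predicting nationality and region from personal names. LLMs outperform the neural models at every granularity level, the gap narrows as granularity becomes coarser, and the two model families show distinct error patterns.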

ABSTRACT

Predicting nationality from personal names has practical value in marketing, demographic research, and genealogical studies. Conventional neural models learn statistical correspondences between names and nationalities from task-specific training data, posing challenges in generalizing to low-frequency nationalities and distinguishing similar nationalities within the same region. Large language models (LLMs) have the potential to address these challenges by leveraging world knowledge acquired during pre-training. In this study, we comprehensively compare neural models and LLMs on nationality prediction, evaluating six neural models and six LLM prompting strategies across three granularity levels (nationality, region, and continent), with frequency-based stratified analysis and error analysis. Results show that LLMs outperform neural models at all granularity levels, with the gap narrowing as granularity becomes coarser. Simple machine learning methods exhibit the highest frequency robustness, while pre-trained models and LLMs show degradation for low-frequency nationalities. Error analysis reveals that LLMs tend to make "near-miss" errors, predicting the correct region even when nationality is incorrect, whereas neural models exhibit more cross-regional errors and bias toward high-frequency classes. These findings indicate that LLM superiority stems from world knowledge, model selection should consider required granularity, and evaluation should account for error quality beyond accuracy.

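Research Motivation and Objectives

Motivate nationality and region prediction from personal names through applications such as marketing, demographic research, and genealogy.
Assess how well neural models and LLMs generalize across nationalities of different frequencies and how well they distinguish similar nationalities within the same region.
Analyze prediction performance at three granularity levels (nationality, region, and continent) and identify error patterns.
Examine how prompt design and model selection affect LLM capability on a knowledge-intensive classification task.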

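Proposed Method

Evaluate six neural baselines: an SVM over character n-grams, fastText, CNN, BiLSTM, CANINE, and XLM-RoBERTa.
Evaluate six LLM prompting strategies: zero-shot, few-shot, chain-of-thought, self-consistency, least-to-most, and self-reflection.
Use a dataset derived from name2nat, filtered to 99 nationalities and divided into train/validation/test sets at 8:1:1 with stratified sampling.
Measure performance with accuracy, Macro-F1, and Precision@k (k = 2, 3, 5), and run a frequency-based stratified analysis (head/mid/tail).
Run each neural model with three random seeds and prompt GPT-4.1-mini through the API for the LLM strategies; report mean ± standard deviation.
Provide a hierarchical evaluation framework that compares fine-grained (nationality) with coarse-grained (region, continent) prediction and analyzes error types.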
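
The metrics and the hierarchical evaluation listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the toy `REGION_OF` mapping, and the sample predictions are all invented for the example, and Precision@k is read here as the share of names whose true label appears among the top k ranked predictions, which is one common interpretation and may differ from the paper's exact definition.

```python
from collections import defaultdict

# Hypothetical nationality -> region mapping (illustrative only;
# the paper's actual taxonomy may differ).
REGION_OF = {
    "Japanese": "East Asia",
    "Korean": "East Asia",
    "German": "Western Europe",
    "French": "Western Europe",
}

def accuracy(gold, pred):
    """Fraction of items whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[g] += 1  # g was missed
    scores = []
    for label in set(gold) | set(pred):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)

def precision_at_k(gold, ranked, k):
    """Share of items whose gold label is among the top-k ranked predictions."""
    return sum(g in r[:k] for g, r in zip(gold, ranked)) / len(gold)

def coarsen(labels, mapping):
    """Map fine-grained labels (nationalities) to a coarser level (regions)."""
    return [mapping[label] for label in labels]

# Toy data: each inner list is a model's ranked guesses for one name.
gold = ["Japanese", "Korean", "German", "French"]
ranked = [
    ["Japanese", "Korean"],
    ["Japanese", "Korean"],
    ["French", "German"],
    ["French", "German"],
]
top1 = [r[0] for r in ranked]

print("nationality accuracy:", accuracy(gold, top1))
print("nationality macro-F1:", round(macro_f1(gold, top1), 3))
print("precision@2:", precision_at_k(gold, ranked, 2))
print("region accuracy:",
      accuracy(coarsen(gold, REGION_OF), coarsen(top1, REGION_OF)))
```

In this toy case the top-1 nationality accuracy is only 0.5, yet every prediction lands in the correct region, so the region-level accuracy is 1.0. That is the kind of contrast the hierarchical framework is meant to expose.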

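Experimental Results

Research Questions

RQ1: How do neural models compare with LLMs at predicting nationality and region from names across different granularity levels?
RQ2: Which prompting strategies help LLMs make better use of pre-trained world knowledge on this task, and how robust are their predictions for low-frequency nationalities?
RQ3: How does prediction granularity affect the performance gap between neural models and LLMs, and what does this imply for model selection?
RQ4: What qualitative error patterns do LLMs and neural models show on this task (near-miss errors versus cross-regional errors)?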
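
Of the prompting strategies that RQ2 concerns, self-consistency is the easiest to illustrate in code: the model is sampled several times and the most frequent answer is kept. The sketch below shows only the voting step. The hard-coded `samples` list stands in for repeated LLM calls, and the function name and the lowercase/strip normalization are assumptions made for the example, not details taken from the paper.

```python
from collections import Counter

def self_consistency_vote(sampled_answers):
    """Return the most frequent answer and the share of samples that agree with it."""
    # Normalize lightly so that "Korean" and "korean " count as the same vote.
    counts = Counter(answer.strip().lower() for answer in sampled_answers)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(sampled_answers)

# Stand-in for five sampled completions for a single name (hypothetical values).
samples = ["Korean", "korean ", "Japanese", "Korean", "Chinese"]
label, agreement = self_consistency_vote(samples)
print(label, agreement)
```

The agreement share is a convenient by-product: a low value flags names on which the sampled answers disagree.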

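Key Findings

LLMs outperform neural models on nationality prediction at every granularity level.
The performance gap between LLMs and neural models narrows as granularity coarsens from nationality to region to continent.
Simple machine learning methods are the most robust across nationality frequency levels, whereas pre-trained models and LLMs degrade on low-frequency nationalities.
LLMs tend to make near-miss errors, predicting the correct region while getting the nationality wrong, whereas neural models make more cross-regional errors and are biased toward high-frequency classes.
Prompt design affects LLM performance; self-consistency and the zero-shot and few-shot variants performed strongly in this study.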
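
The distinction between near-miss and cross-regional errors can be made concrete with a small classifier over prediction pairs. This is a hedged sketch: the three category names, the `REGION_OF` mapping, and the sample labels are assumptions for illustration, and the paper's own error taxonomy may be finer-grained.

```python
from collections import Counter

# Hypothetical nationality -> region mapping (illustrative only).
REGION_OF = {
    "Japanese": "East Asia",
    "Korean": "East Asia",
    "German": "Western Europe",
    "French": "Western Europe",
}

def error_type(gold, pred, region_of):
    """Classify one prediction as correct, near_miss, or cross_regional."""
    if gold == pred:
        return "correct"
    gold_region = region_of.get(gold)
    # Wrong nationality but same region counts as a near miss.
    if gold_region is not None and gold_region == region_of.get(pred):
        return "near_miss"
    return "cross_regional"

def error_profile(gold_labels, pred_labels, region_of):
    """Return the share of predictions falling into each category."""
    counts = Counter(
        error_type(g, p, region_of) for g, p in zip(gold_labels, pred_labels)
    )
    total = len(gold_labels)
    return {
        kind: counts[kind] / total
        for kind in ("correct", "near_miss", "cross_regional")
    }

gold_labels = ["Korean", "Korean", "German", "French"]
pred_labels = ["Korean", "Japanese", "French", "Japanese"]
profile = error_profile(gold_labels, pred_labels, REGION_OF)
print(profile)
```

Two models with the same accuracy can have very different profiles, which is why the findings above argue for looking at error quality and not only at accuracy.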

This review was generated by AI and checked by a human editor.