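[Paper Explained] Conceptual Cultural Index: A Metric for Cultural Specificity via Relative Generality
CCI is a sentence-level metric that quantifies cultural specificity by comparing the target culture's generality score with the average generality across other cultures, enabling culture-aware evaluation of language models.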
Large language models (LLMs) are increasingly deployed in multicultural settings; however, systematic evaluation of cultural specificity at the sentence level remains underexplored. We propose the Conceptual Cultural Index (CCI), which estimates cultural specificity at the sentence level. CCI is defined as the difference between the generality estimate within the target culture and the average generality estimate across other cultures. This formulation enables users to operationally control the scope of culture via comparison settings and provides interpretability, since the score derives from the underlying generality estimates. We validate CCI on 400 sentences (200 culture-specific and 200 general), and the resulting score distribution exhibits the anticipated pattern: higher for culture-specific sentences and lower for general ones. For binary separability, CCI outperforms direct LLM scoring, yielding more than a 10-point improvement in AUC for models specialized to the target culture. Our code is available at https://github.com/IyatomiLab/CCI .
Research Motivation and Goals
- Define a sentence-level metric for cultural specificity (CCI) that is interpretable and controllable via the set of comparison cultures.
- Demonstrate that CCI provides clearer separation between culture-specific and general sentences than direct LLM scoring.
- Show how CCI can stratify benchmarks and reveal performance shifts as cultural specificity varies.
- Provide guidance on how CCI can be used for culture-aware evaluation and data curation.
Proposed Method
- Use an LLM to estimate sentence generality p_c(x) for each culture c in a set C.
- Compute per-sentence CCI(x; t, C) as the difference between the target culture generality and the average generality across other cultures: $\mathrm{CCI}(x; t, C) = \bar{p}_t(x) - \frac{1}{|C| - 1} \sum_{c \in C \setminus \{t\}} \bar{p}_c(x)$.
- Average results over N independent runs to mitigate variability (N=3 in experiments).
- Optionally compare with a direct-output baseline that predicts a [0,1] culture-specificity score.
- Investigate controllability by varying C (Global mode with 19 economies vs. Custom mode with neighboring cultures).
- Apply CCI to stratify benchmarks by CCI levels and analyze model performance shifts.
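The scoring steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the per-culture generality estimates have already been obtained from an LLM and lie in [0, 1], and that the barred quantities in the formula are averages over the N independent runs. The function names, culture codes, and numeric values are all hypothetical.

```python
from statistics import mean


def averaged_generality(runs):
    """Average each culture's generality estimate over independent runs.

    `runs` is a list of dicts, one per run, mapping a culture label to a
    generality score (assumed here to lie in [0, 1]).
    """
    cultures = runs[0].keys()
    return {c: mean(run[c] for run in runs) for c in cultures}


def cci(runs, target):
    """Target-culture generality minus the mean generality of the other cultures."""
    p_bar = averaged_generality(runs)
    others = [p for c, p in p_bar.items() if c != target]
    if not others:
        raise ValueError("C must contain at least one culture besides the target")
    return p_bar[target] - mean(others)


# Hypothetical estimates for one sentence over N = 3 runs.
runs = [
    {"JP": 0.9, "US": 0.2, "FR": 0.1},
    {"JP": 0.8, "US": 0.3, "FR": 0.2},
    {"JP": 1.0, "US": 0.1, "FR": 0.0},
]
score = cci(runs, target="JP")  # close to 0.75: general in JP, not elsewhere
```

Under the [0, 1] assumption, the score ranges from -1 to 1. Changing which cultures appear in each run's dictionary is how the comparison set C would be varied in this sketch.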
Experimental Results
Research Questions
- RQ1: Can CCI reliably distinguish culture-specific sentences from general sentences at the sentence level?
- RQ2: Does CCI offer better separability (AUC) than direct baseline scoring for cultural specificity?
- RQ3: How does changing the comparison culture set C affect CCI scores and cultural scope controllability?
- RQ4: Can CCI-based stratification reveal performance gaps as cultural specificity increases?
- RQ5: What is the practical utility of CCI for culture-aware benchmarking and data curation?
Key Findings
- CCI achieves comparable or higher AUC than the baseline and yields clearer separation between culture-specific and general sentences.
- Models with strong reasoning and cross-cultural knowledge (including Japanese-specialized models) show better separability for CCI.
- Custom mode (including neighboring cultures) reduces the median CCI for culture-specific items, indicating controllable cultural scope.
- Higher-CCI items tend to be more challenging for models, with accuracy generally decreasing as CCI increases (JCQA and JCM datasets).
- llm-jp shows a comparatively smaller accuracy drop in high-CCI bins, suggesting that Japanese-trained models have an advantage on culture-specific content.
- CCI provides interpretable per-culture generality scores alongside the target-culture specificity score, enabling culture-aware analysis.
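The two analyses behind these findings, binary separability (AUC) and accuracy stratified by CCI level, can be illustrated with a small Python sketch. It is an assumption-laden example rather than the paper's evaluation code: AUC is computed with the standard pairwise (Mann-Whitney) definition, the bin boundaries are arbitrary, and every score and correctness flag below is invented for illustration.

```python
from bisect import bisect_right
from collections import defaultdict


def auc(scores_pos, scores_neg):
    """Probability that a random positive outranks a random negative.

    Ties count as half a win (pairwise Mann-Whitney formulation).
    """
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


def accuracy_by_cci_bin(items, edges):
    """Group (cci, correct) pairs into bins and return accuracy per bin.

    `edges` are sorted interior cut points; bin i holds scores in
    [edges[i-1], edges[i]), with open-ended first and last bins.
    Empty bins are omitted from the result.
    """
    grouped = defaultdict(list)
    for cci_score, correct in items:
        grouped[bisect_right(edges, cci_score)].append(correct)
    return {b: sum(flags) / len(flags) for b, flags in sorted(grouped.items())}


# Hypothetical CCI scores for culture-specific vs. general sentences.
separability = auc([0.7, 0.5], [0.1, 0.5])

# Hypothetical benchmark items: (CCI score, did the model answer correctly?).
items = [(-0.1, True), (0.2, True), (0.3, False), (0.8, False)]
per_bin = accuracy_by_cci_bin(items, edges=[0.0, 0.5])
```

In this toy data, accuracy falls from the lowest bin to the highest, which mirrors the shape of the trend reported above but says nothing about its actual magnitude.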
This explanation was generated by AI and reviewed by a human editor.