QUICK REVIEW

[Paper Review] Large Language Models are Geographically Biased

Rohin Manvi, Samar Khanna|arXiv (Cornell University)|Feb 5, 2024

Computational and Text Analysis Methods | Citations: 13

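One-Line Summary

This paper shows that LLMs make accurate zero-shot geospatial predictions yet exhibit geographic bias, notably against lower-income regions on sensitive subjective topics. It also introduces a bias score and analyzes how bias varies across models.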

ABSTRACT

Large Language Models (LLMs) inherently carry the biases contained in their training corpora, which can lead to the perpetuation of societal harm. As the impact of these foundation models grows, understanding and evaluating their biases becomes crucial to achieving fairness and accuracy. We propose to study what LLMs know about the world we live in through the lens of geography. This approach is particularly powerful as there is ground truth for the numerous aspects of human life that are meaningfully projected onto geographic space such as culture, race, language, politics, and religion. We show various problematic geographic biases, which we define as systemic errors in geospatial predictions. Initially, we demonstrate that LLMs are capable of making accurate zero-shot geospatial predictions in the form of ratings that show strong monotonic correlation with ground truth (Spearman's $ρ$ of up to 0.89). We then show that LLMs exhibit common biases across a range of objective and subjective topics. In particular, LLMs are clearly biased against locations with lower socioeconomic conditions (e.g. most of Africa) on a variety of sensitive subjective topics such as attractiveness, morality, and intelligence (Spearman's $ρ$ of up to 0.70). Finally, we introduce a bias score to quantify this and find that there is significant variation in the magnitude of bias across existing LLMs. Code is available on the project website: https://rohinmanvi.github.io/GeoLLM

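Motivation and Objectives

Demonstrate the zero-shot geospatial prediction ability of LLMs, using coordinate-based prompts that tie each query to ground-truth geographic data.
Show that LLMs exhibit geographic bias on both objective topics and sensitive subjective topics.
Quantify the magnitude of bias with a metric that combines rank correlation, variation in ratings, and response rate.
Compare bias levels across several popular LLMs (e.g., GPT-4 Turbo, GPT-3.5 Turbo, Gemini Pro, Mixtral, Llama 2).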

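Proposed Method

Elicit per-location ratings on a variety of topics in an instruction-based zero-shot setting, using a prefix and a GeoLLM-style prompt.
Measure monotonic agreement between LLM ratings and ground-truth geospatial data using Spearman's ρ.
Visualize predictions on world maps using rank-based analysis and rank error to reveal systematic bias.
Define a bias score B_y(x), the product of Spearman's ρ, the MAD of the ratings, and the model's response rate, to quantify bias on sensitive subjective topics.
Anchor the metric to a distribution such as Infant Mortality to relate model ratings to a socioeconomic proxy.
Evaluate the added value of using the expected rating (computed from logprobs) instead of the most likely rating in zero-shot prediction.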
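As a rough illustration of the bias score described above, the following sketch multiplies Spearman's ρ (between ratings and an anchoring distribution), the MAD of the ratings, and the response rate. It is not the authors' code. The function names, the use of `None` to mark a refusal, the reading of MAD as mean absolute deviation, and the absence of any rating normalization are all assumptions made for this example.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant input counts as zero correlation
    return cov / (var_x * var_y) ** 0.5


def mean_abs_deviation(values):
    """Assumed reading of MAD: mean absolute deviation from the mean."""
    m = mean(values)
    return mean([abs(v - m) for v in values])


def bias_score(ratings, anchor):
    """Product of rho, MAD, and response rate (hypothetical helper).

    ratings: model ratings per location, None where the model refused.
    anchor:  anchoring values per location (e.g. a socioeconomic proxy).
    """
    answered = [(r, a) for r, a in zip(ratings, anchor) if r is not None]
    if len(answered) < 2:
        return 0.0  # assumption: too few answers to rank
    response_rate = len(answered) / len(ratings)
    rs = [r for r, _ in answered]
    anchors = [a for _, a in answered]
    return spearman_rho(rs, anchors) * mean_abs_deviation(rs) * response_rate
```

Under these assumptions, a model that refuses more often or gives nearly uniform ratings gets a score closer to zero, while ratings that spread widely and track the anchoring distribution push the score up.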

Figure 1: The mean rank plots illustrate agreement across LLM predictions, with areas of green and red highlighting regions consistently rated higher or lower respectively. For objective topics, the maps demonstrate the zero-shot geographic knowledge of LLMs. The sensitive subjective topics reveal a

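Experimental Results

Research Questions

RQ1: Can LLMs make accurate zero-shot geospatial predictions across a range of objective topics?
RQ2: Do LLMs exhibit geographic bias on both objective topics and sensitive subjective topics, and how do those biases differ across models?
RQ3: How can geographic bias in LLM outputs on sensitive topics be quantified, and what factors affect its magnitude?
RQ4: Do different LLMs show different degrees of geographic bias, and can using logprob-based expected values reduce bias?
RQ5: What is the relationship between rating bias and proxies for socioeconomic conditions (e.g., infant mortality)?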

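Key Findings

LLMs achieve strong monotonic correlation with ground-truth geospatial data in zero-shot prediction, with Spearman's ρ reaching up to 0.89 on some topics.
LLMs show consistent geographic biases across objective topics, such as underestimating population density in Africa and India and underestimating infant mortality (or risk-proxy indicators) in Southeast Asia.
On sensitive subjective topics (e.g., attractiveness, morality, intelligence), LLMs are biased against regions with lower socioeconomic conditions, with the correlation with infant survival rate reaching up to 0.70.
The magnitude of bias varies considerably across models; GPT-4 Turbo and Llama 2 70B appear relatively less biased than other models (e.g., Gemini Pro).
Using the expected rating computed from logprobs improves predictive performance and can reveal subtle biases that the most likely rating does not capture.
The proposed bias score B_y(x) combines rank correlation, rating dispersion (MAD), and response rate to quantify geographic bias on sensitive topics.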
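The difference between the most likely rating and the expected rating from logprobs can be sketched as follows. This is a minimal illustration, not the paper's implementation: the input format (a dict from candidate tokens to log-probabilities), the integer rating tokens, and both function names are assumptions, and real APIs expose logprobs in their own structures.

```python
import math


def rating_probabilities(token_logprobs):
    """Map integer ratings to renormalized probabilities.

    token_logprobs: dict of candidate token -> log-probability (assumed format).
    Tokens that do not parse as integers are ignored.
    """
    probs = {}
    for token, logprob in token_logprobs.items():
        try:
            rating = int(token)
        except ValueError:
            continue  # skip non-numeric tokens such as words or punctuation
        probs[rating] = probs.get(rating, 0.0) + math.exp(logprob)
    total = sum(probs.values())
    if total == 0:
        return {}
    # Renormalize so the probabilities of the rating tokens sum to 1.
    return {rating: p / total for rating, p in probs.items()}


def expected_rating(token_logprobs):
    """Probability-weighted mean rating, or None if no rating token exists."""
    probs = rating_probabilities(token_logprobs)
    if not probs:
        return None
    return sum(rating * p for rating, p in probs.items())


def most_likely_rating(token_logprobs):
    """Rating with the highest probability, or None if no rating token exists."""
    probs = rating_probabilities(token_logprobs)
    if not probs:
        return None
    return max(probs, key=probs.get)
```

With this toy setup, two locations whose most likely rating is the same can still differ in expected rating when the remaining probability mass leans higher for one and lower for the other, which is the kind of subtle difference the finding above refers to.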

Figure 3: Zero-shot GPT-4 Turbo comparison with ground truth.

This review was generated by AI and checked by a human editor.