QUICK REVIEW

[Paper Review] Personality testing of Large Language Models: Limited temporal stability, but highlighted prosociality

Bojana Bodroža, Bojana M. Dinić|arXiv (Cornell University)|Jun 7, 2023

Artificial Intelligence in Healthcare and Education | Citations: 10

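One-sentence summary

This paper evaluates the temporal stability and inter-rater agreement of seven LLMs' responses to personality instruments at two time points, finding that agreement varies and that the models show a predominantly prosocial profile.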

ABSTRACT

As Large Language Models (LLMs) continue to gain popularity due to their human-like traits and the intimacy they offer to users, their societal impact inevitably expands. This leads to the rising necessity for comprehensive studies to fully understand LLMs and reveal their potential opportunities, drawbacks, and overall societal impact. With that in mind, this research conducted an extensive investigation into seven LLMs, aiming to assess the temporal stability and inter-rater agreement on their responses on personality instruments in two time points. In addition, LLMs' personality profile was analyzed and compared to human normative data. The findings revealed varying levels of inter-rater agreement in the LLMs' responses over a short time, with some LLMs showing higher agreement (e.g., Llama3 and GPT-4o) compared to others (e.g., GPT-4 and Gemini). Furthermore, agreement depended on used instruments as well as on domain or trait. This implies the variable robustness in LLMs' ability to reliably simulate stable personality characteristics. In the case of scales which showed at least fair agreement, LLMs displayed mostly a socially desirable profile in both agentic and communal domains, as well as a prosocial personality profile reflected in higher agreeableness and conscientiousness and lower Machiavellianism. Exhibiting temporal stability and coherent responses on personality traits is crucial for AI systems due to their societal impact and AI safety concerns.

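Motivation and objectives

Understand how stable LLM personality assessments are across time and across instruments.
Evaluate inter-rater agreement on LLM responses to personality tests.
Compare LLM personality profiles with human normative data.
Identify which models and instruments yield more reliable personality-trait signals.
Highlight what consistent personality simulation implies for AI safety and societal impact.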

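Proposed method

Administer standardized personality instruments to seven LLMs at two time points.
Measure inter-rater agreement on the responses.
Analyze how agreement varies by instrument and by trait domain.
Compare the LLM-derived personality profiles with human normative data.
Assess the temporal stability of the responses and their coherence across traits.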
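
The agreement step above can be illustrated with a small sketch. This summary does not say which agreement statistic the authors used, so the unweighted Cohen's kappa below is an assumption chosen for illustration, and the function name `cohen_kappa` and the Likert answers are hypothetical rather than taken from the paper.

```python
from collections import Counter


def cohen_kappa(first, second):
    """Unweighted Cohen's kappa for two equal-length lists of categorical ratings."""
    if len(first) != len(second) or not first:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(first)
    # Share of items that received the same answer at both time points.
    observed = sum(a == b for a, b in zip(first, second)) / n
    # Agreement expected by chance, from each list's category frequencies.
    counts_first = Counter(first)
    counts_second = Counter(second)
    expected = sum(counts_first[c] * counts_second[c] for c in counts_first) / (n * n)
    if expected == 1.0:
        return 1.0  # both lists consist of one identical category
    return (observed - expected) / (1.0 - expected)


# Hypothetical 1-5 Likert answers from one model to the same ten items,
# collected at two time points (illustrative values, not data from the paper).
time_1 = [4, 5, 3, 4, 2, 5, 4, 3, 4, 5]
time_2 = [4, 5, 3, 3, 2, 5, 4, 4, 4, 5]

kappa = cohen_kappa(time_1, time_2)
print(f"kappa = {kappa:.2f}")
```

Because kappa corrects raw agreement for chance, two runs that match on 8 of 10 items score lower than 0.8 here. For ordinal Likert items a weighted kappa or an intraclass correlation may be a better fit. Which of these the paper reports is not stated in this summary.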

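Experimental results

Research questions

RQ1: Do LLMs show temporal stability in their personality-test responses across two time points?
RQ2: How much agreement is there between different raters when evaluating LLM responses?
RQ3: Does agreement differ by instrument or by trait domain (agentic vs. communal)?
RQ4: Are LLM personality profiles socially desirable, and how do they compare with human norms?
RQ5: Which models show higher stability and agreement in personality assessment?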

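Key findings

Inter-rater agreement varies over a short interval, and some models (e.g., Llama3 and GPT-4o) show higher agreement than others (e.g., GPT-4 and Gemini).
Agreement depends on the instrument used and on the personality domain or trait.
On scales that show at least fair agreement, LLMs lean toward a socially desirable profile in both the agentic and communal domains.
LLMs show a prosocial personality profile, with higher agreeableness and conscientiousness and lower Machiavellianism.
Temporal stability is limited, which raises AI-safety and societal-impact concerns about how consistently LLMs simulate personality.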
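
One simple way to read "higher" or "lower" relative to human norms is a standardized difference. The sketch below is only an illustration under stated assumptions: the helper `norm_z_score`, the scale scores, and the normative mean and standard deviation are all hypothetical, and the summary does not describe how the authors actually performed the comparison.

```python
from statistics import mean


def norm_z_score(model_scores, norm_mean, norm_sd):
    """Standardized distance of a model's mean scale score from a human norm."""
    if norm_sd <= 0:
        raise ValueError("norm_sd must be positive")
    return (mean(model_scores) - norm_mean) / norm_sd


# Hypothetical agreeableness scale scores from repeated administrations,
# plus a hypothetical human normative mean and standard deviation.
agreeableness_scores = [4.5, 4.0, 4.5, 5.0]
z = norm_z_score(agreeableness_scores, norm_mean=3.5, norm_sd=0.5)
print(f"z = {z:.1f}")
```

A positive value places the model above the normative mean on that trait, and a negative value places it below.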


This review was generated by AI and checked by a human editor.