QUICK REVIEW

[Paper Review] Revisiting the Reliability of Psychological Scales on Large Language Models

Jen-tse Huang, Wenxiang Jiao|arXiv (Cornell University)|May 31, 2023

Topic Modeling|Cited by 10

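One-Sentence Summary

This paper analyzes the reliability of human psychological scales, in particular the Big Five Inventory, on GPT-3.5-turbo across 2,500 different settings, and shows that prompt adjustments can elicit a diverse range of personalities.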

ABSTRACT

Recent research has focused on examining Large Language Models' (LLMs) characteristics from a psychological standpoint, acknowledging the necessity of understanding their behavioral characteristics. The administration of personality tests to LLMs has emerged as a noteworthy area in this context. However, the suitability of employing psychological scales, initially devised for humans, on LLMs is a matter of ongoing debate. Our study aims to determine the reliability of applying personality assessments to LLMs, explicitly investigating whether LLMs demonstrate consistent personality traits. Analysis of 2,500 settings per model, including GPT-3.5, GPT-4, Gemini-Pro, and LLaMA-3.1, reveals that various LLMs show consistency in responses to the Big Five Inventory, indicating a satisfactory level of reliability. Furthermore, our research explores the potential of GPT-3.5 to emulate diverse personalities and represent various groups, a capability increasingly sought after in social sciences for substituting human participants with LLMs to reduce costs. Our findings reveal that LLMs have the potential to represent different personalities with specific prompt instructions.

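Motivation and Objectives

Assess the reliability of psychological scales designed for humans when applied to LLMs.
Determine whether LLMs exhibit consistent personality traits across different prompts and contexts.
Investigate whether instructions, items, language, and format affect LLM personality measurement.
Explore whether LLMs can represent diverse human populations through prompt-driven personalization.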

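Proposed Method

Build a framework that varies five factors (instruction, items, language, choice labels, choice order) to generate 2,500 configurations of the Big Five Inventory for an LLM.
Use gpt-3.5-turbo with temperature 0 and collect the five OCEAN dimension scores for each setting.
Rephrase the items with GPT-4 and translate them into nine additional languages to test cross-lingual reliability.
Evaluate internal consistency and test-retest reliability through repeated prompting over time (data collected once every two weeks).
Analyze distributions, outliers, and deviations from human norms to assess reliability and variability.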
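
The five-factor grid described above can be sketched as a Cartesian product. This is a minimal illustration, not the authors' code. The level counts are assumptions: ten languages follows from the original plus nine translations, two choice orders is assumed (ascending and descending), and five levels each for instruction, items, and choice labels is one factorization consistent with the stated total of 2,500. All level names are hypothetical placeholders.

```python
from itertools import product

# Hypothetical factor levels; only the factor names and the total of
# 2,500 come from the summary. The per-factor counts are assumptions.
FACTORS = {
    "instruction": [f"instruction_{i}" for i in range(5)],
    "items": [f"items_{i}" for i in range(5)],
    "language": [f"language_{i}" for i in range(10)],
    "choice_labels": [f"labels_{i}" for i in range(5)],
    "choice_order": ["ascending", "descending"],
}


def build_configurations(factors):
    """Return one dict per combination of factor levels."""
    names = list(factors)
    return [dict(zip(names, combo)) for combo in product(*factors.values())]


configs = build_configurations(FACTORS)
print(len(configs))  # 5 * 5 * 10 * 5 * 2 = 2500
```

Each resulting dict would then drive one administration of the questionnaire, so every score can be traced back to the exact setting that produced it.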

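Experimental Results

Research Questions

RQ1: When psychological scales are applied to LLMs under diverse input conditions, do they yield stable, reliable personality measurements?
RQ2: Can LLMs meaningfully simulate diverse human personalities through prompt manipulation?
RQ3: How do language, item paraphrasing, and choice format affect LLM personality scores?
RQ4: Is there evidence of consistent Big Five personality traits in GPT-3.5-turbo across time and settings?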

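Key Findings

GPT-3.5-turbo shows satisfactory reliability on the Big Five Inventory across a variety of prompts and settings.
Most factor variations produce no significant difference in means; only a small number of comparisons differ by more than 0.15.
The model's standard deviations on the OCEAN dimensions are smaller than typical human-sample baselines, indicating more deterministic responses.
Outliers cluster when Arabic numerals, descending order, or certain languages (Arabic, Chinese) are used, suggesting possible variation in comprehension.
Three ways of influencing personality (environment, assigned personality, embodying a character) can shift the distribution, with shaping through character embodiment being the most effective.
Character assignment can broaden the spectrum of represented personalities, although the distributions for hero characters resemble the default case because of a positivity bias.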
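
The mean-difference finding above can be checked with a simple per-level comparison. This is a rough sketch under stated assumptions: it treats a "comparison" as the absolute difference between the mean scores of two levels of one factor, which may not match the authors' exact procedure. The function names are hypothetical, and the toy scores are made up for illustration rather than taken from the paper.

```python
from collections import defaultdict
from itertools import combinations
from statistics import mean

THRESHOLD = 0.15  # difference size mentioned in the findings


def level_means(records):
    """Average score per factor level; records are (level, score) pairs."""
    grouped = defaultdict(list)
    for level, score in records:
        grouped[level].append(score)
    return {level: mean(scores) for level, scores in grouped.items()}


def large_differences(records, threshold=THRESHOLD):
    """Return (level_a, level_b, gap) for pairs whose means differ by more than threshold."""
    means = level_means(records)
    flagged = []
    for a, b in combinations(sorted(means), 2):
        gap = abs(means[a] - means[b])
        if gap > threshold:
            flagged.append((a, b, gap))
    return flagged


# Made-up illustrative scores for one trait, NOT data from the paper.
toy_records = [
    ("ascending", 4.0),
    ("ascending", 4.2),
    ("descending", 3.8),
    ("descending", 3.9),
]
flagged = large_differences(toy_records)
print(flagged)
```

Running the same check once per factor and per OCEAN dimension would show which settings, if any, shift the measured trait beyond the threshold.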


This review was generated by AI and reviewed by human editors.