[Paper Review] Limited Ability of LLMs to Simulate Human Psychological Behaviours: a Psychometric Analysis
The study psychometrically evaluates GPT-3.5 and GPT-4’s ability to simulate human personality traits using generic and silicon personas; GPT-4 shows some psychometric promise with generic prompts, but both models fail to reliably simulate latent traits, especially with silicon personas.
The humanlike responses of large language models (LLMs) have prompted social scientists to investigate whether LLMs can be used to simulate human participants in experiments, opinion polls and surveys. Of central interest in this line of research has been mapping out the psychological profiles of LLMs by prompting them to respond to standardized questionnaires. The conflicting findings of this research are unsurprising given that mapping out underlying, or latent, traits from LLMs' text responses to questionnaires is no easy task. To address this, we use psychometrics, the science of psychological measurement. In this study, we prompt OpenAI's flagship models, GPT-3.5 and GPT-4, to assume different personas and respond to a range of standardized measures of personality constructs. We used two kinds of persona descriptions: either generic (four or five random person descriptions) or specific (mostly demographics of actual humans from a large-scale human dataset). We found that the responses from GPT-4, but not GPT-3.5, using generic persona descriptions show promising, albeit not perfect, psychometric properties, similar to human norms, but the data from both LLMs, when using specific demographic profiles, show poor psychometric properties. We conclude that, currently, when LLMs are asked to simulate silicon personas, their responses are poor signals of potentially underlying latent traits. Thus, our work casts doubt on LLMs' ability to simulate individual-level human behaviour across multiple-choice question answering tasks.
Motivation & Objective
- Assess whether GPT-3.5 and GPT-4 can simulate human psychological profiles using standardized measures.
- Evaluate reliability and validity of LLM responses under generic versus silicon prompting.
- Compare LLM responses to a large human baseline dataset for personality and related constructs.
Proposed method
- Prompt two OpenAI models (GPT-3.5 and GPT-4) with two persona types: generic (random short descriptions) and silicon (demographic-based).
- Administer a 104-item battery, including the Big Five Inventory and eight related personality measures, across 239,200 prompts.
- Process text responses to extract numeric item answers by taking the first digit within a token-limited response.
- Evaluate reliability with Cronbach’s alpha and related indices; assess construct validity via inter-factor correlations and criterion validity correlations; perform confirmatory factor analyses.
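The response-parsing and reliability steps above can be illustrated with a minimal sketch. It is not the authors' code: the function names `first_digit` and `cronbach_alpha` are hypothetical, and the 1 to 5 response range is an assumption, since the review does not state the scale of each measure. Cronbach's alpha is computed with the standard formula, alpha = k/(k-1) * (1 - sum of item variances / variance of total scores).

```python
import re
from statistics import variance


def first_digit(response, low=1, high=5):
    """Return the first digit found in a text response as an int.

    Returns None when no digit is present or when the digit falls outside
    the assumed response range (low..high is an assumption, not taken
    from the paper).
    """
    match = re.search(r"\d", response)
    if match is None:
        return None
    value = int(match.group())
    return value if low <= value <= high else None


def cronbach_alpha(scores):
    """Cronbach's alpha for one subscale.

    scores: list of respondents, each a list of k numeric item scores
    (reverse-keyed items are assumed to be recoded already).
    """
    k = len(scores[0])
    if k < 2 or len(scores) < 2:
        raise ValueError("need at least two items and two respondents")
    # Variance of each item across respondents.
    item_vars = [variance([row[j] for row in scores]) for j in range(k)]
    # Variance of the summed scale score across respondents.
    total_var = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


if __name__ == "__main__":
    raw = ["4 - Agree", "I would say 3.", "no number here"]
    print([first_digit(r) for r in raw])
    print(cronbach_alpha([[1, 2], [2, 1], [3, 3]]))
```

Taking only the first digit is simple, but it silently misreads answers such as "Item 12: 4", so a real pipeline would likely need to log and inspect unparsed or suspicious responses.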
Experimental results
Research questions
- RQ1: Can GPT-3.5 and GPT-4 produce reliable and valid Big Five and related trait measures under generic and silicon prompting?
- RQ2: Do silicon personas yield psychometrically sound representations of latent traits compared to generic personas?
- RQ3: How do LLM-based trait profiles compare to a large human baseline in terms of reliability, validity, and factor structure?
Key findings
- GPT-4 with generic personas shows acceptable internal consistency for most subscales (α ≥ .70), whereas GPT-3.5 falls short on some subscales; silicon personas produce low reliability for both models.
- LLMs tend to show higher intercorrelations among Big Five traits than humans, indicating reduced discriminant validity, especially with generic prompts; silicon prompts show more ambiguity.
- Criterion validity for generic-prompt data is stronger, with GPT-4 performing better than GPT-3.5, while silicon-prompt data exhibit significantly weaker correlations with external criteria.
- Confirmatory factor analyses reveal poor structural validity for LLM data; Big Five structure is not reliably recoverable, notably with silicon prompting and/or GPT-4 generic prompting.
- Trait bias analyses show that GPT-4 is generally similar to GPT-3.5 in average bias, with small but significant differences for Agreeableness; bias relates to certain personality features rather than demographics.
- Across models, GPT-4 tends to outperform GPT-3.5 on some psychometric properties, but neither model reliably mimics latent human traits across tasks.
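The discriminant-validity finding above (higher intercorrelations among Big Five traits in LLM data than in human data) can be made concrete with a small sketch. The summary index used here, the mean absolute Pearson correlation across all pairs of scale scores, is an illustrative choice and not necessarily the statistic reported in the paper; the function names and the toy scores are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def mean_abs_intercorrelation(scale_scores):
    """Average |r| over all pairs of scales.

    scale_scores: dict mapping a scale name to a list of per-respondent
    scale scores. Higher values suggest the scales are less distinct
    from one another.
    """
    pairs = list(combinations(scale_scores, 2))
    total = sum(abs(pearson(scale_scores[a], scale_scores[b])) for a, b in pairs)
    return total / len(pairs)


if __name__ == "__main__":
    # Toy data only; not taken from the paper.
    toy = {
        "Extraversion": [1, 2, 3, 4],
        "Agreeableness": [2, 4, 6, 8],
        "Neuroticism": [4, 3, 2, 1],
    }
    print(mean_abs_intercorrelation(toy))
```

Computing this index separately for the human baseline and for each model and prompt type would give a rough side-by-side comparison, although it does not replace the confirmatory factor analyses described in the review.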
This review was created by AI and reviewed by human editors.