QUICK REVIEW

[Paper Review] AI-Augmented Surveys: Leveraging Large Language Models and Surveys for Opinion Prediction

J. W. Kim, Byungkyu Lee | arXiv (Cornell University) | May 16, 2023

Computational and Text Analysis Methods | 32 citations

TL;DR

The paper fine-tunes open LLMs on the General Social Survey to personalize opinion prediction, enabling imputation, retrodiction, and unasked-opinion prediction via embeddings for questions, beliefs, and time, with population-level aggregation using survey weights.

ABSTRACT

Large language models (LLMs) that produce human-like responses have begun to revolutionize research practices in the social sciences. We develop a novel methodological framework that fine-tunes LLMs with repeated cross-sectional surveys to incorporate the meaning of survey questions, individual beliefs, and temporal contexts for opinion prediction. We introduce two new emerging applications of the AI-augmented survey: retrodiction (i.e., predict year-level missing responses) and unasked opinion prediction (i.e., predict entirely missing responses). Among 3,110 binarized opinions from 68,846 Americans in the General Social Survey from 1972 to 2021, our models based on Alpaca-7b excel in retrodiction (AUC = 0.86 for personal opinion prediction, $ρ$ = 0.98 for public opinion prediction). These remarkable prediction capabilities allow us to fill in missing trends with high confidence and pinpoint when public attitudes changed, such as the rising support for same-sex marriage. On the other hand, our fine-tuned Alpaca-7b models show modest success in unasked opinion prediction (AUC = 0.73, $ρ$ = 0.67). We discuss practical constraints and ethical concerns regarding individual autonomy and privacy when using LLMs for opinion prediction. Our study demonstrates that LLMs and surveys can mutually enhance each other's capabilities: LLMs can broaden survey potential, while surveys can improve the alignment of LLMs.

Motivation & Objective

Motivate the need to predict unmeasured public opinion in repeated cross-sectional surveys like the GSS.
Propose a framework that personalizes LLMs using question semantics, individual belief embeddings, and temporal context embeddings.
Demonstrate that fine-tuned LLMs can predict missing or unasked survey responses and aggregate results representatively with survey weights.
Contrast the approach with vanilla LLMs and traditional imputation methods to show improved predictive accuracy across missing data scenarios.

Proposed method

Fine-tune open-source LLMs (Alpaca-7b, GPT-J-6b, RoBERTa-large) on 3,110 binarized GSS questions from 1972–2021 for 68,846 individuals.
Represent each prediction with three embeddings: semantic embedding of the survey question, individual belief embedding, and temporal period embedding.
Use a Deep Cross Network (DCN) architecture to model higher-order interactions among embeddings and predict binary responses.
Iteratively optimize question semantics, individual beliefs, and period embeddings during fine-tuning to align LLM outputs with observed response patterns.
Aggregate individual predictions to population level using survey weights to correct for sample bias.
Evaluate with 10-fold cross-validation across three missing-data tasks (imputation, retrodiction, unasked opinion) using AUC, accuracy, and F1-score.
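
The interaction step in the list above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the three embeddings are concatenated into one vector and passed through cross layers of the original Deep Cross Network form, x_next = x0 * (x_l . w) + b + x_l, followed by a logistic output. The function names, embedding sizes, and parameter values are hypothetical, and the review does not state which DCN variant the paper uses.

```python
import math


def cross_layer(x0, xl, w, b):
    """One cross layer (assumed DCN form): x0 * (xl . w) + b + xl."""
    # The dot product collapses the current layer to a scalar, which then
    # rescales the input vector x0; adding xl back is the residual connection.
    scalar = sum(xi * wi for xi, wi in zip(xl, w))
    return [x0_i * scalar + b_i + xl_i for x0_i, b_i, xl_i in zip(x0, b, xl)]


def predict_probability(question_emb, belief_emb, period_emb,
                        cross_params, out_w, out_b):
    """Probability of a positive (binarized) response for one person-question-year."""
    # Assumption: the three embeddings are simply concatenated.
    x0 = list(question_emb) + list(belief_emb) + list(period_emb)
    x = x0
    for w, b in cross_params:  # each entry is one cross layer's (weights, bias)
        x = cross_layer(x0, x, w, b)
    logit = sum(xi * wi for xi, wi in zip(x, out_w)) + out_b
    return 1.0 / (1.0 + math.exp(-logit))


if __name__ == "__main__":
    # Toy values only; real embeddings would be learned during fine-tuning.
    layers = [([0.1, -0.2, 0.3], [0.0, 0.0, 0.0])]
    print(predict_probability([0.5], [1.0], [-0.5], layers, [0.2, 0.4, -0.1], 0.0))
```

Each additional cross layer raises the degree of the feature interactions by one, which is why stacking them is a common way to capture higher-order interactions among the question, belief, and period vectors.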

Experimental results

Research questions

RQ1: Can fine-tuned LLMs accurately predict individual survey responses for unmeasured questions in a nationally representative repeated cross-sectional survey?
RQ2: Do embeddings of survey questions, individual beliefs, and time periods improve prediction over standard LLM prompts or traditional imputation?
RQ3: How does the approach perform across missing data scenarios (imputation, retrodiction, unasked opinions) and under different missing data mechanisms (MCAR, MAR, MNAR)?
RQ4: Is population-level aggregation via survey weights sufficient to recover representative public opinion from personalized predictions?
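
The three missing-data scenarios differ mainly in which responses are held out. The sketch below shows one way to build the corresponding train/test splits, assuming imputation hides scattered individual responses, retrodiction hides every response to a question in a given year, and unasked opinion prediction hides a question in all years. The `Response` record and the function names are hypothetical, and the paper's actual cross-validation procedure may differ.

```python
import random
from typing import NamedTuple


class Response(NamedTuple):
    respondent: str
    question: str
    year: int
    answer: int  # binarized response, 0 or 1


def split_imputation(records, test_fraction, seed=0):
    """Hold out a random share of individual responses."""
    rng = random.Random(seed)
    train, test = [], []
    for rec in records:
        (test if rng.random() < test_fraction else train).append(rec)
    return train, test


def split_retrodiction(records, held_out_question_years):
    """Hold out all responses to a question in specific years."""
    test = [r for r in records if (r.question, r.year) in held_out_question_years]
    train = [r for r in records if (r.question, r.year) not in held_out_question_years]
    return train, test


def split_unasked(records, held_out_questions):
    """Hold out a question entirely, across all years."""
    test = [r for r in records if r.question in held_out_questions]
    train = [r for r in records if r.question not in held_out_questions]
    return train, test


if __name__ == "__main__":
    data = [
        Response("r1", "q1", 2000, 1),
        Response("r2", "q1", 2002, 0),
        Response("r1", "q2", 2000, 1),
        Response("r2", "q2", 2002, 0),
    ]
    train, test = split_retrodiction(data, {("q1", 2002)})
    print(len(train), len(test))
```

Under these assumptions the unasked-opinion split is the hardest case, because the model never sees any response to the held-out question and must rely on the question's semantic embedding alone.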

Key findings

Among the tested LLMs, Alpaca-7b was the best-performing model across all three prediction tasks.
For missing data imputation, the best model achieved strong predictive accuracy (AUC around 0.87), outperforming a matrix factorization baseline under various missing data mechanisms.
The approach maintains superior performance relative to matrix factorization even when data are not missing at random (MNAR).
Personalized embeddings for individual beliefs and survey-period context enable the model to capture heterogeneity and temporal change in opinions, improving prediction relative to non-personalized baselines.
The framework enables retrodiction of year-level missing opinions, allowing reconstruction of historical attitude trends and potential shifts in public attitudes (e.g., same-sex marriage).
Model evaluation used 10-fold cross-validation with multiple metrics (AUC, Accuracy, F1) and population-level predictions via survey weighting.
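
The population-level step mentioned above can be illustrated with a weighted mean of individual predicted probabilities. This is a minimal sketch under the assumption that each respondent carries one survey weight and that public opinion for a question in a year is the weight-normalized average of predictions; the function names and the row layout are hypothetical.

```python
from collections import defaultdict


def weighted_share(probabilities, weights):
    """Survey-weighted mean of predicted probabilities."""
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("survey weights must sum to a positive value")
    return sum(p * w for p, w in zip(probabilities, weights)) / total_weight


def aggregate_by_question_year(rows):
    """rows: iterable of (question, year, probability, weight) tuples."""
    grouped = defaultdict(lambda: ([], []))
    for question, year, prob, weight in rows:
        probs, weights = grouped[(question, year)]
        probs.append(prob)
        weights.append(weight)
    return {key: weighted_share(p, w) for key, (p, w) in grouped.items()}


if __name__ == "__main__":
    rows = [
        ("q", 2000, 1.0, 3.0),
        ("q", 2000, 0.0, 1.0),
        ("q", 2002, 0.5, 2.0),
    ]
    print(aggregate_by_question_year(rows))
```

Comparing these weighted shares with the observed weighted shares for each question and year is one way to evaluate public opinion prediction, alongside the individual-level metrics.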

This review was created by AI and reviewed by human editors.