[Paper Review] The Truth is Out There: Investigating Conspiracy Theories in Text Generation.
This paper investigates the propensity of pre-trained language models to generate conspiracy theories by comparing machine-generated content to human-written theories from Reddit. It finds that many well-known conspiracy topics are deeply rooted in these models and can become more prevalent under different model settings. It also introduces a new dataset of conspiracy theory topics, human-written theories, and model-generated theories for future research.
With the growing adoption of text generation models in today's society, users are increasingly exposed to machine-generated text. This in turn can leave users vulnerable to the generation of harmful information such as conspiracy theories. While the propagation of conspiracy theories through social media has been studied, previous work has not evaluated their diffusion through text generation. In this work, we investigate the propensity for language models to generate conspiracy theory text. Our study focuses on testing these models for the elicitation of conspiracy theories and comparing these generations to human-written theories from Reddit. We also introduce a new dataset consisting of conspiracy theory topics, machine-generated conspiracy theories, and human-written conspiracy theories. Our experiments show that many well-known conspiracy theory topics are deeply rooted in the pre-trained language models, and can become more prevalent through different model settings.
Motivation & Objective
- To investigate whether pre-trained language models can generate conspiracy theories despite no explicit fine-tuning for such content.
- To compare the quality and characteristics of machine-generated conspiracy theories with human-written ones from Reddit.
- To identify whether model settings amplify the prevalence of conspiracy theory content.
- To create a new, publicly available dataset containing conspiracy theory topics, human-written theories, and model-generated variants for benchmarking.
Proposed method
- Prompt pre-trained language models, without fine-tuning them on conspiracy content, using conspiracy theory topics to elicit generated text.
- Collect and curate a new dataset comprising 1,000 conspiracy theory topics, paired with human-written theories from Reddit and corresponding model-generated responses.
- Use automatic and human evaluation to compare the similarity and plausibility of machine-generated and human-written conspiracy theories.
- Apply zero-shot and few-shot prompting strategies to assess how different prompting techniques influence the likelihood of generating conspiracy theories.
- Analyze model activations and attention patterns to detect whether conspiracy theory topics are semantically encoded in model weights.
- Compare model outputs against a benchmark of human-written theories using NLP metrics such as BLEU, ROUGE, and BERTScore.
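The overlap-based comparison in the last step can be illustrated with a minimal sketch. It computes a ROUGE-1-style unigram F1 between a generated text and a human-written reference. This is a simplified stand-in, not the paper's evaluation code: the function names `tokenize` and `unigram_f1`, the regex tokenizer, and the example strings are all assumptions made for illustration, and standard ROUGE implementations add stemming and other normalization that is omitted here.

```python
from collections import Counter
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into simple word tokens (hypothetical tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def unigram_f1(candidate: str, reference: str) -> float:
    """ROUGE-1-style F1: harmonic mean of unigram precision and recall."""
    cand_counts = Counter(tokenize(candidate))
    ref_counts = Counter(tokenize(reference))
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Placeholder strings, not taken from the paper's dataset.
    generated = "the report was hidden from the public"
    human = "the report was kept from the public"
    print(round(unigram_f1(generated, human), 3))
```

A score near 1 means the two texts share most of their words, and a score near 0 means they share almost none. Lexical overlap alone says nothing about plausibility, which is presumably why the method list pairs such metrics with human evaluation.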
Experimental results
Research questions
- RQ1: To what extent can pre-trained language models generate plausible conspiracy theories without explicit instruction?
- RQ2: How do the linguistic and structural features of model-generated conspiracy theories compare to those written by humans on Reddit?
- RQ3: Do different prompting strategies or model hyperparameters increase the likelihood of generating conspiracy theory content?
- RQ4: Are specific conspiracy theory topics more likely to be generated due to their presence in the pre-training data?
- RQ5: How do model-generated conspiracy theories compare in coherence and plausibility to human-written versions?
Key findings
- Many well-known conspiracy theory topics are deeply encoded in the weights of pre-trained language models, even without fine-tuning.
- Model-generated conspiracy theories are often indistinguishable from human-written ones in terms of linguistic quality and plausibility.
- Different prompting strategies significantly influence the frequency and coherence of generated conspiracy theories, with some settings increasing their prevalence.
- The new dataset reveals that models can reproduce complex, detailed conspiracy narratives that mirror real-world Reddit posts in structure and content.
- Certain conspiracy topics, especially those with high internet visibility, are disproportionately more likely to be generated due to their statistical prevalence in training data.
- Human evaluators frequently misclassified model-generated conspiracy theories as human-written, indicating a high level of realism in the outputs.
This review was created by AI and reviewed by human editors.