QUICK REVIEW

[Paper Review] Mapping of Subjective Accounts into Interpreted Clusters (MOSAIC): Topic Modelling and LLM applied to Stroboscopic Phenomenology

Romy Beauté, David J. Schwartzman|ArXiv.org|Feb 25, 2025

Advanced Text Analysis Techniques|Cited by 3

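ONE-SENTENCE SUMMARY

The paper presents MOSAIC, an open-source NLP pipeline that uses BERTopic topic modelling and an LLM to automatically label topics in open-ended Dreamachine reports, uncovering latent experiential themes in stroboscopic phenomenology.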

ABSTRACT

Stroboscopic light stimulation (SLS) on closed eyes typically induces simple visual hallucinations (VHs), characterised by vivid, geometric and colourful patterns. A dataset of 862 sentences, extracted from 422 open subjective reports, was recently compiled as part of the Dreamachine programme (Collective Act, 2022), an immersive multisensory experience that combines SLS and spatial sound in a collective setting. Although open reports extend the range of reportable phenomenology, their analysis presents significant challenges, particularly in systematically identifying patterns. To address this challenge, we implemented a data-driven approach leveraging Large Language Models and Topic Modelling to uncover and interpret latent experiential topics directly from the Dreamachine's text-based reports. Our analysis confirmed the presence of simple VHs typically documented in scientific studies of SLS, while also revealing experiences of altered states of consciousness and complex hallucinations. Building on these findings, our computational approach expands the systematic study of subjective experience by enabling data-driven analyses of open-ended phenomenological reports, capturing experiences not readily identified through standard questionnaires. By revealing rich and multifaceted aspects of experiences, our study broadens our understanding of stroboscopically-induced phenomena while highlighting the potential of Natural Language Processing and Large Language Models in the emerging field of computational (neuro)phenomenology. More generally, this approach provides a practically applicable methodology for uncovering subtle hidden patterns of subjective experience across diverse research domains.

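RESEARCH MOTIVATION AND OBJECTIVES

Advance data-driven analysis of open-ended subjective reports beyond what predefined questionnaires capture.
Characterise the full spectrum of stroboscopic phenomenology in the Dreamachine dataset.
Develop and document an open-source NLP pipeline for analysing phenomenological text.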

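PROPOSED METHOD

Split reports into sentences to create fine-grained inputs for embedding.
Encode the text as 768-dimensional embeddings with a pretrained SBERT model.
Reduce dimensionality with UMAP in preparation for clustering.
Cluster with HDBSCAN to identify experiential topics without fixing the number of topics in advance.
Label topics automatically from keywords and excerpts, combining c-TF-IDF with Llama-3-8B-Instruct.
Provide an end-to-end open-source workflow from preprocessing to label generation.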
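
The keyword and labelling steps above can be sketched in plain Python. This is a minimal illustration, not the authors' code: it assumes the embedding, UMAP and HDBSCAN steps have already produced one cluster label per sentence, and it follows the c-TF-IDF weighting as documented for BERTopic (term frequency within a cluster, scaled by log(1 + A / f_t), where A is the average number of words per cluster and f_t is the term's frequency across all clusters). The sentence splitter, the toy sentences, the function names and the prompt wording are all hypothetical.

```python
import math
import re
from collections import Counter


def split_sentences(report: str) -> list[str]:
    """Naive splitter on ., ! or ? (assumption: the real tokenizer may differ)."""
    return [p for p in re.split(r"(?<=[.!?])\s+", report.strip()) if p]


def tokenize(text: str) -> list[str]:
    """Lowercase word tokens; no stop-word removal in this sketch."""
    return re.findall(r"[a-z']+", text.lower())


def c_tf_idf(sentences: list[str], labels: list[int]) -> dict[int, dict[str, float]]:
    """Class-based TF-IDF: each cluster is treated as one long document."""
    per_cluster: dict[int, Counter] = {}
    for sentence, label in zip(sentences, labels):
        if label == -1:  # HDBSCAN marks noise points with -1; skip them
            continue
        per_cluster.setdefault(label, Counter()).update(tokenize(sentence))

    total = Counter()  # term frequency across all clusters
    for counts in per_cluster.values():
        total.update(counts)
    avg_words = sum(total.values()) / len(per_cluster)

    weights: dict[int, dict[str, float]] = {}
    for label, counts in per_cluster.items():
        n_words = sum(counts.values())
        weights[label] = {
            term: (count / n_words) * math.log(1 + avg_words / total[term])
            for term, count in counts.items()
        }
    return weights


def top_keywords(weights: dict[int, dict[str, float]], k: int = 2) -> dict[int, list[str]]:
    """Top-k terms per cluster; ties are broken alphabetically."""
    return {
        label: [t for t, _ in sorted(w.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]
        for label, w in weights.items()
    }


def build_label_prompt(keywords: list[str], excerpts: list[str]) -> str:
    """Assemble a labelling prompt (hypothetical wording, not the paper's prompt)."""
    lines = [
        "You are labelling a topic found in open-ended experience reports.",
        "Keywords: " + ", ".join(keywords),
        "Representative excerpts:",
    ]
    lines += [f"- {e}" for e in excerpts]
    lines.append("Reply with a short, descriptive topic label.")
    return "\n".join(lines)


# Toy data invented for illustration; not taken from the Dreamachine dataset.
sentences = [
    "I saw bright geometric patterns",
    "Colourful geometric patterns kept moving",
    "I felt deeply calm and peaceful",
    "A calm, peaceful feeling washed over me",
    "It was loud",
]
labels = [0, 0, 1, 1, -1]  # assumed output of the clustering step

weights = c_tf_idf(sentences, labels)
keywords = top_keywords(weights, k=2)
prompt = build_label_prompt(
    keywords[0], [s for s, l in zip(sentences, labels) if l == 0]
)
print(keywords)
print(prompt)
```

The resulting prompt string would then be sent to an instruction-tuned model such as Llama-3-8B-Instruct; that call is omitted here because it depends on how the model is served. A term that appears in several clusters (here "i") receives a lower weight than a term of equal frequency that is specific to one cluster, which is the behaviour that makes c-TF-IDF keywords useful as topic descriptors.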

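EXPERIMENTAL RESULTS

Research Questions

RQ1: What latent experiential topics emerge from open-ended Dreamachine reports?
RQ2: How does topic structure differ between the High Sensory (HS) and Deep Listening (DL) Dreamachine conditions?
RQ3: Can automated LLM labelling produce reliable, interpretable topic descriptions without introducing researcher bias?
RQ4: Which coherence and clustering characteristics best capture the structure of subjective Dreamachine phenomenology?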

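Key Findings

The HS analysis yielded 13 experiential topics, labelled automatically by Llama 3, spanning visual phenomena, altered states, and autobiographical experiences.
The DL analysis yielded 7 experiential topics, also labelled by Llama 3, including Dream Imagery and Dissociative Experiences.
Hierarchical clustering revealed three main HS phenomenological groups: visual experiences, altered states, and memory-spiritual/autobiographical themes.
The topic models scored 0.56 (HS, 14 topics) and 0.57 (DL, 8 topics) on coherence, indicating acceptable topic quality.
The MOSAIC pipeline is implemented as an open-source workflow with reproducible preprocessing, embedding, clustering, and labelling steps.
The approach demonstrates the potential of data-driven analysis to go beyond standard questionnaires and to apply to diverse kinds of subjective experience.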

This review was generated by AI and reviewed by human editors.