QUICK REVIEW

[Paper review] Large language models predict human sensory judgments across six modalities

Raja Marjieh, Ilia Sucholutsky | arXiv (Cornell University) | Feb 2, 2023

Categorization, perception, and language | Cited by 11

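One-sentence summary

State-of-the-art language models (GPT-3/3.5/4) produce pairwise similarity judgments across six sensory modalities that correlate significantly with human data, recover known representations such as the color wheel and the pitch spiral, and reveal language-dependent effects in color naming.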

ABSTRACT

Determining the extent to which the perceptual world can be recovered from language is a longstanding problem in philosophy and cognitive science. We show that state-of-the-art large language models can unlock new insights into this problem by providing a lower bound on the amount of perceptual information that can be extracted from language. Specifically, we elicit pairwise similarity judgments from GPT models across six psychophysical datasets. We show that the judgments are significantly correlated with human data across all domains, recovering well-known representations like the color wheel and pitch spiral. Surprisingly, we find that a model (GPT-4) co-trained on vision and language does not necessarily lead to improvements specific to the visual modality. To study the influence of specific languages on perception, we also apply the models to a multilingual color-naming task. We find that GPT-4 replicates cross-linguistic variation in English and Russian illuminating the interaction of language and perception.

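Motivation and goals

Investigate how much perceptual information about the world can be recovered from language using large language models.
Assess whether similarity judgments elicited from LLMs agree with human perceptual representations across multiple modalities.
Test whether multimodal training (text + images) improves modality-specific predictions compared with language-only training.
Explore cross-linguistic effects in perception by testing color naming in English and Russian.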

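Proposed method

Use tailored prompts with in-context examples to elicit 10 pairwise similarity ratings per stimulus pair from GPT-3, GPT-3.5, and GPT-4.
Compare the model-derived similarity scores with human data using the Pearson correlation coefficient in each of the six modalities.
Apply multidimensional scaling (MDS) to examine whether known perceptual structures emerge, recovering the color wheel, the pitch spiral, and consonant representations.
Run a multilingual color-naming task (English and Russian) to test the language dependence of perceptual representations.
Collect model-generated explanations of the pairwise judgments to assess whether they align with perceptual concepts (octave relations, place of articulation, the color spectrum).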
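
The comparison step in the method (repeated ratings per stimulus pair, then a Pearson correlation against human data) can be sketched as follows. This is a minimal illustration and not the authors' code: the function names, the dictionary layout, and the toy color ratings are all assumptions made for the example, and averaging the repeated ratings before correlating is one plausible aggregation choice, not one confirmed by the summary.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def mean_rating_per_pair(ratings):
    """Average the repeated ratings collected for each stimulus pair."""
    return {pair: mean(values) for pair, values in ratings.items()}


def model_human_correlation(model_ratings, human_scores):
    """Correlate averaged model ratings with human scores on shared pairs."""
    model_means = mean_rating_per_pair(model_ratings)
    pairs = sorted(set(model_means) & set(human_scores))
    return pearson_r(
        [model_means[p] for p in pairs],
        [human_scores[p] for p in pairs],
    )


# Toy data (hypothetical): three repeated model ratings per color pair,
# and one aggregated human similarity score per pair.
model_ratings = {
    ("red", "orange"): [0.8, 0.9, 0.7],
    ("red", "green"): [0.2, 0.1, 0.3],
    ("red", "blue"): [0.3, 0.2, 0.4],
    ("green", "blue"): [0.6, 0.5, 0.7],
}
human_scores = {
    ("red", "orange"): 0.85,
    ("red", "green"): 0.15,
    ("red", "blue"): 0.25,
    ("green", "blue"): 0.55,
}

r = model_human_correlation(model_ratings, human_scores)
print(f"model-human Pearson r = {r:.3f}")
```

Restricting the correlation to pairs present in both dictionaries keeps the sketch robust to missing ratings; a real analysis would run it once per modality.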

Figure 1: A. Schematic of the LLM-based and human similarity judgment elicitation paradigms. B. Correlations between models and human data across six perceptual modalities, namely, pitch, loudness, colors, consonants, taste, and timbre (Pearson $r$; 95% CIs).

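Experimental results

Research questions

RQ1: Can LLMs give similarity judgments that agree with human perceptual representations across multiple modalities?
RQ2: Can LLMs recover well-known perceptual structures such as the color wheel and the pitch spiral from language?
RQ3: Does multimodal training improve modality-specific performance beyond what language provides?
RQ4: Are color naming and perceptual representations affected by the prompt language, revealing language-dependent perception?
RQ5: To what extent do LLMs replicate the cross-linguistic differences in color naming observed in humans?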

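Key findings

GPT-4 aligns most strongly with human data in most modalities, with correlations such as r=.92 for pitch and r=.89 for color.
GPT-3.5 achieves high correlations for loudness (r=.89) and other domains, and is generally among the top two models overall.
Inter-rater reliability (IRR) for pitch (r=.90) and consonants (r=.46) indicates that GPT-4 approaches human-level agreement in some domains.
MDS analysis reveals interpretable perceptual spaces: a pitch spiral with 12-semitone structure, a color wheel, and a production-based consonant representation.
Color naming with GPT-4 reproduces cross-linguistic differences between English and Russian, consistent with known human cross-linguistic patterns.
GPT-4's improved performance is attributed to richer text training rather than to multimodal (image) input alone.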
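
To illustrate how MDS can turn a similarity matrix into a spatial layout such as a color wheel, here is a small sketch using classical (Torgerson) MDS with NumPy. The summary does not say which MDS variant the authors used, so classical MDS is a stand-in chosen for brevity; the `classical_mds` function and the synthetic 12-hue similarity matrix are assumptions for the example, not data from the paper.

```python
import numpy as np


def classical_mds(dissimilarity, n_components=2):
    """Classical (Torgerson) MDS: embed items so that Euclidean distances
    approximate the given dissimilarities."""
    d = np.asarray(dissimilarity, dtype=float)
    n = d.shape[0]
    # Double-center the squared dissimilarities to obtain a Gram matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering
    # eigh returns eigenvalues in ascending order; keep the largest ones.
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_components]
    top_vals = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(top_vals)


# Synthetic example (hypothetical): 12 hues equally spaced on a circle,
# with similarity decreasing as the chord distance between hues grows.
angles = np.linspace(0.0, 2.0 * np.pi, 12, endpoint=False)
points = np.column_stack([np.cos(angles), np.sin(angles)])
distances = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
similarity = 1.0 - distances / distances.max()

# Convert similarity to dissimilarity before running MDS.
dissimilarity = 1.0 - similarity
embedding = classical_mds(dissimilarity, n_components=2)

# If the wheel structure is recovered, all items sit at the same radius.
radii = np.linalg.norm(embedding, axis=1)
print(radii.round(3))
```

The recovered layout is only defined up to rotation and reflection, so the check uses radii and pairwise distances instead of raw coordinates.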

Figure 2: A. Human and LLM similarity marginals and an example GPT-3 corresponding similarity matrix and its three-dimensional MDS solution for pitch. B. MDS solutions for vocal consonants and colors for GPT-4 similarity matrices. To illustrate the structure of the results, we highlighted consonants w


This review was generated by AI and reviewed by human editors.