QUICK REVIEW

[Paper Review] Visual representations in the human brain are aligned with large language models

Adrien Doerig, Tim C. Kietzmann|arXiv (Cornell University)|Sep 23, 2022

Multimodal Machine Learning Applications|Cited by 37

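One-sentence summary

This study shows that large language model (LLM) embeddings of scene captions can characterise brain activity evoked by natural scenes, and that transforming images into LLM space yields representations that align closely with brain data.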

ABSTRACT

The human brain extracts complex information from visual inputs, including objects, their spatial and semantic interrelations, and their interactions with the environment. However, a quantitative approach for studying this information remains elusive. Here, we test whether the contextual information encoded in large language models (LLMs) is beneficial for modelling the complex visual information extracted by the brain from natural scenes. We show that LLM embeddings of scene captions successfully characterise brain activity evoked by viewing the natural scenes. This mapping captures selectivities of different brain areas, and is sufficiently robust that accurate scene captions can be reconstructed from brain activity. Using carefully controlled model comparisons, we then proceed to show that the accuracy with which LLM representations match brain representations derives from the ability of LLMs to integrate complex information contained in scene captions beyond that conveyed by individual words. Finally, we train deep neural network models to transform image inputs into LLM representations. Remarkably, these networks learn representations that are better aligned with brain representations than a large number of state-of-the-art alternative models, despite being trained on orders-of-magnitude less data. Overall, our results suggest that LLM embeddings of scene captions provide a representational format that accounts for complex information extracted by the brain from visual inputs.

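Research motivation and objectives

Investigate whether the contextual information encoded in LLMs helps model the complex visual information that the brain extracts from natural scenes.
Characterise how LLM embeddings of scene captions map onto brain activity evoked by natural scenes.
Assess whether LLM-based representations capture information beyond the level of individual words and whether they relate to the selectivities of different brain areas.
Explore whether deep networks that map images into LLM space can achieve strong alignment with brain data when trained on limited data.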

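Proposed method

Compute LLM embeddings of captions describing natural scenes, and relate them to brain activity patterns measured while participants view those scenes.
Evaluate the selectivity of different brain areas for LLM-derived representations.
Attempt to reconstruct accurate scene captions from brain activity.
Train deep neural networks to transform image inputs into LLM representations, and compare their brain alignment against a large number of baseline models.
Run carefully controlled model comparisons to isolate the contribution of integrated, caption-level information.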
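
The first step above, relating caption embeddings to measured brain activity, can be sketched as a linear encoding model. This is a minimal illustration under stated assumptions, not the authors' pipeline: the source does not specify the regression method, the embedding model, or the data shapes, so ridge regression, the synthetic data, and the helper names `fit_ridge` and `voxelwise_correlation` are all hypothetical.

```python
import numpy as np


def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    n_features = X.shape[1]
    gram = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ Y)


def voxelwise_correlation(Y_true, Y_pred):
    """Pearson correlation between measured and predicted responses, per voxel."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.linalg.norm(yt, axis=0) * np.linalg.norm(yp, axis=0)
    return (yt * yp).sum(axis=0) / denom


# Synthetic stand-ins (assumption): 200 scenes, 16-dim caption embeddings, 50 voxels.
rng = np.random.default_rng(0)
n_scenes, n_dims, n_voxels = 200, 16, 50
embeddings = rng.standard_normal((n_scenes, n_dims))
true_weights = rng.standard_normal((n_dims, n_voxels))
noise = 0.5 * rng.standard_normal((n_scenes, n_voxels))
responses = embeddings @ true_weights + noise

# Fit on the first 150 scenes, then score predictions on the 50 held-out scenes.
train, test = slice(0, 150), slice(150, None)
weights = fit_ridge(embeddings[train], responses[train], alpha=1.0)
scores = voxelwise_correlation(responses[test], embeddings[test] @ weights)
print(f"mean held-out correlation: {scores.mean():.3f}")
```

Scoring on held-out scenes rather than the training scenes is what makes the correlation a measure of how well the embedding space generalises to new stimuli.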

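Experimental results

Research questions

RQ1: Can LLM embeddings of scene captions quantitatively characterise brain responses to natural scenes?
RQ2: Do LLM-based representations capture brain selectivities beyond individual words or local features?
RQ3: Is it possible to reconstruct scene captions from brain activity using LLM representations?
RQ4: Do models that transform images into LLM representations align with brain data better than existing state-of-the-art models?

Key findings

LLM embeddings of scene captions successfully characterise brain activity evoked by viewing natural scenes.
This mapping captures the selectivities of different brain areas.
Accurate scene captions can be reconstructed from brain activity.
The accuracy of brain-LLM alignment derives from the ability of LLMs to integrate complex information in captions beyond that conveyed by individual words.
Deep networks trained to map images into LLM representations learn representations that are better aligned with brain data than many alternative models, despite being trained on orders-of-magnitude less data.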
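
The caption-reconstruction finding can be illustrated with a decoding sketch that runs in the opposite direction: predict an embedding from brain activity, then pick the closest caption from a candidate set. The source does not describe the decoding procedure, so the linear decoder, the cosine nearest-neighbour lookup, the synthetic data, and the names `fit_ridge` and `nearest_caption` are assumptions made for illustration only.

```python
import numpy as np


def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    gram = X.T @ X + alpha * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ Y)


def nearest_caption(predicted, candidates):
    """For each predicted embedding, return the index of the most cosine-similar candidate."""
    p = predicted / np.linalg.norm(predicted, axis=1, keepdims=True)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return (p @ c.T).argmax(axis=1)


# Synthetic stand-ins (assumption): 300 scenes, 16-dim caption embeddings, 80 voxels.
rng = np.random.default_rng(1)
n_scenes, n_dims, n_voxels = 300, 16, 80
caption_emb = rng.standard_normal((n_scenes, n_dims))
mixing = rng.standard_normal((n_dims, n_voxels))
brain = caption_emb @ mixing + 0.5 * rng.standard_normal((n_scenes, n_voxels))

# Train a decoder from brain activity to embedding space on 250 scenes.
train, test = slice(0, 250), slice(250, None)
decoder = fit_ridge(brain[train], caption_emb[train], alpha=10.0)
predicted_emb = brain[test] @ decoder

# Identify each held-out scene's caption among the 50 held-out candidates.
candidates = caption_emb[test]
picked = nearest_caption(predicted_emb, candidates)
accuracy = (picked == np.arange(len(picked))).mean()
print(f"identification accuracy: {accuracy:.2f}")
```

In a real setting the candidate set would be a pool of actual caption embeddings, and the returned index would select the caption text to report.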

This review was generated by AI and reviewed by human editors.