QUICK REVIEW

[Paper Review] Brain encoding models based on multimodal transformers can transfer across language and vision

Jerry Tang, Meng Du | PubMed | May 20, 2023

Language, Metaphor, and Cognition | Cited by 16

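ONE-SENTENCE SUMMARY

Using the BridgeTower multimodal transformer, this paper shows that encoding models trained on fMRI responses to language stories can predict brain responses to movies, and vice versa; the cross-modal transfer reveals shared semantic representations, and multimodal features align language and vision better than unimodal features do.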

ABSTRACT

Encoding models have been used to assess how the human brain represents concepts in language and vision. While language and vision rely on similar concept representations, current encoding models are typically trained and tested on brain responses to each modality in isolation. Recent advances in multimodal pretraining have produced transformers that can extract aligned representations of concepts in language and vision. In this work, we used representations from multimodal transformers to train encoding models that can transfer across fMRI responses to stories and movies. We found that encoding models trained on brain responses to one modality can successfully predict brain responses to the other modality, particularly in cortical regions that represent conceptual meaning. Further analysis of these encoding models revealed shared semantic dimensions that underlie concept representations in language and vision. Comparing encoding models trained using representations from multimodal and unimodal transformers, we found that multimodal transformers learn more aligned representations of concepts in language and vision. Our results demonstrate how multimodal transformers can provide insights into the brain's capacity for multimodal processing.

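RESEARCH MOTIVATION AND OBJECTIVES

Investigate whether an encoding model trained on one modality (language or vision) can predict brain responses to the other modality.
Determine whether multimodal transformer representations align language and vision concepts in the brain.
Identify the semantic dimensions shared between language and vision representations.
Assess whether features from multimodal pretraining support better cross-modal transfer than aligned unimodal features.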

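PROPOSED METHOD

Use BridgeTower, a multimodal transformer trained on image-text data, to extract stimulus features for the stories and movies.
Using the BridgeTower features, train a language encoding model on story-fMRI and a vision encoding model on movie-fMRI.
Evaluate cross-modal transfer by using the story-trained model to predict movie-fMRI responses and the movie-trained model to predict story-fMRI responses.
Align the BridgeTower language and vision feature spaces with linear maps estimated from Flickr30K to enable cross-modal projection.
Fit voxelwise L2-regularized (ridge) regression, with correction for hemodynamic delay, to map stimulus features to brain responses.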
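The steps above can be sketched on synthetic data. The block below is a minimal illustration, not the authors' implementation: it assumes a finite set of time delays as the hemodynamic correction, a closed-form ridge solution with a single fixed penalty, and a least-squares map from vision features to language features standing in for the Flickr30K alignment. All names (make_delayed, fit_ridge, B_vis_to_lang), the toy sizes, and the simulated "brain" are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_delayed(X, delays=(1, 2, 3, 4)):
    """Stack time-shifted copies of X (time x features) to model hemodynamic lag."""
    T, F = X.shape
    out = np.zeros((T, F * len(delays)))
    for i, d in enumerate(delays):
        out[d:, i * F:(i + 1) * F] = X[:T - d]
    return out

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form L2-regularized regression; one weight column per voxel."""
    n_feat = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_feat), X.T @ Y)

def voxelwise_corr(Y_true, Y_pred):
    """Pearson correlation between measured and predicted responses, per voxel."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.linalg.norm(yt, axis=0) * np.linalg.norm(yp, axis=0)
    return (yt * yp).sum(axis=0) / np.maximum(denom, 1e-12)

# Toy setup (assumption): both modalities are rotations of one latent concept space.
F, V = 8, 5                                             # feature dim, voxel count
P_lang, _ = np.linalg.qr(rng.standard_normal((F, F)))   # latent -> language features
P_vis, _ = np.linalg.qr(rng.standard_normal((F, F)))    # latent -> vision features
G = rng.standard_normal((F, V))                         # hypothetical voxel tuning

def simulate(T):
    """Latent concepts plus noisy responses that lag the stimulus by 2 samples."""
    Z = rng.standard_normal((T, F))
    Y = make_delayed(Z, delays=(2,)) @ G + 0.1 * rng.standard_normal((T, V))
    return Z, Y

# Alignment map from paired caption/image features (stand-in for Flickr30K pairs).
Z_pair = rng.standard_normal((500, F))
B_vis_to_lang, *_ = np.linalg.lstsq(Z_pair @ P_vis, Z_pair @ P_lang, rcond=None)

# Train a language encoding model on simulated story responses.
Z_story, Y_story = simulate(300)
W_lang = fit_ridge(make_delayed(Z_story @ P_lang), Y_story, alpha=1.0)

# Transfer: project movie features into the language space, then predict.
Z_movie, Y_movie = simulate(200)
X_movie = Z_movie @ P_vis
cross_corr = voxelwise_corr(Y_movie, make_delayed(X_movie @ B_vis_to_lang) @ W_lang)
unaligned_corr = voxelwise_corr(Y_movie, make_delayed(X_movie) @ W_lang)

print("mean cross-modal r (aligned):  ", cross_corr.mean())
print("mean cross-modal r (unaligned):", unaligned_corr.mean())
```

In this toy setting the aligned projection should transfer far better than feeding raw vision features to the language model, which is the role the alignment step plays. A real analysis would choose the ridge penalty by cross-validation and evaluate on held-out stimuli.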

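EXPERIMENTAL RESULTS

Research Questions

RQ1: Can an encoding model trained on responses to language predict fMRI responses to visual movie stimuli, and vice versa?
RQ2: Does cross-modal transfer reveal semantic representations in cortex that are aligned between language and vision?
RQ3: Do multimodal transformer features outperform unimodal features for cross-modal transfer?
RQ4: Which semantic dimensions underlie the shared language-vision representations in the brain?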

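Key Findings

Cross-modal encoding performance is positive in many parietal, temporal, and frontal regions (outside primary sensory areas).
Tuning in visual cortex is inverted between the two modalities and requires correction, which improves estimates of cross-modal transfer.
In several regions, cross-modal performance approaches within-modality performance, indicating similar concept representations across the two modalities.
Outside visual and auditory cortex, multimodal BridgeTower features outperform unimodal RoBERTa and ViT features for cross-modal transfer.
PCA of the encoding weights shows that language and vision share semantic dimensions in multimodal voxels, most notably PC1, PC3, and PC5.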
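The PCA finding can be illustrated with a small sketch. It assumes one plausible reading of the analysis: principal components are estimated from the language encoding weights, both weight sets are projected onto each component, and the projections are correlated across voxels. The weight matrices here are synthetic, and the function names and sizes are hypothetical rather than taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def pca_components(W, n_components):
    """Top principal axes of W (voxels x features), via SVD of the centered matrix."""
    Wc = W - W.mean(axis=0)
    _, _, Vt = np.linalg.svd(Wc, full_matrices=False)
    return Vt[:n_components]

def projection_corr(W_a, W_b, pcs):
    """Per-component correlation, across voxels, of two weight sets' projections."""
    pa = W_a @ pcs.T
    pb = W_b @ pcs.T
    pa = pa - pa.mean(axis=0)
    pb = pb - pb.mean(axis=0)
    denom = np.linalg.norm(pa, axis=0) * np.linalg.norm(pb, axis=0)
    return (pa * pb).sum(axis=0) / np.maximum(denom, 1e-12)

# Synthetic weights (assumption): vision weights are a noisy copy of language weights.
n_voxels, n_features = 200, 16
scales = np.linspace(3.0, 0.5, n_features)          # unequal variance per feature
W_lang = rng.standard_normal((n_voxels, n_features)) * scales
W_vis = W_lang + 0.3 * rng.standard_normal((n_voxels, n_features))

pcs = pca_components(W_lang, n_components=5)
corr = projection_corr(W_lang, W_vis, pcs)

for i, r in enumerate(corr, start=1):
    print(f"PC{i}: language-vision projection r = {r:.3f}")
```

A component with a high correlation is one along which a voxel's language tuning and vision tuning agree, which is what a "shared semantic dimension" means in this sketch.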


This review was generated by AI and reviewed by human editors.