QUICK REVIEW

[Paper Review] Brain encoding models based on multimodal transformers can transfer across language and vision

Jerry Tang, Meng Du|PubMed|May 20, 2023

Language, Metaphor, and Cognition | Citations: 16

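One-Sentence Summary

Using features from the BridgeTower multimodal transformer, this paper shows that encoding models trained on fMRI responses to stories can predict brain responses to movies, and vice versa; this cross-modality transfer reveals shared semantic representations, and multimodal features transfer better than aligned unimodal features.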

ABSTRACT

Encoding models have been used to assess how the human brain represents concepts in language and vision. While language and vision rely on similar concept representations, current encoding models are typically trained and tested on brain responses to each modality in isolation. Recent advances in multimodal pretraining have produced transformers that can extract aligned representations of concepts in language and vision. In this work, we used representations from multimodal transformers to train encoding models that can transfer across fMRI responses to stories and movies. We found that encoding models trained on brain responses to one modality can successfully predict brain responses to the other modality, particularly in cortical regions that represent conceptual meaning. Further analysis of these encoding models revealed shared semantic dimensions that underlie concept representations in language and vision. Comparing encoding models trained using representations from multimodal and unimodal transformers, we found that multimodal transformers learn more aligned representations of concepts in language and vision. Our results demonstrate how multimodal transformers can provide insights into the brain's capacity for multimodal processing.

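Motivation and Objectives

Investigate whether an encoding model trained on one modality (language or vision) can predict brain responses to the other modality.
Use BridgeTower features to assess whether language and vision concepts are aligned in the brain.
Identify the semantic dimensions shared by language and vision representations.
Assess whether multimodal training improves cross-modality transfer over aligned unimodal features.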

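Proposed Method

Extract stimulus features for stories and movies with BridgeTower, a multimodal transformer trained on image-text data.
Using the BridgeTower features, train language encoding models on story fMRI and vision encoding models on movie fMRI.
Evaluate cross-modality transfer by using story-trained models to predict movie fMRI and movie-trained models to predict story fMRI.
Align the BridgeTower language and vision feature spaces with linear mappings estimated from Flickr30K, which enables cross-modality projection.
Map stimuli to brain responses with voxelwise L2-regularized regression that accounts for hemodynamic delay.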
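The steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names, the delays of 1 to 4 TRs, and the fixed regularization strength `alpha` are all hypothetical choices, and in practice the regularization would typically be selected by cross-validation. The `image_to_text` matrix stands in for a linear mapping estimated from paired image and caption features.

```python
import numpy as np

def add_delays(features, delays=(1, 2, 3, 4)):
    """Stack time-shifted copies of the features (a finite impulse response
    model) so the regression can absorb the slow hemodynamic response.
    The delay values are an assumption made for this sketch."""
    n_tr, n_feat = features.shape
    delayed = np.zeros((n_tr, n_feat * len(delays)))
    for i, d in enumerate(delays):
        delayed[d:, i * n_feat:(i + 1) * n_feat] = features[:n_tr - d]
    return delayed

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form L2-regularized regression; each column of Y (one voxel,
    or one target feature) gets its own weight vector."""
    n_feat = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_feat), X.T @ Y)

def voxelwise_correlation(Y_true, Y_pred):
    """Pearson correlation between measured and predicted responses, per voxel."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.linalg.norm(yt, axis=0) * np.linalg.norm(yp, axis=0)
    return (yt * yp).sum(axis=0) / denom

def cross_modal_scores(story_feats, story_fmri, movie_feats, movie_fmri,
                       image_to_text, alpha=1.0):
    """Train a language encoding model on story data, then score it on movie
    fMRI after projecting movie features into the language feature space."""
    weights = fit_ridge(add_delays(story_feats), story_fmri, alpha)
    projected = movie_feats @ image_to_text
    predicted = add_delays(projected) @ weights
    return voxelwise_correlation(movie_fmri, predicted)
```

In this sketch the alignment matrix could itself be estimated with `fit_ridge(image_feats, caption_feats)` on paired features, and the reverse direction (movie-trained model, story test data) follows by swapping the roles of the two modalities.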

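Experimental Results

Research Questions

RQ1: Can encoding models trained on responses to language predict fMRI responses to movie stimuli, and vice versa?
RQ2: Does cross-modality transfer indicate that semantic representations of language and vision are aligned in cortex?
RQ3: Do multimodal transformer features yield better cross-modality transfer than unimodal features?
RQ4: Which semantic dimensions underlie the language-vision representations shared in the brain?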

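Key Findings

Cross-modality encoding performance is positive in many parietal, temporal, and frontal regions outside primary sensory areas.
Inverted tuning in visual cortex requires correction, and applying the correction improves estimates of cross-modality transfer.
In some regions, cross-modality performance approaches within-modality performance, suggesting that concept representations are similar across modalities.
Multimodal BridgeTower features outperform unimodal RoBERTa and ViT features at cross-modality transfer in regions outside visual and auditory cortex.
PCA of the encoding weights reveals semantic dimensions shared by language and vision in multimodal voxels, particularly along PCs 1, 3, and 5.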
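The weight-based PCA analysis can be illustrated roughly as follows. This is a sketch under assumptions rather than the paper's exact procedure: it assumes the language and vision encoding weights have already been brought into a common feature space and collapsed across delays, and the function names are hypothetical. Principal components are estimated from the language weights, and each component is scored by how well the language and vision projections agree across voxels.

```python
import numpy as np

def weight_pcs(weights, n_components=5):
    """weights: (n_features, n_voxels). Returns the top principal
    components of the voxel weight vectors, shape (n_components, n_features)."""
    per_voxel = weights.T
    centered = per_voxel - per_voxel.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:n_components]

def pc_agreement(lang_weights, vis_weights, n_components=5):
    """For each language-derived PC, correlate the language and vision
    projections across voxels. High values suggest a shared dimension."""
    pcs = weight_pcs(lang_weights, n_components)
    lang_proj = pcs @ lang_weights  # (n_components, n_voxels)
    vis_proj = pcs @ vis_weights
    return np.array([np.corrcoef(l, v)[0, 1]
                     for l, v in zip(lang_proj, vis_proj)])
```

A component with a strongly positive agreement score would correspond to a semantic dimension along which a voxel's language tuning and vision tuning vary together.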

This review was generated by AI and checked by a human editor.