QUICK REVIEW

[Paper Review] Visual representations in the human brain are aligned with large language models

Adrien Doerig, Tim C. Kietzmann|arXiv (Cornell University)|Sep 23, 2022

Multimodal Machine Learning Applications | Cited by 37

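One-Line Summary

This study shows that large language model (LLM) embeddings of scene captions characterise the brain activity evoked by natural scenes, and that transforming images into LLM space yields representations that are closely aligned with brain data.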

ABSTRACT

The human brain extracts complex information from visual inputs, including objects, their spatial and semantic interrelations, and their interactions with the environment. However, a quantitative approach for studying this information remains elusive. Here, we test whether the contextual information encoded in large language models (LLMs) is beneficial for modelling the complex visual information extracted by the brain from natural scenes. We show that LLM embeddings of scene captions successfully characterise brain activity evoked by viewing the natural scenes. This mapping captures selectivities of different brain areas, and is sufficiently robust that accurate scene captions can be reconstructed from brain activity. Using carefully controlled model comparisons, we then proceed to show that the accuracy with which LLM representations match brain representations derives from the ability of LLMs to integrate complex information contained in scene captions beyond that conveyed by individual words. Finally, we train deep neural network models to transform image inputs into LLM representations. Remarkably, these networks learn representations that are better aligned with brain representations than a large number of state-of-the-art alternative models, despite being trained on orders-of-magnitude less data. Overall, our results suggest that LLM embeddings of scene captions provide a representational format that accounts for complex information extracted by the brain from visual inputs.

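Motivation and Objectives

Investigate whether the contextual information encoded in LLMs is useful for modelling the brain's complex visual representations.
Characterise how LLM embeddings of scene captions correspond to the brain activity evoked by natural scenes.
Assess whether LLM-based representations capture more information than individual words, and whether they relate to selectivity across brain areas.
Explore whether deep networks trained to map images into LLM space achieve strong brain alignment with limited data.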

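Proposed Method

Compute LLM embeddings of captions describing natural scenes and relate them to brain activity patterns measured during scene viewing.
Assess the selectivity of different brain areas for LLM-derived representations.
Attempt to reconstruct accurate scene captions from brain activity.
Train deep neural networks to transform image inputs into LLM representations, and compare their brain alignment against a large number of baseline models.
Run carefully controlled model comparisons to isolate the contribution of integrated caption-level information.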
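
The first step, relating caption embeddings to measured brain activity, is often implemented as a linear encoding model. The sketch below is one possible way to do this and is not taken from the paper: it assumes ridge regression as the mapping, per-voxel Pearson correlation as the score, and uses synthetic random arrays in place of real caption embeddings and fMRI responses. The names `fit_ridge` and `voxelwise_correlation` are hypothetical helpers introduced for illustration.

```python
import numpy as np

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    n_features = X.shape[1]
    gram = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ Y)

def voxelwise_correlation(Y_true, Y_pred):
    """Pearson correlation between measured and predicted responses, per voxel."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.linalg.norm(yt, axis=0) * np.linalg.norm(yp, axis=0)
    return (yt * yp).sum(axis=0) / denom

# Synthetic stand-ins (assumption): random "caption embeddings" and
# "voxel responses" generated from a hidden linear map plus noise.
rng = np.random.default_rng(0)
n_train, n_test, n_dims, n_voxels = 400, 100, 32, 50
true_map = rng.normal(size=(n_dims, n_voxels))
X_train = rng.normal(size=(n_train, n_dims))
X_test = rng.normal(size=(n_test, n_dims))
Y_train = X_train @ true_map + rng.normal(scale=2.0, size=(n_train, n_voxels))
Y_test = X_test @ true_map + rng.normal(scale=2.0, size=(n_test, n_voxels))

# Fit on training scenes, then score predictions on held-out scenes.
W = fit_ridge(X_train, Y_train, alpha=10.0)
scores = voxelwise_correlation(Y_test, X_test @ W)
print(f"mean held-out correlation: {scores.mean():.3f}")
```

Scoring on held-out scenes matters here: a map fitted and evaluated on the same scenes would overstate how well the embeddings account for the responses.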

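Experimental Results

Research Questions

RQ1: Can LLM embeddings of scene captions quantitatively characterise brain responses to natural scenes?
RQ2: Do LLM-based representations capture brain selectivity beyond individual words and local features?
RQ3: Can scene captions be reconstructed from brain activity using LLM representations?
RQ4: Do image-to-LLM transformation models achieve stronger brain alignment than existing state-of-the-art models?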
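
To make RQ3 concrete, caption reconstruction can be framed as decoding an embedding from a brain pattern and then retrieving the closest caption from a candidate set. The following is a minimal sketch under stated assumptions, not the authors' procedure: the captions are invented, the embeddings and brain responses are synthetic, the decoder is a ridge regression, and retrieval uses cosine similarity. `decode_caption` is a hypothetical helper.

```python
import numpy as np

# Hypothetical captions and random stand-in "LLM embeddings" (assumption).
captions = [
    "a dog runs on the beach",
    "two people cook in a kitchen",
    "a train crosses a bridge",
    "children play football in a park",
    "a cat sleeps on a sofa",
]
rng = np.random.default_rng(1)
n_dims, n_voxels, n_train = 16, 40, 200
embeddings = rng.normal(size=(len(captions), n_dims))

# Simulated brain responses: a hidden linear map of the caption embedding plus noise.
hidden_map = rng.normal(size=(n_dims, n_voxels))
train_ids = rng.integers(0, len(captions), size=n_train)
brain_train = embeddings[train_ids] @ hidden_map + rng.normal(
    scale=0.1, size=(n_train, n_voxels)
)

# Ridge decoder from voxel space back to embedding space.
alpha = 1.0
gram = brain_train.T @ brain_train + alpha * np.eye(n_voxels)
decoder = np.linalg.solve(gram, brain_train.T @ embeddings[train_ids])

def decode_caption(brain_pattern):
    """Return the candidate caption whose embedding is most cosine-similar to the decoded one."""
    predicted = brain_pattern @ decoder
    sims = embeddings @ predicted / (
        np.linalg.norm(embeddings, axis=1) * np.linalg.norm(predicted)
    )
    return captions[int(np.argmax(sims))]

# A new noisy response to the scene described by captions[2].
new_pattern = embeddings[2] @ hidden_map + rng.normal(scale=0.1, size=n_voxels)
decoded = decode_caption(new_pattern)
print(decoded)
```

Retrieval from a fixed candidate list is the simplest form of reconstruction; generating free-form text from the decoded embedding would need an additional language-generation step that this sketch does not attempt.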

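Key Findings

LLM embeddings of scene captions successfully characterise the brain activity evoked by viewing natural scenes.
This mapping captures the selectivities of different brain areas.
Accurate scene captions can be reconstructed from brain activity.
The accuracy of the brain-LLM match derives from the ability of LLMs to integrate complex information in a caption beyond that conveyed by its individual words.
Deep networks trained to map images into LLM representations learn representations that are better aligned with brain data than a large number of state-of-the-art alternative models, despite being trained on orders of magnitude less data.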
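
Claims that one model is "better aligned" with the brain than another require an alignment score. This summary does not say which metric the paper uses, so the sketch below swaps in a common choice, representational similarity analysis (RSA), purely as an illustration. All data are synthetic: `full_model` stands in for a representation that keeps all latent scene dimensions, and `partial_model` for one that keeps only a few. `rdm` and `rdm_similarity` are hypothetical helpers, and Pearson correlation is used for simplicity where rank correlation is also common.

```python
import numpy as np

def rdm(features):
    """Representational dissimilarity matrix: 1 - Pearson correlation between item rows."""
    return 1.0 - np.corrcoef(features)

def rdm_similarity(rdm_a, rdm_b):
    """Pearson correlation between the upper triangles of two RDMs."""
    iu = np.triu_indices(rdm_a.shape[0], k=1)
    return np.corrcoef(rdm_a[iu], rdm_b[iu])[0, 1]

# Synthetic latent scene structure shared by the "brain" and the models (assumption).
rng = np.random.default_rng(2)
n_scenes, n_dims = 60, 20
latent = rng.normal(size=(n_scenes, n_dims))

# Noisy "brain" measurements and two candidate model representations.
brain = latent + rng.normal(scale=0.5, size=latent.shape)
full_model = latent + rng.normal(scale=0.1, size=latent.shape)
partial_model = latent[:, :4] + rng.normal(scale=0.1, size=(n_scenes, 4))

brain_rdm = rdm(brain)
full_score = rdm_similarity(rdm(full_model), brain_rdm)
partial_score = rdm_similarity(rdm(partial_model), brain_rdm)
print(f"full model alignment:    {full_score:.3f}")
print(f"partial model alignment: {partial_score:.3f}")
```

Because RSA compares pairwise dissimilarity structure, it lets representations with different dimensionalities be scored against the same brain data without fitting a mapping between them.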


This review was generated by AI and checked by a human editor.