QUICK REVIEW

[Paper Review] Mind Reader: Reconstructing complex images from brain activities

Sikun Lin, Thomas C. Sprague | arXiv (Cornell University) | Sep 30, 2022

Cell Image Analysis Techniques | Cited by 34

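One-sentence summary

The paper reconstructs complex, semantically rich images by mapping brain signals into a pre-aligned vision-language latent space (CLIP) and then generating images with a conditional generator that is conditioned on embeddings in that space and on the image captions.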

ABSTRACT

Understanding how the brain encodes external stimuli and how these stimuli can be decoded from the measured brain activities are long-standing and challenging questions in neuroscience. In this paper, we focus on reconstructing the complex image stimuli from fMRI (functional magnetic resonance imaging) signals. Unlike previous works that reconstruct images with single objects or simple shapes, our work aims to reconstruct image stimuli that are rich in semantics, closer to everyday scenes, and can reveal more perspectives. However, data scarcity of fMRI datasets is the main obstacle to applying state-of-the-art deep learning models to this problem. We find that incorporating an additional text modality is beneficial for the reconstruction problem compared to directly translating brain signals to images. Therefore, the modalities involved in our method are: (i) voxel-level fMRI signals, (ii) observed images that trigger the brain signals, and (iii) textual description of the images. To further address data scarcity, we leverage an aligned vision-language latent space pre-trained on massive datasets. Instead of training models from scratch to find a latent space shared by the three modalities, we encode fMRI signals into this pre-aligned latent space. Then, conditioned on embeddings in this space, we reconstruct images with a generative model. The reconstructed images from our pipeline balance both naturalness and fidelity: they are photo-realistic and capture the ground truth image contents well.

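Motivation and goals

Decode images of complex real-world scenes from fMRI signals.
Evaluate whether adding a text modality improves image reconstruction compared with a vision-only mapping.
Use a large, aligned vision-language latent space to mitigate the scarcity of fMRI data.
Demonstrate a two-stage training approach that separates the brain-to-latent-space mapping from latent-space-to-image generation.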

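Proposed method

Use CNN-based mappers to map ROI-reduced fMRI signals to CLIP image embeddings and CLIP caption embeddings.
Screen image captions with the CLIP encoders to select a high-quality caption for each image (caption screening).
Use two separate mapping models (fMRI to h_img and fMRI to h_cap) to align with the CLIP embeddings of the image and the caption.
Fine-tune a conditional generator (Lafite-style) on CLIP-aligned embeddings to produce images conditioned on the h_img' and h_cap' embeddings.
Apply a multi-component loss (MSE, cosine similarity, contrastive loss) and a three-head discriminator to achieve realism and cross-modal alignment.
Train in two stages: (i) learn the fMRI→CLIP embedding mappings; (ii) fine-tune the conditional generator with the mappers frozen.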
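
The caption screening and multi-component loss steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the loss weights, the temperature, and the InfoNCE-style form of the contrastive term are all assumptions, and the embeddings are assumed to have been computed already by the CLIP encoders or the mappers.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale vectors along the last axis to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def select_caption(image_emb, caption_embs):
    """Hypothetical caption screening: return the index of the caption whose
    embedding has the highest cosine similarity to the image embedding.
    image_emb: (dim,), caption_embs: (num_captions, dim)."""
    sims = l2_normalize(caption_embs) @ l2_normalize(image_emb)
    return int(np.argmax(sims))

def mapper_loss(pred, target, w_mse=1.0, w_cos=1.0, w_con=1.0, temperature=0.1):
    """Hypothetical combination of MSE, cosine and contrastive terms between
    predicted embeddings and target CLIP embeddings, both (batch, dim).
    The weights and temperature are illustrative values only."""
    mse = np.mean((pred - target) ** 2)

    p, t = l2_normalize(pred), l2_normalize(target)
    cos = np.mean(1.0 - np.sum(p * t, axis=-1))  # 0 when directions match

    # InfoNCE-style term (an assumption): each prediction should be closer to
    # its own target than to the other targets in the batch.
    logits = (p @ t.T) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    con = -np.mean(np.diag(log_softmax))

    return float(w_mse * mse + w_cos * cos + w_con * con)
```

In this sketch a perfect prediction drives the MSE and cosine terms to zero and the contrastive term close to zero, while a prediction matched to the wrong target in the batch is penalized by all three terms.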

Experimental results

Research questions

RQ1: Can fMRI signals be mapped into a shared CLIP space that jointly encodes image content and textual captions?
RQ2: Does incorporating a text modality (captions) improve reconstruction fidelity for complex scenes compared to image-only mappings?
RQ3: How does a pre-trained, multimodal generator perform when conditioned on brain-derived CLIP embeddings?
RQ4: What is the impact of training strategy (frozen mappers vs. end-to-end) on reconstruction quality?

Key findings

The pipeline produces photo-realistic reconstructions that preserve content and relationships in complex scenes.
Mapping fMRI to CLIP embeddings (especially via the image path) yields better FID performance than caption-only mappings in most settings.
Using two-head conditioning (image and caption CLIP embeddings) with an LF-Lafite pre-trained generator outperforms single-head setups under several metrics.
End-to-end finetuning of mappers with GAN objectives generally degrades performance and can collapse embedding representations.
CLIP space serves as an effective intermediary for brain decoding, with fMRI and CLIP embeddings providing complementary information for reconstruction.
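
The findings above report FID (Fréchet Inception Distance), where lower values mean the generated images are statistically closer to the real ones. The sketch below shows the standard Fréchet distance between two sets of feature vectors, under the assumption that the features have already been extracted (FID conventionally uses Inception-v3 features, which are not computed here). The function name and the eigenvalue-based shortcut for the matrix square root trace are illustrative choices, not taken from the paper.

```python
import numpy as np

def frechet_distance(feats_real, feats_gen):
    """Fréchet distance between Gaussians fitted to two feature sets.
    feats_real, feats_gen: arrays of shape (num_samples, dim), dim >= 2.
    Computes ||mu_r - mu_g||^2 + Tr(C_r + C_g - 2 (C_r C_g)^(1/2))."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)

    # Tr((C_r C_g)^(1/2)) equals the sum of square roots of the eigenvalues of
    # C_r C_g, which are real and non-negative for covariance matrices.
    # Clip tiny negative values caused by floating-point error.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    trace_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * trace_sqrt)
```

Two identical feature sets give a distance of approximately zero, and shifting every feature vector by a constant offset adds the squared length of that offset, since the covariances are unchanged.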


This review was generated by AI and reviewed by human editors.