QUICK REVIEW

[Paper Review] Pixels to Voxels: Modeling Visual Representation in the Human Brain

Pulkit Agrawal, Dustin Stansbury|arXiv (Cornell University)|Jul 18, 2014

Visual Attention and Saliency Detection | References: 20 | Cited by: 76

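One-Sentence Summary

This paper proposes a framework that predicts human brain activity in visual cortex directly from image pixels, using Fisher Vectors and convolutional neural networks (ConvNets), without relying on hand-annotated semantic labels. Both models accurately predict fMRI responses in early, intermediate, and high-level visual areas, and the fitted models reveal functional subregions within the extrastriate body area (EBA).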

ABSTRACT

The human brain is adept at solving difficult high-level visual processing problems such as image interpretation and object recognition in natural scenes. Over the past few years neuroscientists have made remarkable progress in understanding how the human brain represents categories of objects and actions in natural scenes. However, all current models of high-level human vision operate on hand annotated images in which the objects and actions have been assigned semantic tags by a human operator. No current models can account for high-level visual function directly in terms of low-level visual input (i.e., pixels). To overcome this fundamental limitation we sought to develop a new class of models that can predict human brain activity directly from low-level visual input (i.e., pixels). We explored two classes of models drawn from computer vision and machine learning. The first class of models was based on Fisher Vectors (FV) and the second was based on Convolutional Neural Networks (ConvNets). We find that both classes of models accurately predict brain activity in high-level visual areas, directly from pixels and without the need for any semantic tags or hand annotation of images. This is the first time that such a mapping has been obtained. The fit models provide a new platform for exploring the functional principles of human vision, and they show that modern methods of computer vision and machine learning provide important tools for characterizing brain function.

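Motivation and Goals

Develop a computational model that predicts human visual brain activity directly from low-level visual input (pixels), without depending on hand-annotated semantic labels.
Overcome a limitation of earlier encoding models, which relied on subjective and time-consuming manual labeling of image categories.
Test whether modern computer vision features (Fisher Vectors and ConvNets) capture the functional organization of the human visual system across the whole cortical hierarchy.
Explore fine-grained functional organization within established visual regions of interest (ROIs), such as the extrastriate body area (EBA).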

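Proposed Method

Encode local image descriptors (e.g., SIFT) with Fisher Vectors (FV) to produce high-dimensional, discriminative features from raw pixel input.
Use a pretrained convolutional neural network (ConvNet) to extract a hierarchy of feature representations from the same pixel input.
Apply regularized linear regression to map FV and ConvNet features to fMRI voxel responses in visual cortex, fitting a separate model for each voxel.
Use the fitted models to predict brain activity for new images, and evaluate performance with the coefficient of determination (R²).
Run K-means clustering on the ConvNet model weights of voxels within the EBA to identify functionally distinct voxel subgroups.
Project the functional clusters onto flattened cortical maps to assess their spatial separation and check anatomical consistency across subjects.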
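
The regression and evaluation steps above can be sketched in a few lines. This is a minimal illustration on synthetic data, not the authors' pipeline: the feature matrices stand in for FV or ConvNet features, closed-form ridge regression is assumed as the "regularized linear regression" (the summary does not name the regularizer), and `fit_ridge`, `r2_per_voxel`, and the value of `alpha` are hypothetical choices made for this example.

```python
import numpy as np

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression; returns one weight column per voxel."""
    n_features = X.shape[1]
    # Solve (X^T X + alpha * I) W = X^T Y for all voxels at once.
    A = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(A, X.T @ Y)

def r2_per_voxel(Y_true, Y_pred):
    """Coefficient of determination, computed separately for each voxel."""
    ss_res = ((Y_true - Y_pred) ** 2).sum(axis=0)
    ss_tot = ((Y_true - Y_true.mean(axis=0)) ** 2).sum(axis=0)
    return 1.0 - ss_res / ss_tot

# Synthetic stand-ins (assumption): rows are images, columns are features/voxels.
rng = np.random.default_rng(0)
n_train, n_test, n_features, n_voxels = 200, 50, 20, 5
W_true = rng.normal(size=(n_features, n_voxels))
X_train = rng.normal(size=(n_train, n_features))
X_test = rng.normal(size=(n_test, n_features))
Y_train = X_train @ W_true + 0.5 * rng.normal(size=(n_train, n_voxels))
Y_test = X_test @ W_true + 0.5 * rng.normal(size=(n_test, n_voxels))

# Fit on training images, then score predictions on held-out images.
W_hat = fit_ridge(X_train, Y_train, alpha=1.0)
r2 = r2_per_voxel(Y_test, X_test @ W_hat)
print(r2)
```

Because each column of `Y` is solved independently, this is equivalent to fitting a separate model per voxel. In practice `alpha` would be chosen by cross-validation, possibly per voxel.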

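Experimental Results

Research Questions

RQ1: Can models trained only on pixel-level features (with no semantic annotation) predict fMRI responses in human visual cortex as accurately as models based on hand-annotated labels?
RQ2: Do Fisher Vector and ConvNet features capture low-level and high-level visual representations that are consistent with patterns of human brain activity?
RQ3: Can encoding models reveal substructure within classical visual ROIs such as the EBA, indicating a functional subdivision?
RQ4: Are the functional clusters identified within the EBA spatially coherent, and are they consistent across subjects?
RQ5: Does a ConvNet model trained within a cluster predict activity in that voxel cluster significantly better than a model trained on the other cluster?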

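Main Findings

When predicting fMRI responses in high-level visual areas, the Fisher Vector and ConvNet models reach coefficients of determination (R²) comparable to earlier models built on hand-annotated semantic features.
The FV and ConvNet models predict activity not only in high-level visual areas but also in early and intermediate visual areas, which earlier models based on semantic annotation could not do.
K-means clustering of the ConvNet model weights reveals two stable, functionally distinct clusters within the EBA: one sensitive to whole-body motion, the other to single individuals.
The functional clusters are spatially separated on flattened cortical maps and occupy consistent anatomical locations in both subjects.
A ConvNet model trained within a cluster explains significantly more variance in that voxel cluster than a model trained on the other cluster (for example, for Subject 1, C1 explained 24.9% vs. 19.3% for C2; for Subject 2, C2 explained 23.0% vs. 16.2% for C1).
These results support the conclusion that the EBA contains two subregions that differ both functionally and spatially, each with its own response preference for visual stimuli.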
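
The clustering analysis behind these findings can be illustrated with a small sketch: each voxel is represented by its fitted weight vector, and K-means groups voxels with similar weights. Everything here is an assumption made for illustration. The weights are synthetic, `kmeans` is a plain Lloyd's-algorithm helper written for this example with a simple farthest-point initialization, and the two weight profiles are invented rather than taken from the paper.

```python
import numpy as np

def kmeans(points, k=2, n_iter=50):
    """Plain Lloyd's algorithm with deterministic farthest-point initialization."""
    centroids = [points[0]]
    for _ in range(1, k):
        # Squared distance from every point to its nearest chosen centroid.
        d = np.min([((points - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(points[np.argmax(d)])
    centroids = np.array(centroids)
    for _ in range(n_iter):
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid; keep the old one if its cluster is empty.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Synthetic voxel weights (assumption): 60 voxels, 10 features,
# drawn around two invented weight profiles.
rng = np.random.default_rng(0)
profile_a = np.concatenate([np.ones(5), np.zeros(5)])
profile_b = np.concatenate([np.zeros(5), np.ones(5)])
weights = np.vstack([
    profile_a + 0.1 * rng.normal(size=(30, 10)),
    profile_b + 0.1 * rng.normal(size=(30, 10)),
])

labels, centroids = kmeans(weights, k=2)
print(np.bincount(labels))
```

With real data, `weights` would be the transpose of the fitted weight matrix restricted to EBA voxels, and the resulting labels could then be projected onto a cortical map to check spatial separation.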

This review was generated by AI and checked by a human editor.