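[Paper Walkthrough] Toward a realistic model of speech processing in the brain with self-supervised learning
This paper shows that a self-supervised wav2vec 2.0 model trained on 600 hours of raw speech learns brain-like representations, aligns with the cortical hierarchy of speech processing, and develops language-specific representations similar to those seen in the human brain and in human behavior.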
Several deep neural networks have recently been shown to generate activations similar to those of the brain in response to the same input. These algorithms, however, remain largely implausible: they require (1) extraordinarily large amounts of data, (2) unobtainable supervised labels, (3) textual rather than raw sensory input, and / or (4) implausibly large memory (e.g. thousands of contextual words). These elements highlight the need to identify algorithms that, under these limitations, would suffice to account for both behavioral and brain responses. Focusing on the issue of speech processing, we here hypothesize that self-supervised algorithms trained on the raw waveform constitute a promising candidate. Specifically, we compare a recent self-supervised architecture, Wav2Vec 2.0, to the brain activity of 412 English, French, and Mandarin individuals recorded with functional Magnetic Resonance Imaging (fMRI), while they listened to ~1h of audio books. Our results are four-fold. First, we show that this algorithm learns brain-like representations with as little as 600 hours of unlabelled speech -- a quantity comparable to what infants can be exposed to during language acquisition. Second, its functional hierarchy aligns with the cortical hierarchy of speech processing. Third, different training regimes reveal a functional specialization akin to the cortex: Wav2Vec 2.0 learns sound-generic, speech-specific and language-specific representations similar to those of the prefrontal and temporal cortices. Fourth, we confirm the similarity of this specialization with the behavior of 386 additional participants. These elements, resulting from the largest neuroimaging benchmark to date, show how self-supervised learning can account for a rich organization of speech processing in the brain, and thus delineate a path to identify the laws of language acquisition which shape the human brain.
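Motivation and Objectives
- Motivate the search for biologically plausible AI that accounts for brain and behavioral responses under realistic constraints on data, labels, input, and memory.
- Test whether self-supervised learning on raw speech can produce brain-like speech representations.
- Map the model's functional hierarchy onto the cortical hierarchy of speech processing.
- Assess the sound-specific and language-specific representations in the model and compare them with human behavioral and brain data.
Proposed Method
- Train variants of wav2vec 2.0 on a limited amount (about 600 hours) of unlabelled speech (French, English, Mandarin) as well as on non-speech acoustic data.
- Use an encoding model (ridge regression on activations convolved with a hemodynamic response function, HRF) to compare model activations with the fMRI responses of 412 participants listening to about 1 hour of audiobooks.
- Evaluate several training regimes, including self-supervised training on non-native speech, native speech, and non-speech sounds, as well as supervised phoneme prediction.
- Evaluate predictive performance per brain region and per model layer in order to map model layers onto cortical speech regions.
- Run an ABX phoneme discrimination test with human participants and compare the results with model performance on native versus non-native language stimuli.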
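The encoding step in the list above (HRF convolution followed by ridge regression) can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' pipeline: the double-gamma HRF shape, the fixed penalty `alpha`, the contiguous train/test split, and the use of a per-voxel Pearson correlation as the "brain score" are all assumptions made here, and every function name is hypothetical.

```python
import math
import numpy as np

def canonical_hrf(tr, duration=32.0):
    """Double-gamma HRF sampled every `tr` seconds (assumed shape parameters)."""
    t = np.arange(0.0, duration, tr)
    peak = t ** 5 * np.exp(-t) / math.gamma(6)
    undershoot = t ** 15 * np.exp(-t) / math.gamma(16)
    h = peak - undershoot / 6.0
    return h / h.sum()

def convolve_hrf(features, tr):
    """Causally convolve each feature column (time x dims) with the HRF."""
    h = canonical_hrf(tr)
    n = features.shape[0]
    cols = [np.convolve(features[:, j], h)[:n] for j in range(features.shape[1])]
    return np.stack(cols, axis=1)

def brain_score(x_train, y_train, x_test, y_test, alpha=1.0):
    """Fit ridge on the training scans, return one correlation per voxel."""
    mu, sd = x_train.mean(0), x_train.std(0) + 1e-8
    xtr, xte = (x_train - mu) / sd, (x_test - mu) / sd
    y_mu = y_train.mean(0)
    # Closed-form ridge solution: (X'X + alpha I)^-1 X'Y
    gram = xtr.T @ xtr + alpha * np.eye(xtr.shape[1])
    weights = np.linalg.solve(gram, xtr.T @ (y_train - y_mu))
    pred = xte @ weights + y_mu
    pc, yc = pred - pred.mean(0), y_test - y_test.mean(0)
    denom = np.linalg.norm(pc, axis=0) * np.linalg.norm(yc, axis=0) + 1e-12
    return (pc * yc).sum(0) / denom

# Synthetic stand-ins: random "layer activations" and simulated voxel responses.
rng = np.random.default_rng(0)
n_scans, n_dims, tr = 200, 8, 2.0
activations = rng.standard_normal((n_scans, n_dims))
X = convolve_hrf(activations, tr)
true_weights = rng.standard_normal((n_dims, 3))
bold = np.empty((n_scans, 4))
bold[:, :3] = X @ true_weights + 0.1 * rng.standard_normal((n_scans, 3))
bold[:, 3] = rng.standard_normal(n_scans)  # a voxel unrelated to the features
scores = brain_score(X[:150], bold[:150], X[150:], bold[150:])
print(scores)
```

In practice the penalty would normally be chosen by cross-validation rather than fixed, and the features would come from a real model layer resampled to the scanner's repetition time.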
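Experimental Results
Research Questions
- RQ1: Does self-supervised learning on raw speech yield brain-like representations when training data is limited?
- RQ2: Does wav2vec 2.0 exhibit a functional hierarchy that aligns with the cortical hierarchy of speech processing?
- RQ3: Do sound-generic, speech-specific, and language-specific representations emerge in a way that resembles the auditory, speech, and language regions of the brain?
- RQ4: Are the brain-aligned representations language-specific, and do they match the phoneme discrimination patterns seen in human behavior?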
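One way to make RQ2 concrete is to ask, for every voxel, which model layer predicts it best, and then average that preferred layer within each region. The sketch below assumes a precomputed matrix of brain scores (layers x voxels); the numbers and the region labels `early` and `late` are toy placeholders chosen for illustration, not results from the paper, and the function names are hypothetical.

```python
import numpy as np

def preferred_layer(scores):
    """scores: (n_layers, n_voxels) brain scores -> best layer index per voxel."""
    return np.argmax(scores, axis=0)

def region_summary(best_layer, region_labels):
    """Average preferred layer within each labelled region."""
    labels = np.array(region_labels)
    return {
        region: float(best_layer[labels == region].mean())
        for region in sorted(set(region_labels))
    }

# Toy scores for 3 layers and 4 voxels (made-up values, for illustration only).
toy_scores = np.array([
    [0.30, 0.28, 0.05, 0.04],  # layer 0
    [0.20, 0.22, 0.10, 0.12],  # layer 1
    [0.10, 0.08, 0.20, 0.18],  # layer 2
])
toy_regions = ["early", "early", "late", "late"]

best = preferred_layer(toy_scores)
summary = region_summary(best, toy_regions)
print(best, summary)
```

A hierarchy aligned with the cortex would show the average preferred layer increasing from low-level to high-level regions.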
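Key Findings
- Self-supervised wav2vec 2.0 learns brain-like representations after training on 600 hours of speech.
- The model's functional hierarchy matches the cortical hierarchy of speech processing, from primary auditory areas to higher-level regions such as the STS and IFG.
- The model develops sound-generic, speech-specific, and language-specific representations similar to those of the human prefrontal and temporal cortices.
- Models trained on the listener's native language obtain higher brain scores than non-native models, and human ABX phoneme discrimination parallels the model's language specialization.
- Comparing human ABX behavioral results with the model shows that language-specific representations emerge from self-supervised learning.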
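For readers unfamiliar with ABX discrimination: given tokens A and B from two phoneme categories and a third token X from the same category as A, a trial counts as correct when X is closer to A than to B. The sketch below is a simplified illustration that assumes each token is a single fixed-size embedding compared with cosine distance; real evaluations on model activations typically compare frame sequences, and the prototypes and noise level here are synthetic placeholders.

```python
import numpy as np

def cosine_distance(u, v):
    """1 - cosine similarity between two embedding vectors."""
    return 1.0 - float(u @ v) / float(np.linalg.norm(u) * np.linalg.norm(v))

def abx_correct(a, b, x):
    """True if X (same category as A) is closer to A than to B."""
    return bool(cosine_distance(a, x) < cosine_distance(b, x))

def abx_accuracy(triplets):
    """Fraction of (A, B, X) triplets discriminated correctly."""
    return sum(abx_correct(a, b, x) for a, b, x in triplets) / len(triplets)

# Synthetic embeddings: two well-separated phoneme prototypes plus small noise.
rng = np.random.default_rng(0)
proto_a = np.array([1.0, 0.0, 0.0, 0.0])
proto_b = np.array([0.0, 1.0, 0.0, 0.0])

def sample(proto):
    return proto + 0.1 * rng.standard_normal(proto.shape)

triplets = [(sample(proto_a), sample(proto_b), sample(proto_a)) for _ in range(100)]
accuracy = abx_accuracy(triplets)
print(accuracy)
```

Running the same scoring on embeddings from a native-trained and a non-native-trained model gives the kind of comparison described in the findings above.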
This walkthrough was generated by AI and reviewed by a human editor.