QUICK REVIEW

[Paper Review] Toward a realistic model of speech processing in the brain with self-supervised learning

Juliette Millet, Charlotte Caucheteux|arXiv (Cornell University)|Jun 3, 2022

Neural Networks and Applications | Cited by 41

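One-line summary

This paper shows that a self-supervised wav2vec 2.0 model trained on 600 hours of raw speech learns brain-like representations, aligns with the cortical hierarchy of speech processing, and develops language-specific representations similar to those observed in human brains and behavior.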

ABSTRACT

Several deep neural networks have recently been shown to generate activations similar to those of the brain in response to the same input. These algorithms, however, remain largely implausible: they require (1) extraordinarily large amounts of data, (2) unobtainable supervised labels, (3) textual rather than raw sensory input, and / or (4) implausibly large memory (e.g. thousands of contextual words). These elements highlight the need to identify algorithms that, under these limitations, would suffice to account for both behavioral and brain responses. Focusing on the issue of speech processing, we here hypothesize that self-supervised algorithms trained on the raw waveform constitute a promising candidate. Specifically, we compare a recent self-supervised architecture, Wav2Vec 2.0, to the brain activity of 412 English, French, and Mandarin individuals recorded with functional Magnetic Resonance Imaging (fMRI), while they listened to ~1h of audio books. Our results are four-fold. First, we show that this algorithm learns brain-like representations with as little as 600 hours of unlabelled speech -- a quantity comparable to what infants can be exposed to during language acquisition. Second, its functional hierarchy aligns with the cortical hierarchy of speech processing. Third, different training regimes reveal a functional specialization akin to the cortex: Wav2Vec 2.0 learns sound-generic, speech-specific and language-specific representations similar to those of the prefrontal and temporal cortices. Fourth, we confirm the similarity of this specialization with the behavior of 386 additional participants. These elements, resulting from the largest neuroimaging benchmark to date, show how self-supervised learning can account for a rich organization of speech processing in the brain, and thus delineate a path to identify the laws of language acquisition which shape the human brain.

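Motivation and objectives

Motivate the search for biologically plausible AI that accounts for both brain and behavioral responses under constraints on data, labels, input modality, and memory.
Test whether self-supervised learning on raw speech can produce brain-like speech representations.
Map the model's functional hierarchy onto the cortical hierarchy of speech processing.
Evaluate language-specific and sound-specific representations in the model and compare them with human behavioral and brain data.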

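Proposed method

Train wav2vec 2.0 variants on a limited amount (about 600 hours) of unlabelled speech (French, English, Mandarin) and on non-speech audio.
Compare model activations with the fMRI responses of 412 participants listening to about 1 hour of audiobooks, using an encoding model: ridge regression applied after HRF convolution.
Evaluate multiple training regimes, including self-supervised training on native speech, non-native speech, and non-speech audio, as well as supervised phoneme prediction.
Evaluate prediction accuracy per brain region and per layer to map model layers onto cortical speech areas.
Run an ABX phoneme discrimination test and compare model performance on native versus non-native stimuli with that of human listeners.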
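
The encoding step described above can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' pipeline: the single-gamma `toy_hrf`, the fixed `alpha`, the use of Pearson correlation as the per-voxel score, and every function name are assumptions introduced here. Activations are assumed to be already resampled to one row per fMRI volume.

```python
import numpy as np

def toy_hrf(tr=1.5, duration=24.0):
    """Simplified single-gamma HRF sampled every `tr` seconds (assumed shape)."""
    t = np.arange(0.0, duration, tr)
    h = t ** 5 * np.exp(-t)  # gamma-like curve peaking around 5 s
    return h / h.sum()

def convolve_hrf(features, hrf):
    """Convolve each feature column with the HRF, keeping the original length."""
    n = features.shape[0]
    cols = [np.convolve(features[:, j], hrf)[:n] for j in range(features.shape[1])]
    return np.stack(cols, axis=1)

def ridge_fit(X, Y, alpha):
    """Closed-form ridge solution W = (X^T X + alpha I)^-1 X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)

def brain_score(X_train, Y_train, X_test, Y_test, alpha=1.0):
    """Pearson correlation between predicted and held-out responses, per voxel."""
    W = ridge_fit(X_train, Y_train, alpha)
    pred = X_test @ W
    pred_c = pred - pred.mean(axis=0)
    y_c = Y_test - Y_test.mean(axis=0)
    denom = np.linalg.norm(pred_c, axis=0) * np.linalg.norm(y_c, axis=0)
    return (pred_c * y_c).sum(axis=0) / denom

# Synthetic stand-ins for model activations and voxel responses.
rng = np.random.default_rng(0)
n_samples, n_features, n_voxels = 400, 8, 3
activations = rng.standard_normal((n_samples, n_features))
X = convolve_hrf(activations, toy_hrf())
true_W = rng.standard_normal((n_features, n_voxels))
Y = X @ true_W + 0.1 * rng.standard_normal((n_samples, n_voxels))

# Fit on the first 300 volumes, score on the remaining 100.
scores = brain_score(X[:300], Y[:300], X[300:], Y[300:], alpha=1.0)
```

Repeating this fit for each layer's activations and each brain region would give the layer-by-region score table that the layer-to-cortex mapping relies on. In practice the regularization strength is usually chosen by cross-validation rather than fixed.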

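Experimental results

Research questions

RQ1: Can self-supervised learning on raw speech produce brain-like representations under limited data?
RQ2: Does wav2vec 2.0 exhibit a functional hierarchy that aligns with the cortical hierarchy of speech processing?
RQ3: Do sound-generic, speech-specific, and language-specific representations emerge in the same way as in the auditory, speech, and language areas of the brain?
RQ4: Are the brain-aligned representations language-specific, and do they match human behavioral patterns of phoneme discrimination?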

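Key findings

Self-supervised wav2vec 2.0 learns brain-like representations after training on 600 hours of speech.
The model's functional hierarchy aligns with the cortical hierarchy of speech processing, from primary auditory areas up to higher-order areas such as the STS and IFG.
The model develops sound-generic, speech-specific, and language-specific representations similar to those of the human prefrontal and temporal cortices.
Models trained on the listener's native language reach higher brain scores than non-native models, and in ABX phoneme discrimination the language specificity seen in humans correlates with the language specialization of the models.
The ABX behavioral results for humans and models indicate that language-specific representations emerge through self-supervised learning.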
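
For readers unfamiliar with ABX discrimination, the scoring rule can be sketched as below: given items A and X from one phoneme category and B from another, a trial counts as correct when X lies closer to A than to B. This is a simplified illustration, not the paper's evaluation code. It assumes each stimulus is a single fixed-length vector compared with Euclidean distance, whereas speech evaluations often compare frame sequences with an alignment-based distance. The function names and toy embeddings are hypothetical.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def abx_accuracy(cat_a, cat_b, distance=euclidean):
    """Fraction of (A, B, X) triplets where X is closer to A than to B.

    A and X are distinct items from `cat_a`; B comes from `cat_b`.
    Ties count as half correct.
    """
    correct = 0.0
    total = 0
    for i, a in enumerate(cat_a):
        for j, x in enumerate(cat_a):
            if i == j:
                continue  # A and X must be different tokens
            for b in cat_b:
                d_ax = distance(a, x)
                d_bx = distance(b, x)
                if d_ax < d_bx:
                    correct += 1.0
                elif d_ax == d_bx:
                    correct += 0.5
                total += 1
    if total == 0:
        raise ValueError("need at least two items in cat_a and one in cat_b")
    return correct / total

# Toy embeddings (hypothetical): a contrast the model separates well...
separated_a = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)]
separated_b = [(1.0, 1.0), (1.1, 1.0), (1.0, 1.1)]
# ...and a contrast where the other category sits between the tokens.
confusable_a = [(0.0, 0.0), (1.0, 0.0)]
confusable_b = [(0.5, 0.0)]

acc_separated = abx_accuracy(separated_a, separated_b)
acc_confusable = abx_accuracy(confusable_a, confusable_b)
```

Computing this accuracy for the same contrasts with models trained on different languages, and comparing the pattern with human listeners' accuracy, is the kind of comparison the native versus non-native findings describe.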

This review was generated by AI and checked by a human editor.