QUICK REVIEW

[Paper Review] Toward a realistic model of speech processing in the brain with self-supervised learning

Juliette Millet, Charlotte Caucheteux|arXiv (Cornell University)|2022. 06. 03.

Neural Networks and Applications|Citations: 41

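One-line summary

This paper shows that a self-supervised wav2vec 2.0 model trained on 600 hours of raw speech learns brain-like representations, aligns with the cortical hierarchy of speech processing, and develops language-specific representations similar to those seen in human brain activity and behavior.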

ABSTRACT

Several deep neural networks have recently been shown to generate activations similar to those of the brain in response to the same input. These algorithms, however, remain largely implausible: they require (1) extraordinarily large amounts of data, (2) unobtainable supervised labels, (3) textual rather than raw sensory input, and / or (4) implausibly large memory (e.g. thousands of contextual words). These elements highlight the need to identify algorithms that, under these limitations, would suffice to account for both behavioral and brain responses. Focusing on the issue of speech processing, we here hypothesize that self-supervised algorithms trained on the raw waveform constitute a promising candidate. Specifically, we compare a recent self-supervised architecture, Wav2Vec 2.0, to the brain activity of 412 English, French, and Mandarin individuals recorded with functional Magnetic Resonance Imaging (fMRI), while they listened to ~1h of audio books. Our results are four-fold. First, we show that this algorithm learns brain-like representations with as little as 600 hours of unlabelled speech -- a quantity comparable to what infants can be exposed to during language acquisition. Second, its functional hierarchy aligns with the cortical hierarchy of speech processing. Third, different training regimes reveal a functional specialization akin to the cortex: Wav2Vec 2.0 learns sound-generic, speech-specific and language-specific representations similar to those of the prefrontal and temporal cortices. Fourth, we confirm the similarity of this specialization with the behavior of 386 additional participants. These elements, resulting from the largest neuroimaging benchmark to date, show how self-supervised learning can account for a rich organization of speech processing in the brain, and thus delineate a path to identify the laws of language acquisition which shape the human brain.

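Research motivation and goals

Identify biologically plausible algorithms that account for both brain and behavioral responses under realistic limits on data, labels, input modality, and memory.
Test whether self-supervised learning on raw speech yields brain-like speech representations.
Map the model's functional hierarchy onto the cortical hierarchy of speech processing.
Evaluate the model's speech-specific and language-specific representations and compare them with human behavioral and brain data.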

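Proposed method

Train wav2vec 2.0 variants on a limited amount of unlabelled data: about 600 hours of speech (French, English, or Mandarin) and non-speech audio.
Compare model activations with the fMRI responses of 412 participants listening to ~1 hour of audiobooks, using an encoding model that applies HRF convolution followed by ridge regression.
Evaluate several training regimes: self-supervised training on non-native speech, native speech, or non-speech audio, and supervised phoneme prediction.
Map model layers to cortical speech regions by evaluating predictive power per brain region and per layer.
Run an ABX phoneme discrimination test with human participants and compare their results with model performance on native and non-native stimuli.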
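
The encoding-model step above (HRF convolution, then ridge regression, then a per-voxel score) can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the authors' pipeline: the double-gamma HRF parameters, the fixed `alpha`, the train/test split, and all function names are assumptions made for the example, and the "activations" are random numbers standing in for wav2vec 2.0 features already resampled to the scan rate. The score used here is the Pearson correlation between predicted and held-out responses.

```python
import math
import numpy as np


def double_gamma_hrf(tr, duration=32.0):
    """Double-gamma HRF sampled every `tr` seconds (illustrative parameters)."""
    t = np.arange(0.0, duration, tr)
    peak = t ** 5 * np.exp(-t) / math.factorial(5)          # gamma pdf, shape 6
    undershoot = t ** 15 * np.exp(-t) / math.factorial(15)  # gamma pdf, shape 16
    hrf = peak - undershoot / 6.0
    return hrf / hrf.sum()


def convolve_hrf(features, hrf):
    """Convolve each feature column with the HRF, truncated to the input length."""
    n = features.shape[0]
    return np.stack([np.convolve(col, hrf)[:n] for col in features.T], axis=1)


def fit_ridge(X, Y, alpha):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)


def brain_score(Y_true, Y_pred):
    """Pearson correlation between true and predicted responses, per voxel."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.linalg.norm(yt, axis=0) * np.linalg.norm(yp, axis=0)
    return (yt * yp).sum(axis=0) / denom


# Synthetic stand-ins: random "activations" and simulated voxel responses.
rng = np.random.default_rng(0)
n_scans, n_feat, n_vox = 400, 8, 5
acts = rng.standard_normal((n_scans, n_feat))
X = convolve_hrf(acts, double_gamma_hrf(tr=2.0))
W_true = rng.standard_normal((n_feat, n_vox))
Y = X @ W_true + 0.5 * rng.standard_normal((n_scans, n_vox))

# Fit on the first 300 scans, score on the held-out 100.
train, test = slice(0, 300), slice(300, 400)
W = fit_ridge(X[train], Y[train], alpha=1.0)
scores = brain_score(Y[test], X[test] @ W)
print(scores)
```

Repeating the fit once per model layer and recording, for each voxel, which layer scores best is one simple way to approximate the layer-to-region mapping described in the list; in practice `alpha` would be chosen by cross-validation rather than fixed.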

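Experimental results

Research questions

RQ1: Can self-supervised learning on raw speech produce brain-like representations from limited data?
RQ2: Does wav2vec 2.0 show a functional hierarchy that aligns with the cortical hierarchy of speech processing?
RQ3: Do sound-generic, speech-specific, and language-specific representations emerge in a way that parallels the brain's auditory, speech, and language regions?
RQ4: Are the brain-aligned representations language-specific, and do they correspond to human behavioral phoneme discrimination patterns?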

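Key results

Self-supervised wav2vec 2.0 learns brain-like representations after training on 600 hours of speech.
The model's functional hierarchy matches the cortical hierarchy of speech processing, running from primary auditory areas to higher-level regions such as the STS and IFG.
The model develops sound-generic, speech-specific, and language-specific representations similar to those of the human prefrontal and temporal cortices.
Native-language models yield higher brain scores than non-native models, and human ABX phoneme discrimination parallels the models' language specificity.
Comparing the models with human behavioral ABX results shows that language-specific representations emerge from self-supervised learning.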
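
For readers unfamiliar with the ABX test mentioned in the results above, the scoring rule can be sketched as follows: given a triplet where A and X belong to the same phoneme category and B to a different one, a response counts as correct when X is closer to A than to B. The snippet below is a simplified illustration under stated assumptions: it uses cosine distance on fixed-length toy vectors, whereas real evaluations operate on frame sequences from the model, and the names `abx_accuracy`, `cosine_distance`, and `sample` are hypothetical.

```python
import numpy as np


def cosine_distance(u, v):
    """Cosine distance between two vectors (assumed distance for this sketch)."""
    return 1.0 - float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))


def abx_accuracy(triplets, distance=cosine_distance):
    """Fraction of (A, B, X) triplets where X is closer to A than to B.

    A and X are assumed to share a category; B comes from another category.
    """
    correct = [distance(a, x) < distance(b, x) for a, b, x in triplets]
    return sum(correct) / len(correct)


# Toy data: two well-separated "phoneme" categories in a 3-D embedding space.
rng = np.random.default_rng(0)
center_a = np.array([1.0, 0.0, 0.0])
center_b = np.array([0.0, 1.0, 0.0])


def sample(center):
    """Draw a noisy token around a category center."""
    return center + 0.1 * rng.standard_normal(3)


triplets = [(sample(center_a), sample(center_b), sample(center_a))
            for _ in range(100)]
accuracy = abx_accuracy(triplets)
print(accuracy)
```

Running the same scoring on embeddings from a native-language model and a non-native model, over the same triplets, gives the kind of model-side comparison that the review sets against human ABX behavior.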

This review was generated by AI and reviewed by a human editor.