In their latest research on speech processing, Meta AI researchers study the similarities between deep learning models and the human brain

This article is written as a summary by Marktechpost Staff based on the research paper 'Toward a realistic model of speech processing in the brain with self-supervised learning'. All credit for this research goes to the researchers on this project.

Over the past decade, the performance of deep neural networks has improved dramatically. Algorithms for object categorization, text translation, and speech recognition now approach human performance. The representations these algorithms learn have repeatedly been shown to correlate with those of the brain, suggesting that they converge toward brain-like computations.

However, this convergence should not obscure the important differences that remain between these deep learning models and the brain. The comparisons made so far rely on models trained with massive amounts of data, supervised labels that are uncommon in human experience, textual rather than raw sensory data, and/or very large memory. These disparities highlight the need for learning architectures and objectives that can account for both behavior and brain responses under these four constraints.

Meta AI researchers suggest in a recent study that newer self-supervised architectures trained on raw sensory inputs are promising candidates. The team focused on Wav2Vec 2.0, an architecture that stacks convolutional and transformer layers to predict a quantization of the latent representations of the speech waveform. Wav2Vec 2.0 was trained on 600 hours of speech, roughly what infants are exposed to during the early stages of language acquisition.

The researchers compared this model to the brains of 412 healthy people (351 English speakers, 28 French speakers, and 33 Mandarin speakers) whose brain activity was recorded with functional magnetic resonance imaging (fMRI) while they listened to audio books in their native language for about an hour.


The researchers compared brain activity with each layer of the Wav2Vec 2.0 model, as well as with several variants: a random (untrained) Wav2Vec 2.0 model, a model trained on 600 h of non-speech sounds, a model trained on 600 h of non-native speech, a model trained on 600 h of native speech, and a model trained directly on speech-to-text in the participants' native language.
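The layer-by-layer comparison described above can be illustrated with a minimal sketch. It assumes a common approach in this line of work: fit a ridge regression from a layer's activations to each voxel's response, then score the layer by the correlation between predicted and held-out responses. The function names (`ridge_fit`, `brain_score`), the regularization strength, and all data below are hypothetical stand-ins, not the authors' pipeline; in particular, the sketch uses synthetic arrays rather than real activations or fMRI recordings and omits any temporal alignment to the slow fMRI signal.

```python
import numpy as np

def ridge_fit(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    n_features = X.shape[1]
    gram = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ Y)

def brain_score(X_train, Y_train, X_test, Y_test, alpha=1.0):
    """Per-voxel Pearson correlation between predicted and held-out responses."""
    # Center with training statistics so no intercept column is needed.
    x_mean, y_mean = X_train.mean(axis=0), Y_train.mean(axis=0)
    W = ridge_fit(X_train - x_mean, Y_train - y_mean, alpha)
    Y_pred = (X_test - x_mean) @ W + y_mean
    pred_c = Y_pred - Y_pred.mean(axis=0)
    true_c = Y_test - Y_test.mean(axis=0)
    denom = np.linalg.norm(pred_c, axis=0) * np.linalg.norm(true_c, axis=0)
    return (pred_c * true_c).sum(axis=0) / denom

# Synthetic stand-ins (assumption): rows are time samples, columns are
# activation features (X) or voxels (Y).
rng = np.random.default_rng(0)
n_train, n_test, n_features, n_voxels = 200, 100, 16, 5
true_W = rng.normal(size=(n_features, n_voxels))

def simulate(n):
    X = rng.normal(size=(n, n_features))
    Y = X @ true_W + 0.5 * rng.normal(size=(n, n_voxels))
    return X, Y

X_train, Y_train = simulate(n_train)
X_test, Y_test = simulate(n_test)

# A "layer" whose activations actually drive the simulated voxels.
informative = brain_score(X_train, Y_train, X_test, Y_test)

# A "layer" of unrelated noise, playing the role of a poor candidate model.
noise_train = rng.normal(size=(n_train, n_features))
noise_test = rng.normal(size=(n_test, n_features))
uninformative = brain_score(noise_train, Y_train, noise_test, Y_test)

print("informative layer:  ", np.round(informative, 2))
print("uninformative layer:", np.round(uninformative, 2))
```

Repeating this scoring for every layer and every model variant would give one score per (model, layer, voxel), which is the kind of quantity such comparisons rest on.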

The experiments produced four main findings. First, Wav2Vec 2.0 uses self-supervised learning to acquire latent representations of the speech waveform that are similar to those seen in the human brain. Second, the functional hierarchy of its transformer layers aligns with the cortical hierarchy of speech in the brain, revealing the cortical organization of speech processing in unprecedented detail. Third, the model's representations of hearing, speech, and language converge with those of the human brain. Fourth, behavioral comparisons with the results of a speech sound discrimination task performed by 386 additional participants confirm this shared language specialization.
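One simple way to picture the second finding is to ask, for each voxel, which model layer predicts it best. The sketch below assumes a table of per-layer, per-voxel scores already exists; the numbers are made up purely for illustration and are not results from the study, and the variable names are hypothetical.

```python
import numpy as np

# Made-up scores for illustration only: rows are model layers (shallow to
# deep), columns are voxels. Real values would come from a fitted mapping.
scores = np.array([
    [0.30, 0.12, 0.05],  # layer 0
    [0.22, 0.25, 0.10],  # layer 1
    [0.10, 0.20, 0.28],  # layer 2
])

# For each voxel (column), the index of the layer with the highest score.
best_layer = scores.argmax(axis=0)

for voxel, layer in enumerate(best_layer):
    print(f"voxel {voxel}: best predicted by layer {layer}")
```

If voxels along the cortical speech pathway are best predicted by progressively deeper layers, the two hierarchies line up in the sense the finding describes.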


Human infants learn to communicate with little or no supervision. Young brains need only a few hundred hours of speech to learn how to put words together in the language(s) of their social group. Meta AI researchers investigated whether self-supervised learning on a limited amount of speech is sufficient to produce a model that is functionally equivalent to speech perception in the human brain. The researchers used three curated datasets of French, English, and Mandarin speech to train several variants of Wav2Vec 2.0, then compared their activations to those of a large group of French, English, and Mandarin speakers recorded with fMRI while passively listening to audio stories. The results show that this self-supervised model learns representations that map linearly onto a remarkably distributed set of brain regions, a hierarchy that matches the cortical hierarchy, and language-specific properties.

James G. Williams