Decoding speech perception from non-invasive brain recordings
Nature Machine Intelligence
Article
https://doi.org/10.1038/s42256-023-00714-5
Received: 20 September 2022
Accepted: 4 August 2023
Published online: 5 October 2023
Alexandre Défossez1, Charlotte Caucheteux1,2, Jérémy Rapin1, Ori Kabeli3 &
Jean-Rémi King1,4
Decoding speech from brain activity is a long-awaited goal in both healthcare
and neuroscience. Invasive devices have recently led to major milestones
in this regard: deep-learning algorithms trained on intracranial recordings
can now start to decode elementary linguistic features such as letters, words
and audio-spectrograms. However, extending this approach to natural
speech and non-invasive brain recordings remains a major challenge.
Here we introduce a model trained with contrastive learning to decode
self-supervised representations of perceived speech from the non-invasive
recordings of a large cohort of healthy individuals. To evaluate this approach,
we curate and integrate four public datasets, encompassing 175 volunteers
recorded with magneto-encephalography or electro-encephalography
while they listened to short stories and isolated sentences. The results show
that our model can identify, from 3 seconds of magneto-encephalography
signals, the corresponding speech segment with up to 41% accuracy out
of more than 1,000 distinct possibilities on average across participants,
and with up to 80% in the best participants—a performance that allows the
decoding of words and phrases absent from the training set. The comparison
of our model with a variety of baselines highlights the importance of a
contrastive objective, pretrained representations of speech and a common
convolutional architecture simultaneously trained across multiple
participants. Finally, the analysis of the decoder’s predictions suggests that
they primarily depend on lexical and contextual semantic representations.
Overall, this effective decoding of perceived speech from non-invasive
recordings delineates a promising path to decode language from brain
activity, without putting patients at risk of brain surgery.
Every year, traumatic brain injuries, strokes and neurodegenerative
diseases cause thousands of patients to lose their ability to speak or even
communicate1–6. Brain–computer interfaces (BCIs) have raised high
expectations for the detection4,5,7,8 and restoration of communication
abilities in such patients9–14. Over recent decades, several teams have
used BCIs to efficiently decode phonemes, speech sounds15,16, hand
gestures11,12 and articulatory movements13,17 from electrodes implanted
in the cortex or over its surface. For instance, Willett et al. 12 decoded
90 characters per minute (with a 94% accuracy, that is, roughly 15–18
words per minute) from a patient with a spinal-cord injury, recorded
in the motor cortex during 10 hours of writing sessions. Similarly,
Moses et al. 13 decoded 15.2 words per minute (with 74.4% accuracy,
and using a vocabulary of 50 words) in a patient with anarthria and a
BCI implanted in the sensori-motor cortex, recorded over 48 sessions
1Meta AI, Paris, France. 2Inria Saclay, Saclay, France. 3Meta AI, Tel Aviv, Israel. 4LSP, Département d’Etudes Cognitives, École Normale Supérieure,
PSL University, Paris, France.
Nature Machine Intelligence | Volume 5 | October 2023 | 1097–1107
spanning over 22 hours. Finally, Metzger et al. 18 recently showed that a
patient with severe limb and vocal-tract paralysis and a BCI implanted
in the sensori-motor cortex could efficiently spell words using a code
word that represented each English letter (for example, ‘alpha’ for ‘a’):
this approach leads to a character error rate of 6.13% and a speed of
29.4 characters per minute, and hence starts to provide a viable communication channel for such patients.
However, such invasive recordings face a major practical challenge: these high-quality signals require brain surgery and can be difficult to maintain chronically. Several laboratories have thus focused on
decoding language from non-invasive recordings of brain activity such
as magneto-encephalography (MEG) and electro-encephalography
(EEG). MEG and EEG are sensitive to macroscopic changes of electric
and magnetic signals elicited in the cortex, and can be acquired with
a safe and potentially wearable set-up19. However, these two devices
produce notoriously noisy signals that vary greatly across sessions
and across individuals20–22. It is thus common to engineer pipelines
that output hand-crafted features, which, in turn, can be learned by a
decoder trained on a single participant23–28.
In sum, decoding language from brain activity is, so far, limited
either to invasive recordings or to impractical tasks. Interestingly, both
of these approaches tend to follow a similar method: that is, (1) training
a model on a single participant and (2) aiming to decode a limited set
of interpretable features (Mel spectrogram, letters, phonemes, small
set of words).
Instead, here we propose to decode speech from non-invasive
brain recordings by using (1) a single architecture trained across a large
cohort of participants and (2) deep representations of speech learned
with self-supervised learning on a large quantity of speech data. We
focus the present work on speech perception in healthy volunteers
rather than speech production in patients to design a deep-learning
architecture that effectively addresses two core challenges: (1) the
fact that non-invasive brain recording can be extremely noisy and
variable across trials and across participants and (2) the fact that the
nature and format of language representations in the brain remain
largely unknown. For this, we introduce a ‘brain module’ and train
it with contrastive learning to align its representations to those of a
pretrained ‘speech module’, namely, wav2vec 2.0 (ref. 29) (Fig. 1). We
train a single model for all participants, sharing most of the weights
except for one participant-specific layer. Figure 1 provides a broad
overview of our approach.
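The two ingredients named above, a participant-specific layer in front of shared weights and a contrastive objective that aligns brain and speech representations, can be sketched in a few lines. This is a minimal NumPy illustration under stated assumptions, not the implementation used in this work: the function names `participant_layer` and `contrastive_loss`, the array shapes, the cosine normalisation and the `temperature` value are all hypothetical, and the loss shown is a generic InfoNCE-style objective in which each brain segment must pick out its own speech segment among the others in the batch.

```python
import numpy as np


def participant_layer(x, weights, participant_ids):
    """Apply one linear map per participant across the sensor dimension.

    x:               (batch, channels, time) brain recordings
    weights:         (n_participants, out_channels, channels), one matrix each
    participant_ids: (batch,) integer index of the participant for each sample
    All other layers of the (hypothetical) brain module would be shared.
    """
    w = weights[participant_ids]            # (batch, out_channels, channels)
    return np.einsum("boc,bct->bot", w, x)  # (batch, out_channels, time)


def contrastive_loss(brain_z, speech_z, temperature=0.1):
    """InfoNCE-style loss over a batch of matched (brain, speech) segments.

    brain_z, speech_z: (batch, features, time) latent representations.
    Row i of brain_z is the positive pair of row i of speech_z; every other
    row in the batch acts as a negative. Normalisation and temperature are
    illustrative assumptions.
    """
    batch = brain_z.shape[0]
    z = brain_z.reshape(batch, -1)
    y = speech_z.reshape(batch, -1)
    z = z / (np.linalg.norm(z, axis=1, keepdims=True) + 1e-8)
    y = y / (np.linalg.norm(y, axis=1, keepdims=True) + 1e-8)
    logits = (z @ y.T) / temperature                      # (batch, batch)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy with the matching segment (the diagonal) as the target.
    return float(-np.mean(np.diag(log_probs)))
```

In such a set-up, the speech representations would come from a frozen pretrained model and only the brain-side parameters would receive gradients; the sketch omits training entirely and only shows how the loss rewards correctly matched pairs.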
To validate our approach, we curate and integrate four public MEG
and EEG datasets, encompassing the brain activity of 175 participants
passively listening to sentences of short stories (see Table 1 for details).
For each MEG and EEG recording, we evaluate our model on its ability
to accurately identify the corresponding audio segment from a large
set of more than 1,500 segments (that is, ‘zero shot’ decoding).
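The segment-identification evaluation described above can be illustrated with a short retrieval sketch. It assumes, hypothetically, that each brain recording has already been mapped to a predicted latent vector and that every candidate audio segment has a latent vector of the same size; the function name `top_k_accuracy`, the use of cosine similarity and the flattened vector shapes are illustrative choices rather than details taken from this work.

```python
import numpy as np


def top_k_accuracy(predicted, candidates, true_idx, k=1):
    """Fraction of trials whose true segment ranks among the k best matches.

    predicted:  (n_trials, dim) latents decoded from brain recordings
    candidates: (n_candidates, dim) latents of all candidate audio segments
    true_idx:   (n_trials,) index of the segment actually heard on each trial
    """
    p = predicted / (np.linalg.norm(predicted, axis=1, keepdims=True) + 1e-8)
    c = candidates / (np.linalg.norm(candidates, axis=1, keepdims=True) + 1e-8)
    scores = p @ c.T                                   # cosine similarities
    true_scores = scores[np.arange(len(true_idx)), true_idx]
    # Rank of the true segment = number of candidates scoring strictly higher.
    rank = (scores > true_scores[:, None]).sum(axis=1)
    return float(np.mean(rank < k))
```

With 1,500 candidates, a decoder that guesses at random would be correct on roughly 1 trial in 1,500 at k = 1, which is the chance level against which identification accuracy should be read. Because the candidate set can contain segments never seen during training, the same procedure applies to held-out sentences.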
This study provides three main contributions for the development
of a non-invasive method to decode speech from brain activity. First,
it shows how pretrained speech models can facilitate the decoding of
speech in the brain, without exposing volunteers to a tedious repetition of every single word targeted by the decoder. Second, it shows
how specific design choices—including contrastive learning and our
multi-participant architecture (...truncated)