Decoding speech perception from non-invasive brain recordings

Nature Machine Intelligence | Article | https://doi.org/10.1038/s42256-023-00714-5
Received: 20 September 2022. Accepted: 4 August 2023. Published online: 5 October 2023.

Alexandre Défossez1, Charlotte Caucheteux1,2, Jérémy Rapin1, Ori Kabeli3 & Jean-Rémi King1,4

Decoding speech from brain activity is a long-awaited goal in both healthcare and neuroscience. Invasive devices have recently led to major milestones in this regard: deep-learning algorithms trained on intracranial recordings can now start to decode elementary linguistic features such as letters, words and audio-spectrograms. However, extending this approach to natural speech and non-invasive brain recordings remains a major challenge. Here we introduce a model trained with contrastive learning to decode self-supervised representations of perceived speech from the non-invasive recordings of a large cohort of healthy individuals. To evaluate this approach, we curate and integrate four public datasets, encompassing 175 volunteers recorded with magneto-encephalography or electro-encephalography while they listened to short stories and isolated sentences. The results show that our model can identify, from 3 seconds of magneto-encephalography signals, the corresponding speech segment with up to 41% accuracy out of more than 1,000 distinct possibilities on average across participants, and with up to 80% in the best participants, a performance that allows the decoding of words and phrases absent from the training set. The comparison of our model with a variety of baselines highlights the importance of a contrastive objective, pretrained representations of speech and a common convolutional architecture simultaneously trained across multiple participants. Finally, the analysis of the decoder’s predictions suggests that they primarily depend on lexical and contextual semantic representations.
Overall, this effective decoding of perceived speech from non-invasive recordings delineates a promising path to decode language from brain activity, without putting patients at risk of brain surgery.

1 Meta AI, Paris, France. 2 Inria Saclay, Saclay, France. 3 Meta AI, Tel Aviv, Israel. 4 LSP, Département d’Etudes Cognitives, École Normale Supérieure, PSL University, Paris, France.
Nature Machine Intelligence | Volume 5 | October 2023 | 1097–1107

Every year, traumatic brain injuries, strokes and neurodegenerative diseases cause thousands of patients to lose their ability to speak or even communicate1–6. Brain–computer interfaces (BCIs) have raised high expectations for the detection4,5,7,8 and restoration of communication abilities in such patients9–14. Over recent decades, several teams have used BCIs to efficiently decode phonemes, speech sounds15,16, hand gestures11,12 and articulatory movements13,17 from electrodes implanted in the cortex or over its surface. For instance, Willett et al. 12 decoded 90 characters per minute (with a 94% accuracy, that is, roughly 15–18 words per minute) from a patient with a spinal-cord injury, recorded in the motor cortex during 10 hours of writing sessions. Similarly, Moses et al. 13 decoded 15.2 words per minute (with 74.4% accuracy, and using a vocabulary of 50 words) in a patient with anarthria and a BCI implanted in the sensori-motor cortex, recorded over 48 sessions spanning over 22 hours. Finally, Metzger et al. 18 recently showed that a patient with severe limb and vocal-tract paralysis and a BCI implanted in the sensori-motor cortex could efficiently spell words using a code word that represented each English letter (for example, ‘alpha’ for ‘a’): this approach leads to a character error rate of 6.13% and a speed of 29.4 characters per minute, and hence starts to provide a viable communication channel for such patients.
However, such invasive recordings face a major practical challenge: these high-quality signals require brain surgery and can be difficult to maintain chronically. Several laboratories have thus focused on decoding language from non-invasive recordings of brain activity such as magneto-encephalography (MEG) and electro-encephalography (EEG). MEG and EEG are sensitive to macroscopic changes of electric and magnetic signals elicited in the cortex, and can be acquired with a safe and potentially wearable set-up19. However, these two devices produce notoriously noisy signals that vary greatly across sessions and across individuals20–22. It is thus common to engineer pipelines that output hand-crafted features, which, in turn, can be learned by a decoder trained on a single participant23–28. In sum, decoding language from brain activity is, so far, limited either to invasive recordings or to impractical tasks.

Interestingly, both of these approaches tend to follow a similar method: that is, (1) training a model on a single participant and (2) aiming to decode a limited set of interpretable features (Mel spectrogram, letters, phonemes, small set of words). Instead, here we propose to decode speech from non-invasive brain recordings by using (1) a single architecture trained across a large cohort of participants and (2) deep representations of speech learned with self-supervised learning on a large quantity of speech data.

We focus the present work on speech perception in healthy volunteers rather than speech production in patients to design a deep-learning architecture that effectively addresses two core challenges: (1) the fact that non-invasive brain recordings can be extremely noisy and variable across trials and across participants and (2) the fact that the nature and format of language representations in the brain remain largely unknown.
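As an illustration of the contrastive objective and the segment-identification metric mentioned in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes that brain and speech embeddings live in a shared vector space, that similarity is measured with cosine similarity, and that a temperature of 0.1 scales the logits; all function names are hypothetical, and the embeddings are synthetic random vectors rather than real MEG/EEG recordings or wav2vec 2.0 outputs.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    # Scale a vector to unit length (assumption: cosine similarity is used).
    norm = math.sqrt(dot(v, v)) or 1.0
    return [a / norm for a in v]


def similarity_matrix(brain, speech):
    # Entry [i][j] compares brain embedding i with candidate speech segment j.
    b = [normalize(x) for x in brain]
    s = [normalize(y) for y in speech]
    return [[dot(x, y) for y in s] for x in b]


def contrastive_loss(sim, temperature=0.1):
    # Mean cross-entropy over rows; the matching segment (diagonal) is the
    # positive and every other segment in the batch acts as a negative.
    total = 0.0
    for i, row in enumerate(sim):
        logits = [v / temperature for v in row]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(sim)


def top_k_accuracy(sim, k=1):
    # Fraction of brain segments whose true speech segment ranks in the top k.
    hits = 0
    for i, row in enumerate(sim):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        hits += i in ranked[:k]
    return hits / len(sim)


# Synthetic demo: "brain" embeddings are noisy copies of "speech" embeddings.
rng = random.Random(0)
n_segments, dim = 8, 16
speech = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_segments)]
brain = [[x + rng.gauss(0, 0.5) for x in row] for row in speech]
sim = similarity_matrix(brain, speech)
print("loss:", round(contrastive_loss(sim), 3))
print("top-1 accuracy:", top_k_accuracy(sim, k=1))
```

In this sketch, training would lower `contrastive_loss` by pulling each brain embedding towards its own speech segment, while `top_k_accuracy` mirrors the idea of picking the correct segment out of a pool of candidates.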
For this, we introduce a ‘brain module’ and train it with contrastive learning to align its representations to those of a pretrained ‘speech module’, namely, wav2vec 2.0 (ref. 29) (Fig. 1). We train a single model for all participants, sharing most of the weights except for one participant-specific layer. Figure 1 provides a broad overview of our approach.

To validate our approach, we curate and integrate four public MEG and EEG datasets, encompassing the brain activity of 175 participants passively listening to sentences of short stories (see Table 1 for details). For each MEG and EEG recording, we evaluate our model on its ability to accurately identify the corresponding audio segment from a large set of more than 1,500 segments (that is, ‘zero shot’ decoding).

This study provides three main contributions for the development of a non-invasive method to decode speech from brain activity. First, it shows how pretrained speech models can improve the decoding of speech from brain activity, without exposing volunteers to a tedious repetition of every single word targeted by the decoder. Second, it shows how specific design choices—including contrastive learning and our multi-participant architecture (...truncated)