The perception of intonational and emotional speech prosody produced with and without a face mask: an exploratory individual differences study
Sinagra and Wiener
Cognitive Research: Principles and Implications
https://doi.org/10.1186/s41235-022-00439-w
(2022) 7:89
ORIGINAL ARTICLE
Open Access
Chloe Sinagra and Seth Wiener*
Abstract
Face masks affect the transmission of speech and obscure facial cues. Here, we examine how this reduction in acoustic and facial information affects a listener’s understanding of speech prosody. English sentence pairs that differed in
their intonational (statement/question) and emotional (happy/sad) prosody were created. These pairs were recorded
by a masked and unmasked speaker and manipulated to either include or omit audio. This resulted in a continuum from
typical unmasked speech with audio (easiest) to masked speech without audio (hardest). English listeners (N = 129)
were tested on their discrimination of these statement/question and happy/sad pairs. We also collected six individual
difference measures previously reported to affect various linguistic processes: Autism Spectrum Quotient, musical
background, phonological short-term memory (digit span, 2-back), and congruence task (flanker, Simon) behavior.
The results indicated that masked statement/question and happy/sad prosodies were harder to discriminate than
unmasked prosodies. Masks can therefore make it more difficult to understand a speaker’s intended intonation or
emotion. Importantly, listeners differed considerably in their ability to understand prosody. When wearing a mask,
speakers should try to speak more clearly and loudly, if possible, and make intentions and emotions explicit to the listener.
Keywords: Face masks, Speech perception, Prosody, Intonation, Emotion, Individual differences, Autism, Memory
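As a reading aid, the stimulus design summarized in the abstract can be sketched as a small factorial grid. This is a minimal illustration only, assuming that every prosody pair is crossed with both mask levels and both audio levels; the names `build_conditions` and `label_endpoint` are hypothetical and do not come from the study's materials. The abstract names only the two ends of the difficulty continuum, so the sketch leaves the remaining conditions unranked.

```python
from itertools import product

# Factor levels as described in the abstract; full crossing is an assumption.
PROSODY_PAIRS = ("statement/question", "happy/sad")
MASK_LEVELS = ("unmasked", "masked")
AUDIO_LEVELS = ("audio", "no audio")


def build_conditions():
    """Cross every prosody pair with every mask level and audio level."""
    return [
        {"pair": pair, "mask": mask, "audio": audio}
        for pair, mask, audio in product(PROSODY_PAIRS, MASK_LEVELS, AUDIO_LEVELS)
    ]


def label_endpoint(condition):
    """Label only the two endpoints the abstract names; leave the rest unranked."""
    key = (condition["mask"], condition["audio"])
    if key == ("unmasked", "audio"):
        return "easiest"
    if key == ("masked", "no audio"):
        return "hardest"
    return "intermediate"


conditions = build_conditions()
for condition in conditions:
    print(condition["pair"], condition["mask"], condition["audio"],
          label_endpoint(condition))
```

Under these assumptions the grid contains two prosody pairs in each of four mask-by-audio cells. The relative difficulty of the two intermediate cells is deliberately not encoded, because the abstract does not state it.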
Significance statement
For surgeons and painters, communication in face masks
is common. For others, COVID-19 marked the beginning of talking (speech production) and listening (speech
perception) while wearing a mask. Masks can affect the
transmission of the speech signal and obscure facial cues.
This change in listening conditions has affected people
differently. What are some of the factors that cause this
individual variability in listeners? This study explored
that question in terms of speech prosody. The utterance “it’s raining” can be a statement (flat intonation) or
a question (rising intonation). Prosody is often accompanied by facial cues, such as head tilts and eyebrow
raises. Masks can muffle speech cues and hide facial cues,
which can make prosody difficult to understand. Our
study found that masks make it harder to understand a
speaker’s statement/question intonational prosody and
happy/sad emotional prosody. Among the individual
differences we tested, we found that Autism Spectrum
Quotient predicted some performance on the prosody
discrimination task. The findings have potential educational and clinical implications. When speaking with
a mask, speakers should increase pitch and volume, if
possible. Because facial cues may be obscured, speakers
should also be more explicit about their intended emotions/questions (e.g., “I’m happy it’s raining.” “I have a
question: is it raining?”).
*Correspondence:
Language Acquisition, Processing, and Pedagogy Lab, Department of Modern
Languages, Carnegie Mellon University, Pittsburgh, PA, USA
© The Author(s) 2022. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which
permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the
original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or
other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line
to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory
regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this
licence, visit http://creativecommons.org/licenses/by/4.0/.
Introduction
To fight the spread of the COVID-19 virus, face mask
mandates were put in place by governments throughout
the world. For many people, this was the first time both
the speaker and listener wore masks during communication. Masks have acoustic and visual consequences.
Acoustically, the materials made to reduce the transmission of pathogens also reduce sound transmission (Magee
et al., 2020). As a result, masks can reduce a speaker’s
fundamental frequency (F0: what listeners perceive as
pitch) and amplitude (what listeners perceive as volume
or loudness). For many listeners, this reduction in acoustic information makes understanding speech more difficult (e.g., Brown et al., 2021; Fiorella et al., 2021; Mheidly
et al., 2020). Visually, a mask obscures the mouth and
hides facial cues. Visual information like mouth movements can help a listener better understand acoustic
information (e.g., Best, 1995; Fowler, 1986; Saunders
et al., 2021). For example, the relatively similar-sounding English speech sounds /s/ and /ʃ/ differ in their lip rounding, which listeners can use to better understand
whether the speaker needs to sip the bottle or ship the
bottle. Visual cues can be especially helpful for listeners with hearing problems, for communication in noisy environments, and for listening to nonnative speech (Fiorella
et al., 2021; House et al., 2001; Sueyoshi & Hardison,
2005; Winn et al., 2013).
In the present study, we extend recent research into
masks and speech perception by examining how masks affect the perception of speech prosody. Prosody is a broad
term that includes pitch, stress, rhythm, and intonation
(e.g., Cutler, 2012; Cutler et al., 1997). It is often described
as not what a speaker says, but how it is said. For example, a student telling a friend, “Class is cancelled” could
convey happiness because it is a boring class or sadness
because it is the student’s favorite class. Acoustic cues
like F0 and amplitude (among others) change given the
prosody of the speech. Here, we examine intonational
statement/question prosodies and emotional happy/sad
prosodies produced with and without masks. Statements
are usually characterized by their relatively falling volume
and pitch, whereas questions are usually characterized
by their relatively rising volume and pitch (Gussenhoven
& Chen, 2000; Pell, 2001; Srinivasan & Massaro, 2003).
Happy speech is typically characterized by its relatively
high volume and high pitch; in contrast, sad speech is
typically characterized by its relatively low volume and
low pitch (Bänziger & Scherer, 2005; Scherer, 2003; Sobi (...truncated)