even if our expectations are based on stereotypes rather
than authentic experience (McGowan, in press).
In Sumner, Kim, King, and McGowan (2014) we propose a model (above) of how the linguistic and the social aspects of speech interact to support perception. We propose that listeners process both phonetically cued social information and phonetically cued linguistic information prior to word recognition and that these dual routes can interact.
So does all this knowledge and sensitivity only apply to social variation?
First, some quick background on how sounds like [p], [t], and [k] differ from sounds like [b], [d], and [g] at the beginning of English words like pit and bit. What word is this native speaker of American English saying?
The image to the left is a spectrogram (frequency analysis over time) of the word pit. Hear the puff of air at the beginning? It is highlighted in blue in the spectrogram.
Both pit and bit start with the lips completely closed. One of the main differences between them is the duration of the puff of air; this duration is called VOT (voice onset time).
At least in American English, that puff of air is so important that cutting it out of pit (that first sound you played) results in a word that sounds a lot like bit, though probably with a funny [b]. That funniness is every bit as interesting and important as the change from [p] to [b]!
When we listen to speech we are phenomenally sensitive to covarying patterns
of phonetic detail. One such covarying pattern is the tendency for VOT
to be shorter in a fast speech style than in slower speech...
In fact, removing most of the VOT from [p], [t], and [k] words makes them
less useful to listeners (shortest green bar) in slow (Citation) speech, but
if the rest of the word is spoken quickly, the short VOT sounds fine (Fast speech, on the right) (abstract).
Another covarying feature is the way vowels before nasal consonants in English tend to be nasalized. Listeners can use this nasalization as soon as it becomes available, not only for a large distinction like bend/bed...
but also for a much more subtle distinction, like the difference in nasalization between these two sound files. Can you hear a difference?
This first recording has late nasalization starting 100 milliseconds after the [b].
This second recording has early nasalization starting 33 milliseconds after the [b].
In an eye-tracking task, we found that listeners can use nasalization as soon as it is present. Looks to the heavily nasalized word were, on average, 60 ms faster, the same as the average difference between early and late nasalization in the recordings (Beddor, McGowan, Boland, Coetzee, and Brasher, 2013).
Whether the information is social, contextual, articulatory, or
idiosyncratic, we humans have an astonishing ability to attend
to it, remember it, and activate it during perception. This ability, my
research suggests, is not irrelevant to linguistic competence or
even peripheral to it; it is fundamentally and centrally part of
what it means to know and speak a human language.
Thank you for reading! If you have any questions, please contact me via e-mail, Twitter, or carrier pigeon.
And many, many thanks to my friend Markus Nee for turning me into this cartoon.