Tuesday, May 08, 2012

Lip-reading the emotions

And another BBC Future piece… I was interviewed on related issues recently (not terribly coherently, I fear) for the BBC’s See Hear programme for the deaf community. This in turn was a spinoff from my involvement in a really splendid documentary by Lindsey Dryden on the musical experiences of people with partial hearing, called Lost and Sound, which will hopefully get a TV airing some time soon.

I have no direct experience with cochlear implants (CIs) – electronic devices that partly compensate for severe hearing impairment – but listening to a simulation of the sound produced is salutary. It is rather like hearing things underwater: fuzzy and with an odd timbre, yet still conveying words and some other identifiable sounds. It’s a testament to the adaptability of the human brain that auditory information can be recognizable even when the characteristics of the sound are so profoundly altered. Some people with CIs can appreciate and even perform music.

The use of these devices can provide insights into how sound is processed in people with normal hearing – insights that can help us to identify what can potentially go wrong and how it might be fixed. That’s evident in a trio of papers buried in the recondite but infallibly fascinating Journal of the Acoustical Society of America, a publication whose scope ranges from urban noise pollution to whale song and the sonic virtues of cathedrals.

These three papers examine what gets lost in translation in CIs. Much of the emotional content, as well as some semantic information, in speech is conveyed by the rising and falling of the voice – what is called prosody. In English, prosody can distinguish a question from a statement (at least before the rising inflection became fashionable). It can tell us if the speaker is happy, sad or angry. But because neither the pitch of sounds nor their ‘spectrum’ of frequencies is well conveyed by CIs, users may find it harder to identify such cues – they can’t easily tell a question from a statement, say, and they rely more on visual than auditory information to gauge a speaker’s emotional state.

Takayuki Nakata of Future University Hakodate in Japan and his coworkers have verified that Japanese children who are congenitally deaf but use CIs are significantly less able to identify happy, sad, and angry voices in tests in which normal hearers of the same age have virtually total success [1]. They went further than previous studies, however, in asking whether these difficulties inhibit a child’s ability to communicate emotion through prosody in their own speech. Indeed they do, regardless of age – an indication both that we acquire this capability by hearing and copying, and that CI users face the additional burden of being less likely to have their emotions perceived.

Difficulties in hearing pitch can create even more severe linguistic problems. In tonal languages such as Mandarin Chinese, changes in pitch may alter the semantic meaning of a word. CI users may struggle to distinguish such tones even after years of using the device, and hearing-impaired Mandarin-speaking children who start using them before they can speak are often scarcely intelligible to adult listeners – again, they can’t learn to produce the right sounds if they can’t hear them.

To understand how language tones might be perceived by CI users, Damien Smith and Denis Burnham of the University of Western Sydney in Australia have tested normal hearers with audio signals of spoken Mandarin altered to simulate CIs. The results were surprising [2].

Both native Mandarin speakers and English-speaking subjects did better in identifying the (four) Mandarin tones when the CI-simulated voices were accompanied by video footage of the speakers’ faces. That’s not so surprising: it’s well known that we use visual cues to perceive speech. But all subjects did better than random guessing with the visuals alone, and in this case non-Mandarin speakers did better than Mandarin speakers. In other words, native speakers learn to disregard visual information in preference for auditory. What’s more, these findings suggest that CI users could be helped by training them to recognize the visual cues of tonal languages: if you like, to lip-read the tones.

There’s still hope for getting CIs to convey pitch information better. Xin Luo of Purdue University in West Lafayette, Indiana, in collaboration with researchers from the House Research Institute, a hearing research centre in Los Angeles, has figured out how to make CIs create a better impression of smooth pitch changes such as those in prosody [3]. CIs do already offer some pitch sensation, albeit very coarse-grained. The cochlea, the pitch-sensing organ of the ear, contains a coiled membrane which is stimulated in different regions by different sound frequencies – low at one end, high at the other, rather like a keyboard. The CI creates a crude approximation of this continuous pitch-sensing device using a few (typically 16–22) electrodes to excite different auditory-nerve endings, producing a small set of pitch steps instead of a smooth pitch slope. The researchers’ approach sweeps the signal from one electrode to the next, so that pitch changes seem gradual instead of jumpy.
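For readers who like to see the idea in code, here is a rough sketch of the difference between stepping from electrode to electrode and sweeping between them. It is only an illustration under assumptions of my own, not the method of ref. [3]: the logarithmic frequency map, the 200–8000 Hz range, the linear cross-fade between neighbouring electrodes and all the function names are hypothetical. Only the figure of 16 electrodes comes from the range quoted above.

```python
import math

N_ELECTRODES = 16   # lower end of the 16-22 range quoted above
F_LOW = 200.0       # hypothetical lowest mapped frequency (Hz)
F_HIGH = 8000.0     # hypothetical highest mapped frequency (Hz)


def place(freq_hz):
    """Continuous position along the array, from 0 to N_ELECTRODES - 1.

    Assumes a logarithmic frequency-to-place map, chosen for illustration.
    """
    f = min(max(freq_hz, F_LOW), F_HIGH)  # clamp to the mapped range
    return (N_ELECTRODES - 1) * math.log(f / F_LOW) / math.log(F_HIGH / F_LOW)


def stepped(freq_hz):
    """Drive only the nearest electrode: pitch moves in discrete steps."""
    return round(place(freq_hz))


def swept(freq_hz):
    """Share the signal between two neighbouring electrodes.

    Returns (lower electrode index, weight on lower, weight on upper).
    The linear cross-fade is an assumption, not a detail from the paper.
    """
    pos = place(freq_hz)
    lower = min(int(pos), N_ELECTRODES - 2)
    upper_weight = pos - lower
    return lower, 1.0 - upper_weight, upper_weight


# A rising glide of one octave in quarter-tone steps, starting at 200 Hz.
glide = [200.0 * 2 ** (i / 24) for i in range(25)]
steps = [stepped(f) for f in glide]
sweep = [swept(f) for f in glide]
```

In this toy version the glide lands on only a handful of distinct electrodes when stepped, whereas the swept version gives a different pair of weights for every point on the glide, which is the sense in which the pitch change becomes gradual.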

The cochlea can also identify pitches by, in effect, ‘timing’ successive acoustic oscillations to figure out the frequency. CIs can simulate this method of pitch discrimination too, but only for frequencies up to about 300 hertz, the upper limit of a bass singing voice. Luo and colleagues say that a judicious combination of these two ways of conveying pitch, enabled by signal-processing circuits in the implant, creates a synergy that, with further work, should offer much improved pitch perception for users: enough, at least, to allow them to capture more of the emotion-laden prosody of speech.

1. T. Nakata, S. E. Trehub & Y. Kanda, Journal of the Acoustical Society of America 131, 1307 (2012).
2. D. Smith & D. Burnham, Journal of the Acoustical Society of America 131, 1480 (2012).
3. X. Luo, M. Padilla & D. M. Landsberger, Journal of the Acoustical Society of America 131, 1325 (2012).


JimmyGiro said...

"...native [Chinese] speakers learn to disregard visual information in preference for auditory."

I wonder if this is where the concept of 'Chinese inscrutability' comes from?

If a language uses regions of the brain that also deal with emotion, it may temporarily put facial expression on hold during speech.

Meanwhile, Westerners who use their left hemisphere exclusively for speech, especially 'predigested' speech, are free to use their right hemisphere to manipulate their facial expressions contemporaneously. To the point that we have lost the ability to 'sing' our language, as the facial cues displace the prosody somewhat.

Would this imply that Westerners are better at lying to your face, whilst the Chinese would be better at lying on the phone?

Philip Ball said...

I think "Chinese inscrutability" is a Western construct - our circumlocution tends to make us just as inscrutable to Chinese people, I believe. Having said that, Chinese people are awesomely, even admirably, good at lying to your face. I say this as a devout Sinophile, and I think my Chinese friends would agree. It's simply a cultural difference.