Monday, January 24, 2011

How words get the message across


Here is the pre-edited version of my latest news article for Nature online, with a bit of extra stuff appended for which there was no room.
***********************************************************

Languages are adapted to deliver information efficiently and smoothly.

Longer words tend to carry more information, according to research by a team of cognitive scientists at the Massachusetts Institute of Technology.

It’s a suggestion that might sound intuitively obvious, until you start to think about it. Why, then, the difference in length between ‘now’ and ‘immediately’? For many years, linguists have tended to believe that word length depended primarily on how often the word is used – a relationship discovered in the 1930s by the Harvard linguist George Kingsley Zipf [1].

Zipf believed that this link between word length and frequency stemmed from an impulse to minimize the amount of time and effort needed for speaking and writing, since it means we use more short words than long ones. But Steven Piantadosi and colleagues say that, to convey a given amount of information, it is more efficient to shorten the least informative – and therefore the most predictable – words, rather than the most frequent ones.

Zipf’s relationship is roughly correct, as implied by how much more often ‘a’, ‘the’ and ‘is’ are used in English than, say, ‘extraordinarily’. And this relationship of length to use seems to hold up in many languages. Because written and spoken length are generally similar, it applies to both speech and text.

But after analysing word use in 11 different European languages, Piantadosi and colleagues found that word length was more closely correlated with information content than with usage frequency. They describe their results in the Proceedings of the National Academy of Sciences USA [2].

“This is a landmark study”, says linguist Roger Levy of the University of California at San Diego. “Our understanding of the relationship between word frequency and length has remained relatively static since Zipf’s discoveries”, he says, and he feels that this new study may now supply “the largest leap forward in 75 years in our understanding of how principles of communicative efficiency govern the evolution of natural language lexicons.”

Measuring the information content of a word isn’t easy, especially because it can vary depending on the context. The more predictable a word is, the less informative it is. The word ‘nine’ in ‘A stitch in time saves nine’ contains less information than it does in the phrase ‘The word that you will hear is nine’, because in the first case it is highly predictable.

The MIT group devised a method for estimating the information content of words in digitized texts by looking at how each word is correlated with – and thus, predictable from – the preceding words. For just a single preceding word, Piantadosi explains that “we count up how often all pairs of words occur together in sequence, such as ‘the man’, ‘the boy’, ‘a man’, ‘a tree’ and so on. Then we use this count to estimate the probability of a word conditioned on the previous word – or more generally, the probability of any word conditioned on any preceding sequence of a given number of words.” According to information theory, the information content is then proportional to the negative logarithm of this probability.
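For the single-preceding-word case, the counting procedure Piantadosi describes can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not the MIT group's code: the function name `mean_information`, the toy sentence, the use of base-2 logarithms (so that the answer comes out in bits) and the absence of any smoothing for rare word pairs are all choices made here for simplicity.

```python
import math
from collections import Counter

def mean_information(tokens):
    """Average information content (in bits) of each word, given one preceding word.

    Hypothetical helper: probabilities are raw relative frequencies of word
    pairs, with no smoothing, so it only makes sense as a toy illustration.
    """
    # How often each pair of adjacent words occurs, e.g. ('the', 'man').
    pair_counts = Counter(zip(tokens, tokens[1:]))
    # How often each word appears as the first member of a pair.
    prev_counts = Counter(tokens[:-1])

    totals = Counter()       # summed information per word
    occurrences = Counter()  # number of contexts counted per word
    for (prev, word), count in pair_counts.items():
        probability = count / prev_counts[prev]   # P(word | prev)
        information = -math.log2(probability)     # negative log probability
        totals[word] += count * information
        occurrences[word] += count

    return {word: totals[word] / occurrences[word] for word in totals}

if __name__ == "__main__":
    text = "the man saw the boy and a man saw a tree".split()
    for word, bits in sorted(mean_information(text).items()):
        print(f"{word}: {bits:.2f} bits")
```

In this toy text 'saw' always follows 'man', so it is perfectly predictable and carries no information, whereas 'man' follows 'the' only half the time and so carries one bit. A real analysis would need a very large corpus and, as the quote notes, could condition on longer sequences of preceding words.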

However, physicist Damián Zanette of the Centro Atómico Bariloche in Argentina, who has studied Zipf-type relationships in linguistics, is not persuaded that this method accurately captures the real information content of a word in context. This, he says, is typically determined by a span of several hundred surrounding words, not just a few [3].

Piantadosi and colleagues suggest that the relationship of word length to information content might not only make it more efficient to convey information linguistically but also make language cognition a smoother ride for the reader or listener. If shorter words carry less information, then the density of information throughout a phrase or sentence will be smoothed out, so that it is delivered at a roughly steady rate rather than in lumps. In this way, the results suggest how the lexical structure of language might aid communication.

Surprising though it may seem, some linguists have suggested previously that communication might not in fact be the primary purpose of language – Noam Chomsky, for example, has claimed that it is about establishing social relationships. Yet according to cognitive scientist Florian Jaeger of the University of Rochester in New York, these new results “suggest that communication is a sufficiently important aspect of language to shape it over time”.


References

1. Zipf, G. The Psychobiology of Language (Routledge, London, 1936).
2. Piantadosi, S. T., Tily, H. & Gibson, E. Proc. Natl Acad. Sci. USA 10.1073/pnas.1012551108 (2011).
3. Montemurro, M. A. & Zanette, D. H. Adv. Complex Syst. 13, 135-153 (2010).


Some further comments from Steven Piantadosi in response to my questions:

PB: In terms of the possible reasons for your central finding: are you suggesting that shorter words carry less information largely so that information tends to be rather evenly distributed through both text and (because of the relationship of orthographic to phonetic length) speech, i.e. the short, 'rapid-fire' words don't carry a lot of info and so don't impose a sudden high demand on cognitive processing?

SP: Yes, that's probably the most likely theory for what's going on. There are quite a few papers in psycholinguistics showing these kinds of effects (references 7,8,9,10,12 in the paper). In Levy & Jaeger, for instance, people insert optional syntactic elements like "that" in locations where there would otherwise be a peak in information content – inserting another word helps keep information per unit time lower.

PB: In this respect, what do the findings imply for the long-standing idea that language is a compromise between the needs of the speaker and those of the listener? It rather seems the balance here is in favour of the listener, who gets a smooth rather than lumpy informational stream, whereas the speaker has to do rather more speaking than if length depended primarily on frequency. Or does your idea also optimize the total amount (time) of speaking needed to convey a given amount of information, and so benefit the speaker too?

SP: This is a really interesting issue. It could be caused by speakers thinking about what listeners would want, or it could just reflect intrinsic properties of language production systems, or both. Speakers have more trouble accessing low frequency (probably also high information content) words, so I wouldn't say that this necessarily has to come from speakers designing speech for listeners. It's true that speakers have to do more speaking, but that also means they have more time to plan and produce their utterances. It also helps listeners by giving them more time to process. I don't think we know who it's really for, yet.

PB: Finally, more for my own curiosity than anything, I can't help wondering if anything of this sort works for Chinese. Obviously one tends to lose the phonetic/orthographic link there - and while commonly used words do sometimes have simpler written characters, this is not always so. Do you nonetheless expect to see any kind of relationship between information content and the number of strokes in the characters? Does any such thing then survive in speech patterns?

SP: Ah that's interesting. I'm not sure I would necessarily predict effects in Chinese orthography per se, but it would be interesting to look – it would be a neat case for seeing if there are actually influences on the writing system. In the current work, we used orthography largely as a proxy for phonetic length. Chinese has very many monosyllabic words so it's not clear that word length has much variance to be explained there. That raises the interesting question of why Chinese is like that. It may be that information content is modulated in other ways in Chinese, but I don't know.
