Thursday, December 16, 2010

All the world's words

Here's the pre-edited (but mostly identical) version of my story for Nature news on an intriguing paper in Science on data-mining of Google Books. There's the danger that in the wrong hands this kind of thing could end up supplanting textual and historical analysis with lexical statistics. But there's clearly a wealth of interesting stuff to be gleaned this way. And I thoroughly approve of a paper that is not afraid to show a sense of humour.

*********************************************************

The digitization of books by Google Books has provoked controversy over issues of copyright and book sales, but for linguists and cultural historians it could offer an unprecedented treasure trove. In a paper in Science[1], researchers at Harvard University and the Google Books team in Mountain View, California, herald a new discipline, called culturomics, which mines this literary bounty for insights into trends in what cultures can and will talk about through the written word.

Among the findings described by the collaboration, led by biologist Jean-Baptiste Michel at Harvard, are the size of the English language (around one million words in 2000), the typical ‘fame trajectories’ of well-known people, and the literary signatures of censorship such as that imposed by the German Nazi government.

‘The possibilities with such a new database, and the ability to analyse it in real time, are really exciting’, says linguist Sheila Embleton of York University in Canada. She concurs with the authors’ claim that culturomics offers ‘a new type of evidence in the humanities.’

‘Quantitative analysis of this kind can reveal patterns of language usage and of the salience of a subject matter to a degree that would be impossible by other means’, agrees historian Patricia Hudson of Cardiff University in Wales.

‘The really great aspect of all this is using huge databases, but they will have to be used in careful ways, especially considering alternative explanations and teasing out the differences in alternatives from the database’, says Royal Skousen, a linguist at Brigham Young University in Provo, Utah. But he is not won over by the term ‘culturomics’: ‘It smacks too much of “freakonomics”, and both terms smack of amateur sociology.’

Using statistical and computational techniques to analyse vast quantities of data in historical and linguistic research is nothing new in itself – the fields called quantitative history and quantitative linguistics are well established. But it is the sheer volume of the database created by Google Books that sets the new work apart.

So far, Google has digitized over 15 million books, representing about 12 percent of all those ever published. Michel and his colleagues performed their analyses on just a third of this sample, selected on the basis of the good quality of the digitization via optical character recognition and reliable information about the provenance, such as the date and place of publication.

The resulting data set contained over 500 billion words, mostly in English. This is far more than any single person could read: a fast reader would, without breaks for food and sleep, need 80 years to finish the books for the year 2000 alone.
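As a rough back-of-envelope check on that figure, the sketch below works out how many words an 80-year non-stop read would cover. The reading speed of 200 words per minute is an assumption made here for illustration, not a number given in this piece, and the function name is hypothetical.

```python
# Back-of-envelope: how many words does a non-stop reader get through?
# ASSUMPTION: 200 words per minute is an illustrative reading speed,
# not a figure stated in the article.
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes


def words_read(years: float, words_per_minute: float = 200.0) -> float:
    """Words read in `years` of continuous reading at the given pace."""
    return years * MINUTES_PER_YEAR * words_per_minute


if __name__ == "__main__":
    total = words_read(80)
    print(f"{total / 1e9:.1f} billion words")  # prints "8.4 billion words"
```

On that assumption, the books for the year 2000 alone would run to something over 8 billion words, still only a small fraction of the 500-billion-word data set.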

Not all isolated strings of characters in texts are real words – some are common numbers, others abbreviations or typos. In fact, 51 percent of the character strings in 1900, and 31 percent in 2000, were ‘non-words’. ‘I really have trouble believing that’, admits Embleton. ‘If it’s true, it would really shake some of my foundational thoughts about English.’

By this count, the English language has grown by over 70 percent during the past 50 years, and around 8,500 new words are being added each year. Moreover, only about half of the words currently in use are apparently documented in standard dictionaries. ‘That high amount of lexical “dark matter” is also very hard to believe, and would also shake some foundations’, says Embleton, adding: ‘I’d love to see the data.’
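The figures quoted here can be checked against each other with a little arithmetic. The sketch below takes the size of around one million words in 2000 and the 70 percent growth over 50 years, and asks what average yearly increase they imply. It assumes exactly 70 percent growth spread evenly over the period, which is a simplification of mine rather than anything claimed in the paper, and the function name is hypothetical.

```python
# Consistency check on the vocabulary figures quoted in the article:
# about one million words in 2000, growth of over 70 percent in 50 years.
# ASSUMPTION: exactly 70 percent growth, spread evenly over the period.


def average_new_words_per_year(size_now: float, growth: float, years: float) -> float:
    """Mean yearly increase implied by a total fractional growth."""
    size_then = size_now / (1.0 + growth)  # vocabulary size at the start
    return (size_now - size_then) / years


if __name__ == "__main__":
    per_year = average_new_words_per_year(1_000_000, 0.70, 50)
    print(round(per_year))  # prints 8235
```

That comes to a little over 8,200 words a year on average, in the same range as the quoted 8,500. Since the growth was ‘over’ 70 percent and need not have been even, the current rate could well be somewhat higher than this average.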

In principle she can, because the researchers have made their database public. This will allow others to explore the huge number of potential questions it suggests, not just about word use but about cultural history. Michel and colleagues offer two such examples, concerned with fame and censorship.

They say that actors reach their peak of fame, as recorded in references to names, around the age of 30, while writers take a decade longer but achieve a higher peak. ‘Science is a poor route to fame’, they say. Physicists and biologists who achieve fame do so only late in life, while ‘even at their peak, mathematicians tend not to be appreciated by the public.’

Nation-specific subsets of the data can show how references to ideas, events or people drop out of sight due to state suppression. For example, the Jewish artist Marc Chagall virtually disappears from German writings in 1936-1944 (while remaining prominent in the English language), and ‘Trotsky’ and ‘Tiananmen Square’ similarly vanish in Russian and Chinese works respectively. The authors also look at trends in references to feminism, God, diet and evolution.

‘The ability, via modern technology, to look at just so much at once really opens horizons’, says Embleton. However, Hudson cautions that making effective use of such a resource will require skill and judgement, not just number-crunching.

‘How this quantitative evidence is generated – in response to what questions – and how it is interpreted are the most important factors in forming conclusions’, she says. ‘Quantitative evidence of this kind must always address suitably framed general questions, and be employed alongside qualitative evidence and reasoning, or it will not be worth a great deal.’

Reference
1. Michel, J.-B. et al. Science doi:10.1126/science.1199644.

3 comments:

William said...

‘It smacks too much of ‘freakonomics’, and both terms smack of amateur sociology.’

Ouch!

This is a fun toy indeed.

Interesting finds,

1. More political books when liberals are in office?
http://ngrams.googlelabs.com/graph?content=Liberal%2Cliberal%2C+Conservative%2C+conservative&year_start=1920&year_end=2008&corpus=0&smoothing=3

2. I predict a reversal...
http://ngrams.googlelabs.com/graph?content=Economics%2C+economics&year_start=1920&year_end=2008&corpus=0&smoothing=3

3. Keep fighting the Good Fight, Goldacre
http://ngrams.googlelabs.com/graph?content=Homeopathy%2C+homeopathy&year_start=1920&year_end=2008&corpus=0&smoothing=3

3. The Harry Potter Effect
http://ngrams.googlelabs.com/graph?content=Wizard%2C+wizard&year_start=1920&year_end=2008&corpus=0&smoothing=3

Alright, I'm done. This thing is going to waste so much of my time in the foreseeable future. And to think, I was about to start Chapter 10 of The Music Instinct.

William said...

And yes, I am incapable of counting to four.

Philip Ball said...

That's very neat, William. I am not going to go there and start looking, or I'll never come away. But what a resource.

Chapter 10 is the one I liked best. Hope you enjoy it.