Thursday, December 16, 2010

All the world's words

Here's the pre-edited (but mostly identical) version of my story for Nature news on an intriguing paper in Science on data-mining of Google Books. There's the danger that in the wrong hands this kind of thing could end up supplanting textual and historical analysis with lexical statistics. But there's clearly a wealth of interesting stuff to be gleaned this way. And I thoroughly approve of a paper that is not afraid to show a sense of humour.


The digitization of books by Google Books has provoked controversy over issues of copyright and book sales, but for linguists and cultural historians it could offer an unprecedented treasure trove. In a paper in Science[1], researchers at Harvard University and the Google Books team in Mountain View, California, herald a new discipline, called culturomics, which mines this literary bounty for insights into trends in what cultures can and will talk about through the written word.

Among the findings described by the collaboration, led by biologist Jean-Baptiste Michel at Harvard, are the size of the English language (around one million words in 2000), the typical ‘fame trajectories’ of well-known people, and the literary signatures of censorship such as that imposed by the German Nazi government.

‘The possibilities with such a new database, and the ability to analyze it in real time are really exciting’, says linguist Sheila Embleton of York University in Canada. She concurs with the authors’ claim that culturomics offers ‘a new type of evidence in the humanities.’

‘Quantitative analysis of this kind can reveal patterns of language usage and of the salience of a subject matter to a degree that would be impossible by other means’, agrees historian Patricia Hudson of Cardiff University in Wales.

‘The really great aspect of all this is using huge databases, but they will have to be used in careful ways, especially considering alternative explanations and teasing out the differences in alternatives from the database,’ says Royal Skousen, a linguist at Brigham Young University in Provo, Utah. But he is not won over by the term ‘culturomics’: ‘It smacks too much of ‘freakonomics’, and both terms smack of amateur sociology.’

Using statistical and computational techniques to analyse vast quantities of data in historical and linguistic research is nothing new in itself – the fields called quantitative history and quantitative linguistics are well established. But it is the sheer volume of the database created by Google Books that sets the new work apart.

So far, Google has digitized over 15 million books, representing about 12 percent of all those ever published. Michel and his colleagues performed their analyses on just a third of this sample, selected on the basis of the good quality of the digitization via optical character recognition and reliable information about the provenance, such as the date and place of publication.

The resulting data set contained over 500 billion words, mostly in English. This is far more than any single person could read: a fast reader would, without breaks for food and sleep, need 80 years to finish the books for the year 2000 alone.

Not all isolated strings of characters in texts are real words – some are common numbers, others abbreviations or typos. In fact, 51 percent of the character strings in 1900, and 31 percent in 2000, were ‘non-words’. ‘I really have trouble believing that’, admits Embleton. ‘If it’s true, it would really shake some of my foundational thoughts about English.’

By this count, the English language has grown by over 70 percent during the past 50 years, and around 8,500 new words are being added each year. Moreover, only about half of the words currently in use are apparently documented in standard dictionaries. ‘That high amount of lexical ‘dark matter’ is also very hard to believe, and would also shake some foundations’, says Embleton, adding ‘I’d love to see the data.’

In principle she can, because the researchers have made their database public. This will allow others to explore the huge number of potential questions it suggests, not just about word use but about cultural history. Michel and colleagues offer two such examples, concerned with fame and censorship.

They say that actors reach their peak of fame, as recorded in references to names, around the age of 30, while writers take a decade longer but achieve a higher peak. ‘Science is a poor route to fame’, they say. Physicists and biologists who achieve fame do so only late in life, while ‘even at their peak, mathematicians tend not to be appreciated by the public.’

Nation-specific subsets of the data can show how references to ideas, events or people drop out of sight due to state suppression. For example, the Jewish artist Marc Chagall virtually disappears from German writings in 1936-1944 (while remaining prominent in the English language), and ‘Trotsky’ and ‘Tiananmen Square’ similarly vanish in Russian and Chinese works respectively. The authors also look at trends in references to feminism, God, diet and evolution.

‘The ability, via modern technology, to look at just so much at once really opens horizons’, says Embleton. However, Hudson cautions that making effective use of such a resource will require skill and judgement, not just number-crunching.

‘How this quantitative evidence is generated – in response to what questions – and how it is interpreted are the most important factors in forming conclusions’, she says. ‘Quantitative evidence of this kind must always address suitably framed general questions, and employed alongside qualitative evidence and reasoning, or it will not be worth a great deal.’

1. Michel, J.-B. et al. Science doi:10.1126/science.1199644.

Thursday, December 09, 2010

Debye's dirty hands?

I have written a news story for Nature on new findings about the life of Peter Debye, who has been accused recently of colluding with the Nazis in the run-up to the Second World War. It’s very rich material (even if the new ‘revelations’ are rather indirect and add only a speculative element to the tale); I have written a piece on this for Chemistry World too, but had better wait for that to appear before posting it here. This pre-edited version is not as well structured as the final story, but contains more of the details and anecdotes, so here it is anyway. This is clearly an issue on which feelings run high, so I look forward (I think) to the feedback.


Peter Debye, the Dutch 1936 chemistry Nobel Laureate recently discredited by allegations of being a Nazi sympathizer, could in fact have been an anti-Nazi informer to the Allies during the approach to the Second World War, according to a new analysis of his private correspondence.

In a paper in the journal Ambix, retired chemist Jurrie Reiding in the Netherlands describes archival documents suggesting that Debye might have supplied information to a spy for the British intelligence agency MI6 in Berlin [1].

Although the new evidence is circumstantial, it adds to a mounting case for rehabilitating Debye’s name. When the Nazi links and accusations of anti-Semitism were asserted four years ago, two Dutch universities expunged Debye’s name from a research institute and an annual prize. The new paper ‘is an important and welcome contribution to the debate, which can help in arriving at a more balanced judgement’, says Ernst Homburg, a science historian at the University of Maastricht.

Debye, who worked for most of his pre-war career in Germany, became chairman of the German Physical Society (DPG) in 1937. Four years earlier, a law introduced by Hitler’s Nazi regime demanded the dismissal of all Jewish university professors. Among those who lost their posts was the pioneering nuclear physicist Lise Meitner at the University of Berlin.

In December 1938 the DPG board decided to expel the few remaining Jewish members. Debye sent a letter to members explaining this, citing ‘circumstances beyond our control’ and signing off with ‘Heil Hitler!’ ‘Under the circumstances of those days, it was almost impossible not to write such a letter’, says Homburg.

Nonetheless, in January 2006 the letter was described in an article titled ‘Nobel Laureate with dirty hands’ in the Dutch newspaper Vrij Nederland, published in association with a book (in Dutch) called Einstein in Nederland by the journalist Sybe Rispens. The ensuing media controversy caused such alarm that the University of Utrecht removed Debye’s name from its institute for nanomaterials science, while the University of Maastricht, in Debye’s home town, withdrew its involvement in the annual Debye Prize for scientific research, sponsored by industrial benefactors the Hustinx Foundation.

This caused a storm of protest, not least from the researchers of the former Debye Institute in Utrecht. Chemist Héctor Abruña of Cornell University, where Debye worked after coming to the US in 1940, criticized the ‘rush to judgement’ and said that a university enquiry there found no evidence for the allegations.

As a result the Dutch Ministry of Education commissioned the Dutch Institute for War Documentation (NIOD) to investigate the Debye affair. Its report, released in 2007, softened the accusations to say that Debye had been guilty of ‘opportunism’ under the Nazis, but accused him of ‘keeping the back door open’ by secretly sustaining contacts with Nazi Germany while in the US.

All the same, in 2008 the Dutch government committee advised the universities of Utrecht and Maastricht to continue using Debye’s name, since the evidence of his ‘bad faith’ was equivocal. The Debye Institute at Utrecht was reinstated, and the Maastricht prize is due to be awarded again next year. However, according to historian of chemistry Peter Morris, who edits Ambix, ‘in the Netherlands and to a lesser extent the USA this affair severely damaged Debye’s reputation.’

Critics of the Dutch universities’ initial decision have cited various arguments why Debye should not be judged too harshly or rashly. When he was chosen by the resolutely anti-Nazi Max Planck to be director of the Kaiser Wilhelm Institute of Physics (KWIP) in Berlin – a post that he occupied from 1935 until 1939 – it was precisely because he was non-German and was thought able to resist Nazi interference. Debye insisted that the place be named the Max Planck Institute when it finally opened in 1938. When the Nazis objected, Debye covered the name carved in stone over the entrance with a wooden plank – a pun that worked in German too.

And Debye accepted his Nobel Prize against the explicit wishes of the Nazis, who had commanded all Germans not to do so. He helped Meitner escape to Holland in 1938, and the Nazis opposed Debye’s chairmanship of the DPG because they considered him too friendly towards Jews. In 1940 Debye sailed to the US to give a series of prestigious lectures at Cornell – where he then stayed until his death in 1966. He aided the US war effort enthusiastically, especially through his work on polymers and synthetic rubber.

‘There were already enough arguments for Debye’s ‘rehabilitation’ before this article’, says Homburg, who calls Rispens’ book ‘heavily flawed’. But now Reiding adds a new narrative to the defence.

Debye, he says, was a friend of Paul Rosbaud, an Austrian working at the KWIP in Berlin, who was recruited by the British secret service to supply scientific information including details of the development of the V1 and V2 rockets and the German attempts to develop an atomic bomb. Rosbaud, who loathed the Nazis, remained in Berlin throughout the war, although even now information about his activities under the codename ‘Griffin’ remains classified.

Because of his consultancy with the academic Berlin publisher Springer Verlag, Rosbaud was very well connected in German science and had known Debye since at least 1930. He too played a key role in getting Meitner out of Germany, and Debye maintained the relationship with Rosbaud after the war. ‘The close friendship between Rosbaud and Debye makes it almost unquestionable that Debye was an anti-Nazi’, Reiding says.

And he points out that, as testified by other scientists to the FBI in the 1940s, Debye would have been party to some highly sensitive information about the German war technology during his time in Berlin. ‘Therefore’, Reiding says, ‘the hypothesis that Debye was a secret informant for Rosbaud does not appear too bold.’

Although Morris thinks that ‘further evidence would be needed before this case could be proved beyond doubt’, he adds that ‘I feel that there was a rush to judgement that not only failed to take into account all the aspects of Debye’s complex life but also failed to give full weight to the ambiguous nature of life under Nazi rule.’

Others question whether the new details add much to the story. ‘There seem to be two camps: those who hate Debye and deplore his actions as president of the DPG, and those who think he was a saint’, says Henk Lekkerkerker of the Debye Institute. ‘Both opinions are misleading, and the professional historians paint a more subtle and accurate picture.’

Perhaps ultimately a clue to Debye’s position lies in a letter that he wrote to the physicist Arnold Sommerfeld in December 1939, just before he left Germany for good. His aim, he said, was ‘not to despair and always be ready to grab the Good which whisks by, without granting the Bad any more room than is absolutely necessary. That is a principle of which I have already made much use.’

1. J. Reiding, Ambix 57, 275-300 (2010).