Culture | Truth and statistics

How to find out what people really think

Data mining is becoming more and more precise

May 25th 2017

Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are. By Seth Stephens-Davidowitz. Dey Street; 288 pages; $27.99. To be published in Britain by Bloomsbury in July; £20.

TO MANY people Big Data is less shiny than it was a year ago. After Hillary Clinton’s defeat at the hands of Donald Trump, her vaunted analytics team took much of the blame for failing to spot warnings in the midwestern states that cost her the presidency. But according to research by Seth Stephens-Davidowitz, a former data scientist at Google, Mrs Clinton’s real mistake was relying not too much on newfangled statistics, but too little.

Mrs Clinton used the finest number-crunchers. But their calculations still relied largely on traditional sources, such as voter files and polls. In contrast, Mr Stephens-Davidowitz turned to a novel form of data: Google searches. In particular, he counted the frequency of queries for the word “nigger”, America’s most toxic racial slur. Contrary to the popular perception that overt racism is limited to the South, the numbers showed comparatively high interest in the term across the Midwest and the rustbelt relative to the rest of the country. In the Republican primaries in 2016 that variable outperformed all others in predicting which geographic areas would support Mr Trump over his intraparty rivals. Had Mrs Clinton’s team made better use of such information, they might have concluded, before it was too late, that the foundations of her “blue firewall” were cracking.

This is just one of the striking findings in “Everybody Lies”, a whirlwind tour of the modern human psyche using search data as its guide. Some of the book’s discoveries reaffirm conventional wisdom, like the concentration of queries about do-it-yourself abortions and about men who are confused about their sexual orientation in America’s socially conservative South. Some turn it on its head: although rags-to-riches narratives are widespread in basketball, the data show that growing up in poverty actually reduces a boy’s chances of making the National Basketball Association—perhaps because poor children are less likely to grow tall enough to play in it. Some results are both disturbing and perplexing, such as the prevalence of searches on pornographic sites for videos depicting sexual violence against women, and the fact that women themselves seek out these scenes at least twice as often as men do. Other results are just weird: why are adult men in India so eager to have their wives breastfeed them?

The empirical findings in “Everybody Lies” are so intriguing that the book would be a page-turner even if it were structured as a mere laundry list. But Mr Stephens-Davidowitz also puts forward a deft argument: the web will revolutionise social science just as the microscope and telescope transformed the natural sciences.

Modern microeconomics, sociology, political science and quantitative psychology all depend to a large extent on surveys of at most a few thousand respondents. In contrast, he says, there are “four unique powers of Big Data”: it provides new sources of information, such as pornographic searches; it captures what people actually do or think, rather than what they choose to tell pollsters; it enables researchers to home in on and compare demographic or geographic subsets; and it allows for speedy randomised controlled trials that demonstrate not just correlation but causality. As a result, he predicts, “the days of academics devoting months to recruiting a small number of undergraduates to perform a single test will come to an end.” In their place, “the social and behavioural sciences are most definitely going to scale,” and the conclusions researchers will be able to reach are “the stuff of science, not pseudoscience”.

Mr Stephens-Davidowitz is no knee-jerk cheerleader for the Big Data revolution. He devotes ample space both to the ways that quantitative findings can lead decision-makers astray, and to the risk that the nearly omniscient owners of such data sets may find ways to abuse them. If liking motorcycles turns out to predict a lower IQ, he asks, should employers be allowed to reject job applicants who admit to liking motorcycles? He therefore calls for extreme caution in extending the use of Big Data from studying large groups of people to making decisions about individuals. On the whole, however, the author is an optimist. As a result of improvements in information technology, he writes, humans will “be able to learn a lot more” about themselves “in a lot less time”.

This article appeared in the Culture section of the print edition under the headline "Truth, all the truth—and statistics"