Nate Silver, Probabilistic Celebrity

Image courtesy of Randy Stewart via Wikimedia Commons

Tuesday night was quite spectacular for Nate Silver. I’m referring not to my suspicion that the oft-bespectacled, apparently liberal-leaning statistician was pleased with the outcome of the presidential race – though there was that – but rather to the mighty boost his career received as his highest-likelihood prediction of the election results proved spot on.

Silver’s predictions were based on what might be termed a “sophisticated form of poll aggregation.” His calculations incorporated the results of many polls, which were scrubbed and weighted based on factors such as each polling firm’s historical accuracy, and his probability forecasts were derived from repeated election simulations (biostatistician Bob O’Hara speculated on some of the details). Note that Silver wasn’t the only one who applied statistics to polling data with impressive results; for example, check out the Princeton Election Consortium’s forecast.
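To make the “repeated election simulations” idea concrete, here is a minimal sketch of one way such a simulation can work – not Silver’s actual model, whose exact details are not public. The state probabilities, electoral-vote counts, and the assumption that states are independent are all illustrative placeholders.

```python
import random

# Hypothetical win probabilities for one candidate and electoral votes per
# state: placeholder numbers, not FiveThirtyEight's actual inputs.
states = {
    "Ohio": (0.80, 18),
    "Florida": (0.50, 29),
    "Virginia": (0.65, 13),
    # ... the remaining states and D.C. would be listed here ...
}
SAFE_VOTES = 250  # electoral votes assumed already safe for the candidate

def simulate_election():
    """Flip a weighted coin in each state and total the electoral votes."""
    votes = SAFE_VOTES
    for win_prob, electoral_votes in states.values():
        if random.random() < win_prob:
            votes += electoral_votes
    return votes

n_sims = 100_000
wins = sum(simulate_election() >= 270 for _ in range(n_sims))
print(f"Estimated probability of victory: {wins / n_sims:.1%}")
```

A real forecast would, among other things, model correlated polling errors across states rather than treating each state as an independent coin flip.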

In the end, while many prominent pundits declared the presidential race a dead heat or gave Romney the edge, the non-pundit’s New York Times blog served as a Silver lining for Democrats. Designer Michael Cosentino of gdgt.com published impressive-looking side-by-side maps of Silver’s predictions and the actual results once most contests had been called by the major networks:

For correctly predicting all fifty states, Nate Silver was proclaimed the winner of the election (or at least a Silver medalist), a “poster child” of data in politics, “king of the quants,” the “patron saint” of big data, and even “a near-god for nervous Democrats.” Sales of his book, The Signal and the Noise, experienced a healthy spike. Even before the election, Twitter had been exploding with love for Nate, resulting in, among many other things, a surge of activity in the Chuck Norris-esque hashtag #NateSilverFacts and the parody accounts @fivethirtygreat, @fivethirtynate, and @drunknatesilver. Multiple domains have been registered in his honor, including NateSilverFacts.com and IsNateSilverAWitch.com, whose homepage reads:

Blind hate (for Nate)

Not all of the press has been as positive as labeling him a magical being. But most of the vitriol was unsubstantiated and suffered from a poor grasp of statistics and reason. New York Times columnist David Brooks said that “pollsters tell us what’s happening now. When they start projecting, they’re getting into silly land” and wrote that “even experts with fancy computer models are terrible at predicting human behavior” (to which technosociologist Zeynep Tufekci had a nice retort: “But experts with fancy computer models are good at predicting many things in the aggregate”). MSNBC’s Joe Scarborough referred to Nate Silver as an “ideologue” and a “joke.” On National Review Online, Josh Jordan claimed without substantiation that Silver’s partisanship “shows in the way he forecasts the election” (note that the article Jordan cites to demonstrate that Silver was “openly rooting” for Obama in 2012 talks only of the 2008 election). Slate’s Daniel Engber completely missed the point when he wrote that “Nate Silver didn’t nail it; the pollsters did.”

One of the worst offenders was Politico’s Dylan Byers, who wrote:

Prediction is the name of Silver’s game, the basis for his celebrity. So should Mitt Romney win on Nov. 6, it’s difficult to see how people can continue to put faith in the predictions of someone who has never given that candidate anything higher than a 41 percent chance of winning (way back on June 2) and – one week from the election – gives him a one-in-four chance, even as the polls have him almost neck-and-neck with the incumbent.

Ezra Klein has a nice explanation of one of the many flaws in Byers’s logic:

If Mitt Romney wins on election day, it doesn’t mean Silver’s model was wrong. After all, the model has been fluctuating between giving Romney a 25 percent and 40 percent chance of winning the election. That’s a pretty good chance! If you told me I had a 35 percent chance of winning a million dollars tomorrow, I’d be excited. And if I won the money, I wouldn’t turn around and tell you your information was wrong. I’d still have no evidence I’d ever had anything more than a 35 percent chance.

MIT Knight Science Journalism Tracker’s media critic Paul Raeburn had a similar reaction:

Let’s compare Silver’s work to a weather forecast. As of Nov. 4, Silver gives Obama an 86.3 percent chance of winning the election. If a meteorologist said there was an 86 percent chance of rain – and it didn’t rain – Byers would presumably “not continue to put faith in the predictions” of the weather forecaster. But we know that’s not right. Forecasts are generally correct – but not always. That does not make them worthless. When there is an 86 percent chance of rain, most of us grab an umbrella. And we should.

Liberal satire blog Wonkette also posted a direct reaction to Byers’s troubling piece:

we would like to urge everyone to go read this Politico piece again about how dumb and wrong Nate is and how math and numbers are ruining political punditry forever, and then laugh and laugh at how upset people were by the concept that you could tell how an election might turn out by asking people in advance how they’ll vote and then figuring a way to accurately assess the answers they give.

Silver simply interpreted, critically and scientifically, people’s answers to the question of how they plan to vote. Sounds like a prime example of turning information into knowledge to me. Data science for the win.

This shouldn’t have been upsetting to people. Tufekci eloquently sums up why:

We rely on statistical models for many decisions every single day, including, crucially: weather, medicine, and pretty much any complex system in which there’s an element of uncertainty to the outcome. In fact, these are the same methods by which scientists could tell Hurricane Sandy was about to hit the United States many days in advance. Dismissing predictive methods is not only incorrect; in the case of electoral politics, it’s politically harmful.

She also has a message for the haters:

Refusing to run statistical models simply because they produce probability distributions rather than absolute certainty is irresponsible…. We should defend statistical models because confusing uncertainty and variance with “oh, we don’t know anything, it could go any which way” does disservice to important discussions we should be having on many topics – not just on politics.

Blind love

All of this is not to say that the articles that came out in favor of Silver were flawless. Quite the opposite; quantitative exceptionalism abounded on both sides of the debate. His predictions were perceived as exceptional – exceptionally good by some, exceptionally bad by others – because they came from science, which has a habit of making people’s brains turn off. As Paul Raeburn wrote, “Nate Silver’s rational approach to politics seems to provoke highly irrational responses.” Many voices on both sides of the debate conflated likelihood of winning with victory margin, for example. Silver predicted a high likelihood of Obama winning by a fairly slim margin, not a landslide victory.

Science making people's brains turn off (image courtesy of fancylad)

One critical fact that is largely absent from the conversation is that it was partially luck that Silver’s predictions were spot on for all fifty states. If we assume that the state-by-state results as predicted in Silver’s November 6 presidential forecast were independent events, he had a 12.5% chance of getting all fifty correct (the product of the probabilities he assigned to the favored candidate in each state). Granted, this represents much better odds than the approximately one-in-a-quadrillion (1 in 1,000,000,000,000,000) chance of guessing all fifty correct by flipping a coin (50% raised to the 50th power), but Silver’s perfect sweep can’t be attributed to his skill alone.
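The arithmetic is simple enough to check in a few lines. The coin-flip baseline is exact; the forecast calculation just multiplies the per-state probabilities together, for which the list below uses placeholder values rather than the actual November 6 numbers.

```python
import math

# Baseline: calling fifty states by flipping a fair coin
coin_flip = 0.5 ** 50
print(f"Coin flipping: 1 in {1 / coin_flip:,.0f}")  # about 1 in 1.1 quadrillion

# With a forecast (assuming independence): multiply the probability assigned
# to the favored candidate in each state. Placeholder values shown here; the
# real list would have fifty entries from the November 6 forecast.
favored_probs = [0.999, 0.97, 0.92, 0.80, 0.509]
print(f"Chance of a perfect sweep: {math.prod(favored_probs):.1%}")
```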

In effect, being celebrated for this accomplishment makes Silver a probabilistic celebrity. He seems like a pretty rational guy, so I’m guessing he’s more than mildly amused by how much credit he’s gotten for his probabilistic predictions ending up correct. Bowing down to him for this reason is at its core a quantitative exceptionalist act.

Silver’s biggest victory might have been making the case, by example, for moving politics more toward science and away from gut-instinct-based punditry. Pundits are only human, and humans are known for being biased and otherwise poorly calibrated estimators (e.g. see Douglas Hubbard’s How to Measure Anything); Silver himself characterizes subjective pundit predictions as “literally no better than flipping a coin.” Slate decried the poor accuracy of pundits’ predictions in this election, and several media outlets have questioned whether punditry is dead as a result of Silver’s superior methods. But keep in mind that prediction is hard; for some people, even predicting the past can be tricky. At the very least, some good humor has come out of the discussion, such as the election drinking game proposed by Brian McFadden of The New York Times: “Drink until you’re as dumb as a pundit.”

Statisticians would most likely have beaten pundits at predicting a domestic feline’s surprising third-place finish in Virginia’s Senate race had they been looking for it, but this is not to say that qualitative analysis lacks value. To the contrary, designing and interpreting polls requires it. It is the marriage of qualitative and quantitative analysis that makes statisticians like Silver so successful. Critical inquiry is the key ingredient. Blindly following anyone – either pundits or statisticians – can get you into trouble. In fact, many pundits get themselves into trouble for their own lack of skepticism. A Forbes.com article titled “Nate Silver’s Prediction Was Awesome – But Don’t Build Statues To ‘Algorithmic Overlords’ Just Yet” includes the fair and balanced passage:

Discrediting expert predictions [pundits] seems like real progress – but not if we believe the enduring lesson is to replace one group of fortune tellers with another.  We should certainly strive to use big data and rigorous math whenever we can – but let’s be careful not to fall into the trap of letting down our guard, and trusting all experts who come bearing algorithms.

This isn’t about picking the right side. It’s about being rational.

Eat chocolate -> get wicked smaht?

From Messerli, Franz H., “Chocolate Consumption, Cognitive Function, and Nobel Laureates,” New England Journal of Medicine 2012; 367:1562-1564 (October 18, 2012)

This post first appeared on Bittersweet Notes.

I managed to score last week’s issue of absurdist scientific humor publication The New England Journal of Medicine, which includes a hilarious note on “Chocolate Consumption, Cognitive Function, and Nobel Laureates.” As I continued reading the issue and failed to see the humor in such knee-slappers as “Fibulin-3 as a Blood and Effusion Biomarker for Pleural Mesothelioma” and “Evaluation and Initial Treatment of Supraventricular Tachycardia,” I quickly came to the realization that NEJM is not intended as a satirical magazine. It is, in fact, among the world’s most prestigious peer-reviewed medical journals.

Inspired by recent findings that compounds in chocolate improve cognitive function, cardiologist Franz Messerli’s note questions whether there is “a correlation between a country’s level of chocolate consumption and its population’s cognitive function.” Using the number of Nobel laureates per capita as a “surrogate end point” for a population’s percentage of wicked smahties, the study finds a “surprisingly powerful correlation between chocolate intake and the number of Nobel laureates in various countries” (23 in all). While he concedes that correlation does not imply causation, Messerli writes that “since chocolate consumption has been documented to improve cognitive function, it seems most likely that in a dose-dependent way, chocolate intake provides the abundant fertile ground needed for the sprouting of Nobel laureates.”

Hilarity.

Reportedly, when contacted by the Associated Press, “Sven Lidin, the chairman of the Nobel chemistry prize committee, had not seen the study but was giggling so much when told of it that he could barely comment.”

Indeed, one doesn’t require a doctorate in statistics to find serious flaws in the study. It was clearly intended as tongue-in-cheek to some degree by Messerli (who, according to NPR, has published around 800 peer-reviewed papers) and NEJM (which, also according to NPR, has a history of occasional tomfoolery), though to what degree I can’t quite ascertain. Scientists’ riotous senses of humor aside, I would have expected the dozens of more subtly troubling logical leaps to be followed by winky faces.

Given the absence of sufficient semicolon-close-parentheses, I worry about the misinformation generated by this study. The media has run wild with it in the past week, citing it widely with often far too little skepticism – an excellent example of a phenomenon I’ve recently started calling quantitative exceptionalism. A comment cardiologist Sanjay Kaul provided to CardioBrief sums up the dangers well: “This article highlights, with a touch of whimsy, caveats that challenge the interpretation of findings of observational studies. From the use of surrogate endpoints (based on biological plausibility and the results of preclinical studies) to the distinction between correlation and causation, confounding (whether the effect size is too large to be explained away by confounding), and the hypothesis-generating nature of the inferential process. Careful consideration of these issues is likely to help navigate through the labyrinth of misinformation and disinformation these types of studies are particularly prone to generating.”

Messerli is no stranger to the harmful effects scientific misinformation can have. Last year, he was quoted in a Wall Street Journal article about mistakes in scientific studies as one of a large number of doctors who (understandably) fell prey to an erroneous paper in the Lancet, another highly respected medical journal. Hundreds of thousands of patients were affected, and Messerli argued that the Lancet had a “moral obligation” to withdraw the paper. Granted, doctors around the world aren’t likely to begin writing prescriptions for dangerously high doses of chocolate based on Messerli’s note in NEJM any time soon, but the difference is one of magnitude rather than direction.

A few examples of things I found more troubling slash hilarious about Messerli’s note:

  • The use of the number of Nobel laureates as a surrogate endpoint for cognitive function is…how do I say it?…strange. In fact, the number of Nobel laureates probably has a lot more to do with a country’s wealth. As Nobel laureate Eric Cornell told Reuters, “National chocolate consumption is correlated with a country’s wealth and high-quality research is correlated with a country’s wealth…therefore chocolate is going to be correlated with high-quality research, but there is no causal connection there.” (A toy simulation of exactly this kind of confounding appears after this list.)
  • Messerli writes: “Obviously, these findings are hypothesis-generating only and will have to be tested in a prospective, randomized trial.” Considering that countries in the study have at most a few Nobel laureates per million population, imagine the enormous expense, financial and otherwise, of such a trial. A properly controlled study would deprive millions of the joys of chocolate.
  • While the note warns in multiple places that causation has not been proven, its language repeatedly justifies causation based on tenuous logic. For example, Messerli writes that “it would take about 0.4 kg of chocolate per capita per year to increase the number of Nobel laureates in a given country by 1” and even refers to a “minimally effective chocolate dose.” He justifies such remarks only with references to prior studies linking cacao consumption and cognitive function, which are many leaps-of-faith removed from these conclusions.
  • Messerli asserts, without justification: “it is difficult to identify a plausible common denominator that could possibly drive both chocolate consumption and the number of Nobel laureates over many years. Differences in socioeconomic status from country to country and geographic and climatic factors may play some role, but they fall short of fully explaining the close correlation observed.”
  • The study appears to use chocolate rather than flavanol or cacao consumption figures, and the types of chocolate consumed in the studied countries vary significantly. Another gem from Cornell’s interview in Reuters: “It’s one thing if you want like a medicine or chemistry Nobel Prize, ok, but if you want a physics Nobel Prize it pretty much has got to be dark chocolate.” I wonder how considering less economically correlated forms of flavanols like green tea would change the results.
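To illustrate Cornell’s point about confounding, here is a small simulation with entirely made-up numbers in which national wealth drives both chocolate consumption and Nobel output. The two end up strongly correlated even though neither causes the other.

```python
import random

random.seed(0)

def simulate_country():
    """Wealth independently drives both chocolate consumption and laureates."""
    wealth = random.uniform(0, 1)                 # hypothetical prosperity index
    chocolate = 10 * wealth + random.gauss(0, 1)  # kg per capita per year
    laureates = 30 * wealth + random.gauss(0, 3)  # laureates per 10 million people
    return chocolate, laureates

chocolate, laureates = zip(*(simulate_country() for _ in range(30)))

def pearson(x, y):
    """Plain Pearson correlation coefficient, no external libraries needed."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

print(f"r = {pearson(chocolate, laureates):.2f}")  # strongly positive, no causal link
```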

Is Messerli deserving of an Ig Nobel Prize for this gem? According to the Annals of Improbable Research, which awards the prizes annually: “Every Ig Nobel Prize winner has done something that first makes people LAUGH, then makes them THINK.”

Regardless, I’m left wondering what foods predispose you to becoming an Ig Nobel laureate. Foods that leave a funny taste in your mouth? Personally, I’m going to stick with salad.

Quantitative exceptionalism

Image courtesy of Wikimedia Commons

At its heart, the field of statistics deals with determining what inferences can be drawn from data. Causality, bias, significance, and experimental reproducibility are its lifeblood, and one doesn’t have to wander too many pages into a standard introductory statistics textbook before encountering these issues.

Most readers of this blog will not have too much trouble coming up with examples of real world situations in which the improper application of statistics can result in spurious conclusions. As a very simple example, if the average height of a population is estimated on the basis of a survey, and younger people (who tend to be shorter) have a lower response rate, the result may overestimate average height.
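A toy simulation of that survey, with invented numbers, makes the bias visible: younger people are shorter and respond less often, so the average over respondents drifts above the true population average.

```python
import random

random.seed(1)

population = []
for _ in range(100_000):
    young = random.random() < 0.5
    height = random.gauss(165 if young else 175, 7)        # hypothetical heights, cm
    responds = random.random() < (0.3 if young else 0.7)   # young people respond less
    population.append((height, responds))

true_mean = sum(h for h, _ in population) / len(population)
respondents = [h for h, r in population if r]
survey_mean = sum(respondents) / len(respondents)

print(f"True average height:   {true_mean:.1f} cm")
print(f"Survey-based estimate: {survey_mean:.1f} cm")  # biased upward
```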

There is a substantial pop-culture literature on such examples and how to avoid them (for instance, check out Darrell Huff’s classic How to Lie with Statistics or Joel Best’s Damned Lies and Statistics series). This phenomenon goes beyond statistics to all situations involving quantitative information or reasoning. Numbers and equations are apparently intoxicating to the uninitiated, like the narcotic lotus flowers of Greek mythology that reduced Odysseus’ crew to a state of peaceful apathy and nearly caused them to lose their way.

Too often, the curiosity and skepticism demonstrated by otherwise intelligent humans comes to a grinding halt when numbers are involved.

“Quantitative exceptionalism” is the widespread and often harmful belief that insights reached via quantitative means form an exceptional class. The term carries both positive and negative connotations. Quantitative arguments are often assumed a priori to be of high quality, perhaps due to their relative inaccessibility, and those who employ them to be erudite. Yet humans are by nature fallible, what we do with numbers is subject to human error, and people all too often trust quantitative arguments blindly. Sometimes the errors are subtle; other times, not so much. The result is lower standards for scientific and mathematical rigor, with immense downstream impact.

Quantitative exceptionalism is widespread in academic, business, political, and popular discourse. In many scholarly disciplines, numerical data and quantitative arguments are given less scrutiny than their qualitative counterparts. In business and government, decisions are made on the basis of questionable calculations at an ever-accelerating rate, fueled by a big data revolution that is a lot heavier on technology than it is on basic science. Educators in STEM fields could do more to encourage interrogation. Journalists using numbers and infographics could inquire more critically. Politicians…don’t get me started.

So don’t judge a number by its cover. Be curious. Be a skeptic. Avoid the lotus.

(For curious readers, the coinage “quantitative exceptionalism” is inspired by the terminology “American exceptionalism” and MIT linguist Michel DeGraff’s “Creole exceptionalism” [pdf].)

We’ll be talking a lot more about quantitative exceptionalism on this blog. In the meantime, share your thoughts or examples you’ve witnessed in the comments.

What do unstructured data and Santa Claus have in common?

Santa Claus as a huge balloon

Image courtesy of Bart Fields

There is no such thing as unstructured data. There, I said it.

Structure is inherent in the definition of data. No structure means no information means no data. Like “clearly misunderstood,” “unstructured data” is an oxymoron.

Some have proposed “semi-structured data” to overcome this logical issue, but this alternative is no less discriminatory. Whatever part of the data lacks structure has no informational content and is thus not data. Contradiction persists.

“Big data” generally refers to data whose large size, structure, rate of change, or complexity makes it difficult to work with in some way. “Unstructured data” refers to the subset of big data for which structure is part of the problem. These phrases are subjective and point to shortcomings in tools or users rather than anything inherent in the data itself. They are highly context dependent and, as such, are often misused and misunderstood.

Which causes miscommunication. Unstructured communication, if you will.

Please regard this as a cease and desist letter for the use of the phrase “unstructured data” unless in a defamatory or humorous context (warning: it’s not that funny, either). We have plenty of words to describe specific instances of The Data Formally Known as Unstructured: “text,” “non-relational,” “the Web.”

If you absolutely need a term, perhaps “differently structured data” would be somewhat more palatable to my fellow pedants. Another possibility, “multi-structured data,” is beginning to gain some momentum. Just keep the context crystal clear and we won’t come looking for you.

P.S. Apologies to any children whose Yuletide dreams were just crushed.

Quantrepreneurs

Image courtesy of Argonne National Laboratory

Google will learn just a tiny bit more about me (and you, the reader) from this post, enabling the search engine giant to (probabilistically) increase its bottom line through better targeted advertisements. You’re welcome, Google.

A hugely transformative data revolution is upon us. Machine-readable information is being generated and captured at an astounding and rapidly accelerating rate. At the same time, a ballooning army of alchemists and applications attempts to transform it into value of various forms. Which results in even more data. Kaboom.

On the data generation side are technologies that digitize and enable the creation of new digital information: the vast human-generated data factories of the internet, increasingly sophisticated devices in our homes and handbags, and sensors galore in places most people don’t imagine.

On the value generation side are data scientists (also in places most people don’t imagine), enthusiasts, knowledge workers, and an ever expanding array of hardware and software.

At the center of the data revolution are quantrepreneurs, who innovate new ways to generate, capture, transform, and wring value from data, linking the two sides and propelling the revolution forward, evolving the information age into the age of actionable insight. This is their story.


Telling the full story of this revolution necessitates touching on many topics: data (of course), science, technology, engineering, math, business, innovation, entrepreneurship, privacy/transparency and the law, design, storytelling, statistics, and more. Content will be wide-ranging and will include case studies, opinions, thought experiments, predictions, rants, musings, and practical advice.

Non-specialists are welcome. In fact, this blog is really for you. Despite the recent explosion of online chatter about the data revolution, many sources still make it seem like magic. It’s not magic. Data Bitten aims to provide a voice that is significantly underrepresented in the conversation by taking a first-principles approach to the data revolution. We’ll demystify and debunk. We’ll be skeptical. We’ll expose the magic for what it is: good science, iterative engineering, or simply a smart idea. The goal is to make this world accessible to a wider audience and, through a focus on the practical, to enable and encourage greater understanding and participation.

Are you ready to get bitten by the data bug?
