Breaking news! A recent study found that Barack Obama is, with high probability, not an American citizen! The study — destined to revive the controversy that emerged during the President’s first presidential campaign — is based on new evidence and a simple analysis using widely accepted statistical inference tools. I’ll leave it to the political pundits to analyze the grave effects that this shocking finding surely will have on the upcoming presidential campaign. This post focuses on the elegant technical machinery used to reach the unsettling conclusion.

The crux of the analysis applies, in a statistical setting, modus tollens, a basic inference rule of logic. Given two facts X and Y such that if X is true then Y is true, modus tollens derives the falsehood of X from the falsehood of Y. In formal notation:

  \begin{matrix}  X \Longrightarrow Y, \quad \neg Y \\  \hline  \neg X  \end{matrix}

For example, take X to be “It rains” and Y to be “I have an umbrella with me”. From the fact that I am carrying no umbrella, by applying modus tollens, you can conclude that it’s not raining.

The next step introduces a simple generalization of modus tollens to the case where facts are true with some probability: if X is true then Y is true with high probability. Then, when Y happens to be false, we conclude that X is unlikely to be true. If I have an umbrella with me 99% of the times when it rains, there’s only a 1% chance that it rains if I have no umbrella with me.

All this is plain and simple, but it has surprising consequence when applied to the presidential case. A randomly sampled American citizen is quite unlikely to be the President; the odds are just 1 in 321-something millions. So we have that if “a person p is American” (or X) is true then “p is not the President” (or Y) is true with high probability. But Mr. Barack Obama happens to be the President, so he’s overwhelmingly unlikely to be American according to probabilistic modus tollens!

(The ironic part of the post ends here.)

Sure you’re thinking that this was a poor attempt at a joke. I would agree, were it not the case that the very same unsound inference rule is being applied willy-nilly in countless scientific papers in the form of statistical hypothesis testing. The basic statistical machinery, which I’ve discussed in a previous post, tells us that, under a null hypothesis H_0, a certain data D is unlikely to happen. In other words: if “the null hypothesis H_0” is true then “the data is different than D” is true with high probability. So far so good. But then the way this fact is used in practice is the following: if we observe the unlikely D in our experiments, we conclude that the null hypothesis is unlikely, and hence we reject it — unsoundly! How’s that for a joke?

Having seen for ourselves that modus tollens does not generalize to probabilistic inference, what is a correct inference from data to hypothesis testing? We can use Bayes’s theorem and phrase it in terms of conditional probabilities. P(X \mid Y) is the probability that X occurs given that Y has occurred. Then P(H_0 \mid D) — the probability that the null hypothesis H_0 is true given that we observed data D — is computed as P(D \mid H_0) \cdot P(H_0) / P(D). Even if we know that D is unlikely under the null hypothesis — P(D \mid H_0) is small — we cannot dismiss the null hypothesis with confidence unless we know something about the absolute prior probabilities of H_0 and D. To convince ourselves that Bayes’s rule leads to sound inference, we can apply it to the Barack Obama case: H_0 is “a person p is American” and D is “p is the President”. We plug the numbers in and do the simple math to see that P(H_0 \mid D), the probability that the President is American, is indeed one:

(1 / A) \cdot (A / W) / (1 / W) = 1, where A is the population of the USA and W is the world population. Bayes 1 – birthers 0.

Now you understand the fuss about statistical hypothesis testing that has emerged in numerous experimental sciences. Sadly, this blunder is not merely a possibility; it is quite likely that it has affected the validity of numerous published experimental “findings”. In fact, the inadequacy of statistical hypothesis testing is compounded by other statistical results such as the arbitrariness of a hard-and-fast confidence threshold, the false hypothesis paradox (when studying a rare phenomenon, that is a phenomenon with low base rates, most positive results are false positives), and self-selection (the few research efforts that detect some rare phenomenon will publish, whereas the overwhelming majority of “no effect” studies will not lead to publication). In an era of big data, these behaviors are only becoming more likely to emerge.

The take home message is simple yet important. Statistical hypothesis testing is insufficient, by itself, to derive sound conclusions about empirical observations. It must be complemented by other analysis techniques, such as data visualization, effect sizes, confidence intervals, and Bayesian analysis. Unless you remain convinced that Obama’s not American, Elvis is alive, and the Apollo moon landings were staged. In this case, this blog is not for you — with high probability.