Breaking news! A recent study found that Barack Obama is, with high probability, not an American citizen! The study — destined to revive the controversy that emerged during the President’s first presidential campaign — is based on new evidence and a simple analysis using widely accepted statistical inference tools. I’ll leave it to the political pundits to analyze the grave effects that this shocking finding surely will have on the upcoming presidential campaign. This post focuses on the elegant technical machinery used to reach the unsettling conclusion.

The crux of the analysis applies, in a statistical setting, modus tollens, a basic inference rule of logic. Given two facts $X$ and $Y$ such that if $X$ is true then $Y$ is true, modus tollens derives the falsehood of $X$ from the falsehood of $Y$. In formal notation:

$\begin{matrix} X \Longrightarrow Y, \quad \neg Y \\ \hline \neg X \end{matrix}$

For example, take $X$ to be “It rains” and $Y$ to be “I have an umbrella with me”. From the fact that I am carrying no umbrella, by applying modus tollens, you can conclude that it’s not raining.

The next step introduces a simple generalization of modus tollens to the case where facts are true with some probability: if $X$ is true then $Y$ is true with high probability. Then, when $Y$ happens to be false, we conclude that $X$ is unlikely to be true. If I have an umbrella with me 99% of the time when it rains, there’s only a 1% chance that it rains if I have no umbrella with me.

All this is plain and simple, but it has a surprising consequence when applied to the presidential case. A randomly sampled American citizen is quite unlikely to be the President; the odds are just 1 in 321-something million. So we have that if “a person $p$ is American” (or $X$) is true then “$p$ is not the President” (or $Y$) is true with high probability. But Mr. Barack Obama happens to be the President, so he’s overwhelmingly unlikely to be American according to probabilistic modus tollens!

(The ironic part of the post ends here.)

Surely you’re thinking that this was a poor attempt at a joke. I would agree, were it not the case that the very same unsound inference rule is being applied willy-nilly in countless scientific papers in the form of statistical hypothesis testing. The basic statistical machinery, which I’ve discussed in a previous post, tells us that, under a null hypothesis $H_0$, certain data $D$ is unlikely to be observed. In other words: if “the null hypothesis $H_0$” is true then “the data is different from $D$” is true with high probability. So far so good. But then the way this fact is used in practice is the following: if we observe the unlikely $D$ in our experiments, we conclude that the null hypothesis is unlikely, and hence we reject it. Unsoundly! How’s that for a joke?

Having seen for ourselves that modus tollens does not generalize to probabilistic inference, what is a correct way to infer from observed data to the hypothesis under test? We can use Bayes’s theorem and phrase it in terms of conditional probabilities. $P(X \mid Y)$ is the probability that $X$ occurs given that $Y$ has occurred. Then $P(H_0 \mid D)$, the probability that the null hypothesis $H_0$ is true given that we observed data $D$, is computed as $P(D \mid H_0) \cdot P(H_0) / P(D)$. Even if we know that $D$ is unlikely under the null hypothesis, that is, $P(D \mid H_0)$ is small, we cannot dismiss the null hypothesis with confidence unless we know something about the absolute prior probabilities of $H_0$ and $D$. To convince ourselves that Bayes’s rule leads to sound inference, we can apply it to the Barack Obama case: $H_0$ is “a person $p$ is American” and $D$ is “$p$ is the President”. We plug the numbers in and do the simple math to see that $P(H_0 \mid D)$, the probability that the President is American, is indeed one:

$(1 / A) \cdot (A / W) / (1 / W) = 1$, where $A$ is the population of the USA and $W$ is the world population. Bayes 1 – birthers 0.
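The same arithmetic can be checked mechanically. The snippet below is only a sketch: the population figures are round illustrative numbers (assumptions, not census data), and `posterior` is a helper name introduced here for the example. Exact fractions are used so that the cancellation is visible rather than hidden behind floating-point rounding.

```python
from fractions import Fraction


def posterior(likelihood, prior, evidence):
    """Bayes's theorem: P(H | D) = P(D | H) * P(H) / P(D)."""
    return likelihood * prior / evidence


# Illustrative round figures (assumptions, not census data).
A = 321_000_000      # population of the USA
W = 7_300_000_000    # world population

p_d_given_h0 = Fraction(1, A)  # P(D | H0): a random American is the President
p_h0 = Fraction(A, W)          # P(H0): a random person is American
p_d = Fraction(1, W)           # P(D): a random person is the President

# "Probabilistic modus tollens" looks only at P(D | H0), which is tiny.
print(float(p_d_given_h0))

# Bayes's theorem also weighs the priors, and the tiny factors cancel out.
p_h0_given_d = posterior(p_d_given_h0, p_h0, p_d)
print(p_h0_given_d)  # 1
```

Whatever values are chosen for `A` and `W`, the result stays at one, because the same small quantities appear in both the numerator and the denominator.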

Now you understand the fuss about statistical hypothesis testing that has emerged in numerous experimental sciences. Sadly, this blunder is not merely a possibility; it is quite likely that it has affected the validity of numerous published experimental “findings”. In fact, the inadequacy of statistical hypothesis testing is compounded by other statistical results such as the arbitrariness of a hard-and-fast confidence threshold, the false hypothesis paradox (when studying a rare phenomenon, that is, a phenomenon with a low base rate, most positive results are false positives), and self-selection (the few research efforts that detect some rare phenomenon will publish, whereas the overwhelming majority of “no effect” studies will not lead to publication). In an era of big data, these behaviors are only becoming more likely to emerge.
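The base-rate effect mentioned above is easy to quantify. The following sketch uses made-up numbers (a 1% base rate, 80% power, and a 0.05 threshold are assumptions chosen for illustration) and a hypothetical helper, `false_positive_share`, to compute which fraction of "significant" results come from effects that do not exist.

```python
def false_positive_share(base_rate, power, alpha):
    """Fraction of significant results that are false positives.

    base_rate: prior probability that a tested effect is real
    power:     P(significant | effect is real)
    alpha:     P(significant | no effect), the significance threshold
    """
    true_pos = base_rate * power
    false_pos = (1 - base_rate) * alpha
    return false_pos / (true_pos + false_pos)


# Hypothetical scenario: 1 in 100 tested effects is real.
share = false_positive_share(base_rate=0.01, power=0.8, alpha=0.05)
print(f"{share:.1%}")  # 86.1%
```

Under these assumed numbers, roughly six out of seven positive results are spurious, even though every single test was run "correctly" at the 0.05 level.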

The take-home message is simple yet important. Statistical hypothesis testing is insufficient, by itself, to derive sound conclusions about empirical observations. It must be complemented by other analysis techniques, such as data visualization, effect sizes, confidence intervals, and Bayesian analysis. Unless you remain convinced that Obama’s not American, Elvis is alive, and the Apollo moon landings were staged. In that case, this blog is not for you, with high probability.

How do you determine, empirically, if a theory is correct?

A standard approach uses the toolset of statistical hypothesis testing, whose modern formulation was introduced by influential figures such as Fisher and Pearson around a hundred years ago. The first step consists in expressing the absence of an effect, the one we’re investigating empirically, as a null hypothesis. Then, a mathematical toolset exists that assesses the likelihood that an observed sample (the empirical data) is drawn under the assumption that the null hypothesis holds. Likelihood is normally given as a $p$-value: the probability of observing data at least as extreme as the sample, assuming the null hypothesis. A small $p$-value (say, $p < 0.01$) means that the observation would be very unlikely to occur under the null hypothesis. This is compelling evidence against the hypothesis, which can be rejected with confidence: evidence suggests the effect is real.

To make the discussion concrete, let me introduce an example taken from my own research. Martin Nordio conceived the idea of assessing the impact of using agile development processes (such as Scrum) on the success of software projects carried out by distributed teams. With Christian Estler, he set up a broad data collection effort where they queried information about more than sixty software projects worldwide. The overall goal of the study was to collect empirical evidence that following agile processes, as opposed to more traditional development processes, does make a difference. To this end, the null hypothesis was roughly:

There is no difference in $X$ for projects developed following agile processes compared to projects developed using other processes.
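To give a feel for how a $p$-value for a "no difference" hypothesis can be obtained, here is a minimal sketch of a permutation test on the difference of means. It is not the analysis used in the study: the test choice, the function name `permutation_p_value`, and the ratings are all assumptions made up for illustration.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, rounds=10_000, seed=0):
    """Two-sided permutation test on the difference of means.

    Estimates how often a random relabeling of the pooled data yields a
    difference at least as large as the observed one.
    """
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(rounds):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    return hits / rounds


# Made-up success ratings on a 1-10 scale; NOT data from the study.
agile = [7, 8, 6, 9, 7, 8, 5, 7]
other = [7, 6, 8, 7, 9, 6, 7, 9]
p = permutation_p_value(agile, other)
print(f"p = {p:.2f}")
```

With two groups this similar, random relabelings produce a difference as large as the observed one most of the time, so the estimated $p$-value is large and the null hypothesis cannot be rejected.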

We instantiated $X$ for the several aspects of project outcome gathered in the data collection, including "overall success", team "motivation" and "amount of communication", and "importance" for customers. This gave us not one but six null hypotheses (extended to sixteen in the follow-up version of the work) to be rejected by the data.

Much to our dismay, when we sat down to analyze the data we found no evidence whatsoever to disprove any of the hypotheses. The $p$-values were in the range $0.25..0.97$ — even greater than $0.75$ for four of the hypotheses! Was the collected data any good?

More generally, what can one conclude from lack of evidence against the null hypothesis, to wit large $p$-values? In the orthodox Fisherian interpretation: nothing except that the available data is inconclusive. In Fisher's own words [Fisher, 1935, emphasis added]:

the null hypothesis is never proved or established [...] every experiment [...] exist[s] only in order to give the facts a chance of disproving the null hypothesis

Such an approach, where facts are assessed only by disproving hypotheses, resonates with Popper's notion of falsifiability, where only theories that can be conclusively rejected by experiments are considered scientific. The closeness of dates when Fisher's and Popper's works became popular is an inkling that the similarity may not be entirely coincidental.

And yet, the reality of science is a bit more nuanced than that. Regarding falsifiability, it is generally false (ahem 🙂 ) that empirical falsifiability is a sine qua non for a theory to be scientific. As Sean Carroll nicely explains, the most theoretical branches of physics routinely deal with models that are not testable (at least as of yet) but nonetheless fit squarely into the practice of science as structured interpretation of data. Science is as scientist does.

But even for theories that are well within the canon of testability and falsifiability, there are situations where the lack of evidence against a hypothesis is indicative of the hypothesis being valid. To find these, we should look beyond individual experiments, to the bigger picture of accumulating evidence. As experiments searching for an effect are improved and repeated on larger samples, if the effect is genuinely present it should manifest itself more strongly as the experimental methodology sharpens. In contrast, if significance keeps waning, it may well be a sign that the effect is absent. In such contexts, persistent lack of significance may be even more informative than its occasional presence. Feynman made similar observations about what he called "cargo cult science" [Feynman, 1985], where effects becoming smaller as experimental techniques improve is a sure sign of the effects being immaterial.
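The argument about growing samples can be illustrated with a small power calculation. The sketch below assumes a deliberately simple setting, a two-sided two-sample z-test with known and equal variance, which is a simplification chosen for illustration; `detection_probability` and the effect size of 0.3 are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def detection_probability(effect_size, n_per_group, alpha=0.05):
    """Chance that a two-sided two-sample z-test reports significance.

    effect_size is the true difference of means in units of the (assumed
    known and equal) standard deviation; 0 means the null hypothesis holds.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = effect_size * sqrt(n_per_group / 2)
    upper_tail = 1 - std_normal.cdf(z_crit - shift)
    lower_tail = std_normal.cdf(-z_crit - shift)
    return upper_tail + lower_tail


for n in (10, 40, 160, 640):
    real = detection_probability(0.3, n)    # a modest but genuine effect
    absent = detection_probability(0.0, n)  # no effect at all
    print(f"n={n:4d}  real effect: {real:.2f}  no effect: {absent:.2f}")
```

In this idealized model, a genuine effect is detected more and more reliably as the sample grows, whereas in the absence of an effect the chance of a significant result stays pinned at the threshold `alpha`. Significance that keeps failing to appear in ever larger studies is therefore hard to reconcile with a real effect of appreciable size.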

We drew conclusions along these lines in the study of agile processes. Ours was not the first empirical analysis of the effect of agile practices. Overall, the other studies provided mixed, sometimes conflicting, evidence in favor of agile processes. Our study stood out in terms of size of the samples (most other studies target only a handful of projects) and variety of tested hypotheses. In this context, the fact that the data completely failed to disprove any of the hypotheses was telling us more than just "try again". It suggested that agile processes have no definite absolute edge over traditional processes; they can, and often do, work very well, but so can traditional processes if applied competently.

These results, presented at ICGSE 2012 in Brazil and later in an extended article in Empirical Software Engineering, stirred some controversy. Martin Nordio's presentation, where he posited that Maradona was a better footballer than Pelé to an audience with a majority of Brazilian people, may have contributed some to the stirring. Enthusiasts of agile processes disputed our conclusions, reporting direct experiences with successful projects that followed agile processes. Our study should not, however, be interpreted as a dismissal of agile processes. Instead, it suggests that agile and other processes can be equally effective for distributed software development. But neither is likely to be a silver bullet whose mere application, independent of the characteristics and skills of teams, guarantees success. Overall, the paper was generally well received and it went on to win the best paper award of ICGSE 2012.

There are two bottom lines to this post's story. First: ironically, statistics seems to suggest that Martin might have been wrong about Maradona. Second: when your $p$-values are too large, don't throw them away just yet! They may well be indirectly contributing to sustaining scientific evidence.

#### References

1. Ronald A. Fisher: The design of experiments, 1935. Most recent edition: 1971, Macmillan Pub Co.
2. Richard Feynman, Ralph Leighton, contributor, Edward Hutchings, editor: Surely You're Joking, Mr. Feynman!: Adventures of a Curious Character, 1985. The chapter "Cargo cult science" is available online.