How do you determine, empirically, if a theory is correct?

A standard approach uses the toolset of statistical hypothesis testing, whose modern formulation was introduced by influential figures such as Fisher and Pearson around a hundred years ago. The first step consists in expressing the absence of the effect we're investigating as a null hypothesis. A mathematical toolset then assesses how likely it is that the observed sample (the empirical data) would be drawn if the null hypothesis held. This is normally reported as a $p$-value: the probability of observing data at least as extreme as the sample, assuming the null hypothesis. A small $p$-value (say, $p < 0.01$) means that the observation would be very unlikely to occur under the null hypothesis. This is compelling evidence against the null hypothesis, which can be rejected with confidence: evidence suggests the effect is real.

To make the discussion concrete, let me introduce an example taken from my own research. Martin Nordio conceived the idea of assessing the impact of using agile development processes (such as Scrum) on the success of software projects carried out by distributed teams. With Christian Estler, he set up a broad data collection effort where they queried information about more than sixty software projects worldwide. The overall goal of the study was to collect empirical evidence that following agile processes, as opposed to more traditional development processes, does make a difference. To this end, the null hypothesis was roughly:

There is no difference in $X$ for projects developed following agile processes compared to projects developed using other processes.

We instantiated $X$ for the several aspects of project outcome gathered in the data collection, including "overall success", team "motivation" and "amount of communication", and "importance" for customers. This gave us not one but six null hypotheses (extended to sixteen in the follow-up version of the work) to be rejected by the data.
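To give a feel for how a null hypothesis of this form can be tested, here is a minimal sketch of a two-sided permutation test for a difference in means. It is not the analysis used in the study: the function name `permutation_p_value`, the scores, and the choice of test are all assumptions made purely for illustration.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Estimate a two-sided p-value for the null hypothesis that
    both groups come from the same distribution (no difference in X)."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are interchangeable,
        # so reshuffling them simulates data drawn under that hypothesis.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    # The add-one correction keeps the estimate strictly positive.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Hypothetical "overall success" scores on a 1-10 scale (made-up data).
agile = [7, 8, 6, 9, 7, 8, 6, 7]
other = [7, 6, 8, 7, 9, 6, 6, 7]

p = permutation_p_value(agile, other)
print(f"p = {p:.3f}")
```

With two groups whose means are this close, the estimated $p$-value comes out large, which is the situation discussed in the rest of this post.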

Much to our dismay, when we sat down to analyze the data we found no evidence whatsoever to disprove any of the hypotheses. The $p$-values ranged from $0.25$ to $0.97$, and were even greater than $0.75$ for four of the hypotheses! Was the collected data any good?

More generally, what can one conclude from lack of evidence against the null hypothesis, to wit large $p$-values? In the orthodox Fisherian interpretation: nothing except that the available data is inconclusive. In Fisher's own words [Fisher, 1935, emphasis added]:

the null hypothesis is never proved or established [...] every experiment [...] exist[s] only in order to give the facts a chance of disproving the null hypothesis

Such an approach, where facts are assessed only by disproving hypotheses, resonates with Popper's notion of falsifiability, where only theories that can be conclusively rejected by experiments are considered scientific. The closeness of dates when Fisher's and Popper's works became popular is an inkling that the similarity may not be entirely coincidental.

And yet, the reality of science is a bit more nuanced than that. Regarding falsifiability, it is generally false (ahem 🙂) that empirical falsifiability is a sine qua non for a theory to be scientific. As Sean Carroll nicely explains, the most theoretical branches of physics routinely deal with models that are not testable (at least as of yet) but nonetheless fit squarely into the practice of science as structured interpretation of data. Science is as scientist does.

But even for theories that are well within the canon of testability and falsifiability, there are situations where the lack of evidence against a hypothesis is indicative of the hypothesis being valid. To find these, we should look beyond individual experiments, to the bigger picture of accumulating evidence. As experiments searching for an effect are improved and repeated on larger samples, if the effect is genuinely present it should manifest itself more strongly as the experimental methodology sharpens. In contrast, if significance keeps waning, it may well be a sign that the effect is absent. In such contexts, persistent lack of significance may be even more informative than its occasional presence. Feynman made similar observations about what he called "cargo cult science" [Feynman, 1985], where effects becoming smaller as experimental techniques improve is a sure sign of the effects being immaterial.
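The contrast between a genuine effect and an absent one can be illustrated with a small simulation. The sketch below is only an illustration under strong assumptions: normally distributed data with known standard deviation 1, a hypothetical true effect of 0.2, and a simple two-sample z-test. The names `z_test_p_value` and `simulate` are made up for this example.

```python
import random
from statistics import NormalDist, mean


def z_test_p_value(sample_a, sample_b, sigma=1.0):
    """Two-sided p-value for a difference in means, assuming both
    samples are normal with the same known standard deviation sigma."""
    n_a, n_b = len(sample_a), len(sample_b)
    standard_error = sigma * (1 / n_a + 1 / n_b) ** 0.5
    z = (mean(sample_a) - mean(sample_b)) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate(true_effect, sizes, seed=1):
    """Repeat the 'experiment' once per sample size and record its p-value."""
    rng = random.Random(seed)
    results = {}
    for n in sizes:
        treated = [rng.gauss(true_effect, 1.0) for _ in range(n)]
        control = [rng.gauss(0.0, 1.0) for _ in range(n)]
        results[n] = z_test_p_value(treated, control)
    return results


sizes = [20, 200, 2000, 20000]
with_effect = simulate(0.2, sizes)     # hypothetical genuine effect
without_effect = simulate(0.0, sizes)  # the null hypothesis actually holds

for n in sizes:
    print(f"n={n:>6}  effect: p={with_effect[n]:.4f}  "
          f"no effect: p={without_effect[n]:.4f}")
```

When the effect is genuine, the $p$-value at the largest sample size should be vanishingly small. When it is absent, the $p$-values show no tendency to shrink as the sample grows, which is the pattern that makes persistent lack of significance informative.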

We drew conclusions along these lines in the study of agile processes. Ours was not the first empirical analysis of the effect of agile practices. Overall, the other studies provided mixed, sometimes conflicting, evidence in favor of agile processes. Our study stood out in terms of size of the samples (most other studies target only a handful of projects) and variety of tested hypotheses. In this context, the fact that the data completely failed to disprove any of the hypotheses was telling us more than just "try again". It suggested that agile processes have no definite absolute edge over traditional processes; they can, and often do, work very well, but so can traditional processes if applied competently.

These results, presented at ICGSE 2012 in Brazil and later in an extended article in Empirical Software Engineering, stirred some controversy. Martin Nordio's presentation, where he posited that Maradona was a better footballer than Pelé to an audience with a majority of Brazilian people, may have contributed some to the stirring. Enthusiasts of agile processes disputed our conclusions, reporting direct experiences with successful projects that followed agile processes. Our study should not, however, be interpreted as a dismissal of agile processes. Instead, it suggests that agile and other processes can be equally effective for distributed software development. But neither is likely to be a silver bullet whose mere application, independent of the characteristics and skills of teams, guarantees success. Overall, the paper was well received and it went on to win the best paper award of ICGSE 2012.

There are two bottom lines to this post's story. First: ironically, statistics seems to suggest that Martin might have been wrong about Maradona. Second: when your $p$-values are too large, don't throw them away just yet! They may well be indirectly contributing to sustaining scientific evidence.

#### References

1. Ronald A. Fisher: The design of experiments, 1935. Most recent edition: 1971, Macmillan Pub Co.
2. Richard Feynman, Ralph Leighton, contributor, Edward Hutchings, editor: Surely You're Joking, Mr. Feynman!: Adventures of a Curious Character, 1985. The chapter "Cargo cult science" is available online.