“Welcome to the first class of our course on formal verification. I’d like to start with some motivation: why verification is important in the real world, and what the consequences of lack of verification in software development are.” Lights go off. You hear a countdown recited in French: “Trois, deux, un, feu!” On the screen, a rocket blasts off its platform into the sky. Thirty-something seconds into its ascent, it veers off course and quickly disintegrates. “That was the Ariane 5 rocket’s first launch, which failed due to a software bug. The failure resulted in an estimated loss of 300 million Euros. Software errors can even cost human lives, as in the case of the Therac-25 machine for radiation therapy, whose malfunctions resulted in several patients dying from radiation burns. Formal verification can ensure that software is free from defects before it’s deployed, thus avoiding catastrophic disasters such as those just mentioned. Now, here’s the course syllabus…”

OK, I admit to slightly exaggerating the above for dramatic purposes. But, if you ever took a course on formal verification, chances are that it started with some graphic pictures of exploding rockets, unresponsive Mars rovers, or defective medical equipment, followed by stark tallies of the enormous losses, in money or human lives, resulting from the disasters. The bottom line: software quality is generally abysmal, and the systematic, all-out application of formal verification is our only hope of preventing further such disasters in the future.

What’s the problem with this angle, to which I may have also been guilty of paying lip service in the past? I think this way of pitching formal verification points to motivations that are not fairly representative of the varied landscape in the research and practice of formal methods, and risks alienating software development practitioners, who have different concerns in their everyday work — concerns that conflict with the oversimplified picture painted above.

The first problem I see is that pointing to a handful of major disasters due to software failures may be construed as suggesting that formal verification is only justified for mission-critical applications. What about the myriad software projects that don’t cost hundreds of millions of dollars and whose failure does not result in ruinous losses? Is there a place for formal methods within such projects’ modest budgets? Related to these points is the misleading view that formal verification is an all-or-nothing business: either it enters every step of development or you might as well do things as you’ve always done them — but then disaster will impend. The flip side is that the only failures that matter are the catastrophic ones; everything else is no big deal. In reality, everyday software developers generally remain genuinely concerned about software quality even when working on non-mission-critical development; but they also have to cope with all sorts of project constraints pulling in different directions.

A better way to frame the usage of formal verification is in terms of a continuum of trade-offs. Each trade-off balances some of the additional costs incurred by applying verification techniques against some gains resulting from higher reliability or confidence in the software’s correctness. This way, each project can pick the trade-off that best fits its requirements and budget. It’s easy to guess that the overwhelming majority of development projects occupy some intermediate position in the trade-off continuum: their failures won’t result in apocalyptic losses, nor are they of no consequence whatsoever. These are the projects that most practicing developers are familiar with and can relate to. To each their own: Paris Métro Line 14 deployed the B method to build correct-by-construction modules through refinement; for my students learning Java and C#, writing a decent collection of unit tests and running a lint-like tool on their projects is good enough.

The second problem with the above pitching of formal methods is that it may suggest that they are a silver bullet. You want high reliability: just use formal methods (but it will cost you)! It doesn’t matter what your development process is, what programming language you’re using, whether your software is interactive, whether you’re deploying on a legacy system, or how much technical debt you carry. Start using [your favorite verification technique] and your problems will evaporate!

But reliability, and practically every desirable quality property, is never the result of a single technique, tool, design choice, or individual talent. It can only follow from a sensible, well-organized process; one that is transparent, traceable, technologically up to date, based on a clear understanding of the application domain, and so on. The investigations following the failures mentioned above (Ariane 5, Therac-25, etc.) invariably pointed to communication and process shortcomings that were at least as serious as, and often more egregious than, the technical ones. Correctness cannot be added as an afterthought; it requires planning and practice. Based on his experience at Microsoft, James Larus claimed that a root cause of software bugs is “missing, outdated, and incorrect knowledge”, which easily results from poorly organized processes. Conversely, the practical adoption of formal verification is very often slowed down by unintended process constraints; progress in this sense doesn’t necessarily require more powerful techniques, but mainly integrating known techniques in a way that is not disruptive of existing practices — except perhaps for unfixable noxious practices whose disruption is desirable anyway.

The third problem I see with the “disaster-oriented” motivation for formal methods is that it seems to suggest that verification research is worthy only to the extent that it’s directly applicable to relevant software development issues and leads to concrete gains, for example averting failures. This encourages a view of science as a mere provider of cost-effective technology. To be sure, verification technology has had enough success stories that even long-standing skeptics must concede it can be useful in practice. (If you find the previous sentence controversial, I have blog posts planned to convince you!) But practical applicability need not be — should not be — the only goal of researchers. There must be room for speculative inquiries, also because long-term progress invariably and crucially depends on them. Think, for example, of how the spectacular progress of SAT solvers has fueled advances in automatic verification in the last decade or so. The performance of SAT solvers hasn’t mushroomed overnight; it is the result of decades of research on a topic that seemed, at the time of its inception, quite removed from practicality. As Ed Clarke often recalls when presenting SAT-based model checking, if he had suggested using an NP-complete problem such as SAT as the basis for a practical verification technique at the time he was working on his Ph.D. thesis, his colleagues would have dismissed the suggestion as coming from someone clearly clueless about the well-understood notion of intractable problems.

In spite of not always being pitched and motivated in the most convincing way, formal verification remains an exciting, quickly growing research topic that has made spectacular progress and has been winning over, in one form or another, more and more serious practitioners. One example among several: John Carmack’s recent opinion that

The most important thing I have done as a programmer in recent years is to aggressively pursue static code analysis. Even more valuable than the hundreds of serious bugs I have prevented with it is the change in mindset about the way I view software reliability and code quality.

This is quite an encouraging endorsement, even if you work on verification techniques other than static analysis. Or perhaps all the “motivation” that verification research needs comes from Leibniz. In the next formal methods course I teach, I’ll try to open with his famous quote instead of discussing rockets and radiation machines:

The only way to rectify our reasonings is to make them as tangible as those of the mathematicians, so that we can find our error at a glance, and when there are disputes among persons, we can simply say: let us calculate without further ado to see who is right.

How do you determine, empirically, if a theory is correct?

A standard approach uses the toolset of statistical hypothesis testing, whose modern formulation was introduced by influential figures such as Fisher and Pearson around a hundred years ago. The first step consists in expressing the absence of the effect we’re investigating empirically as a null hypothesis. Then, mathematical machinery exists to assess how likely the observed sample (the empirical data) is under the assumption that the null hypothesis holds. This likelihood is normally given as a p-value — the probability of observing data at least as extreme as the actual observation, assuming the null hypothesis holds. A small p-value (say, p < 0.01) means that the observation would be very unlikely to occur under the null hypothesis. This is compelling evidence against the hypothesis, which can then be rejected with confidence: the evidence suggests the effect is real.

To make the discussion concrete, let me introduce an example taken from my own research. Martin Nordio conceived the idea of assessing the impact of using agile development processes (such as Scrum) on the success of software projects carried out by distributed teams. With Christian Estler, he set up a broad data collection effort that gathered information about more than sixty software projects worldwide. The overall goal of the study was to collect empirical evidence that following agile processes, as opposed to more traditional development processes, does make a difference. To this end, the null hypothesis was roughly:

There is no difference in X for projects developed following agile processes compared to projects developed using other processes.

We instantiated X for the several aspects of project outcome gathered in the data collection, including “overall success”, team “motivation” and “amount of communication”, and “importance” for customers. This gave us not one but six null hypotheses (extended to sixteen in the follow-up version of the work) to be rejected by the data.
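To give a flavor of what testing one of these null hypotheses looks like in practice, here is a minimal sketch in Python. The data and variable names are made up for illustration, and the Mann-Whitney U test is just one reasonable choice for comparing ordinal outcome ratings between two groups of projects; it is not necessarily the exact analysis we ran.

```python
# Minimal sketch of testing a "no difference" null hypothesis.
# The ratings below are hypothetical, not data from the actual study.
from scipy.stats import mannwhitneyu

# Hypothetical "overall success" ratings (say, on a 1-5 scale), one per project.
agile_success = [4, 5, 3, 4, 4, 5, 3, 4, 2, 5]
structured_success = [4, 4, 3, 5, 4, 3, 4, 5, 3, 4]

# Null hypothesis: no difference in overall success between the two groups.
stat, p_value = mannwhitneyu(agile_success, structured_success,
                             alternative="two-sided")
print(f"U = {stat}, p = {p_value:.3f}")
# A small p-value would license rejecting the null hypothesis;
# a large one only says the data gives no grounds to reject it.
```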

Much to our dismay, when we sat down to analyze the data we found no evidence whatsoever to disprove any of the hypotheses. The p-values were between 0.25 and 0.97 — even greater than 0.75 for four of the hypotheses! Was the collected data any good?

More generally, what can one conclude from lack of evidence against the null hypothesis, that is, from large p-values? In the orthodox Fisherian interpretation: nothing, except that the available data is inconclusive. In Fisher’s own words [Fisher, 1935, emphasis added]:

the null hypothesis is never proved or established […] every experiment […] exist[s] only in order to give the facts a chance of disproving the null hypothesis

Such an approach, where facts are assessed only by disproving hypotheses, resonates with Popper’s notion of falsifiability, according to which only theories that can be conclusively rejected by experiments count as scientific. The closeness of the dates when Fisher’s and Popper’s works became popular hints that the similarity may not be entirely coincidental.

And yet, the reality of science is a bit more nuanced than that. Regarding falsifiability, it is generally false (ahem 🙂 ) that empirical falsifiability is a sine qua non for a theory to be scientific. As Sean Carroll nicely explains, the most theoretical branches of physics routinely deal with models that are not testable (at least as of yet) but nonetheless fit squarely into the practice of science as structured interpretation of data. Science is as scientist does.

But even for theories that are well within the canon of testability and falsifiability, there are situations where the lack of evidence against a hypothesis is indicative of the hypothesis being valid. To find these, we should look beyond individual experiments, to the bigger picture of accumulating evidence. As experiments searching for an effect are improved and repeated on larger samples, if the effect is genuinely present it should manifest itself more strongly as the experimental methodology sharpens. In contrast, if significance keeps waning, it may well be a sign that the effect is absent. In such contexts, persistent lack of significance may be even more informative than its occasional presence. Feynman made similar observations about what he called “cargo cult science” [Feynman, 1985], where effects becoming smaller as experimental techniques improve is a sure sign of the effects being immaterial.
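To make this point concrete, here is a small, self-contained simulation; it is not taken from any of the studies discussed here, and the effect and sample sizes are made up. It shows the expected pattern: when an effect is real, p-values keep shrinking as samples grow; when there is no effect, they stay spread out across the whole range no matter how much data accumulates.

```python
# Illustrative simulation: how p-values behave as sample size grows,
# with and without a genuine underlying effect. All numbers are made up.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

def median_p(effect, n, trials=500):
    """Median p-value of a two-sample t-test over repeated experiments."""
    ps = []
    for _ in range(trials):
        a = rng.normal(0.0, 1.0, n)     # control group
        b = rng.normal(effect, 1.0, n)  # treatment group, shifted by 'effect'
        ps.append(ttest_ind(a, b).pvalue)
    return np.median(ps)

for n in (20, 80, 320):
    print(f"n={n:3d}  real effect: p={median_p(0.3, n):.3f}"
          f"   no effect: p={median_p(0.0, n):.3f}")
# With a genuine effect, the median p-value drops steadily as n grows;
# with no effect, it hovers around 0.5 regardless of sample size.
```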

We drew conclusions along these lines in the study of agile processes. Ours was not the first empirical analysis of the effect of agile practices. Overall, the other studies provided mixed, sometimes conflicting, evidence in favor of agile processes. Our study stood out in terms of sample size (most other studies target only a handful of projects) and variety of tested hypotheses. In this context, the fact that the data completely failed to disprove any of the hypotheses was telling us more than just “try again”. It suggested that agile processes have no definite absolute edge over traditional processes; they can, and often do, work very well, but so can traditional processes if applied competently.

These results, presented at ICGSE 2012 in Brazil and later in an extended article in Empirical Software Engineering, stirred some controversy — Martin Nordio’s presentation, where he posited that Maradona was a better footballer than Pelé to a largely Brazilian audience, may have contributed some to the stirring. Enthusiasts of agile processes disputed our conclusions, reporting direct experiences with successful projects that followed agile processes. Our study should not, however, be interpreted as a dismissal of agile processes. Instead, it suggests that agile and other processes can be equally effective for distributed software development. But neither is likely to be a silver bullet whose mere application, independent of the characteristics and skills of the teams, guarantees success. Overall, the paper was generally well received and went on to win the best paper award at ICGSE 2012.

There are two bottom lines to this post’s story. First: ironically, statistics seems to suggest that Martin might have been wrong about Maradona. Second: when your p-values are too large don’t throw them away just yet! They may well be indirectly contributing to sustaining scientific evidence.

References

  1. Ronald A. Fisher: The design of experiments, 1935. Most recent edition: 1971, Macmillan Pub Co.
  2. Richard Feynman, Ralph Leighton, contributor, Edward Hutchings, editor: Surely You’re Joking, Mr. Feynman!: Adventures of a Curious Character, 1985. The chapter “Cargo cult science” is available online.