Do you trust nuclear energy? I know, it’s a complex question — not the kind that easily leads to a “yes” or “no” answer. Let me rephrase it into a possibly less controversial form: do you trust nuclear engineers? Now, the odds are turning, and there’s a good chance Joe or Jane Sixpack would answer in the affirmative, without feeling the need to add qualifications to their answer.

Indeed, for all the negative stereotyping about being boring or uncreative, engineers enjoy a solid reputation among the general public as trustworthy and competent professionals. Well, I mean, traditional engineers do. Those who build bridges, buildings, and power plants. Those are the guys you can trust. Ah, if only software development were a solid engineering discipline in the same league as civil, mechanical, and nuclear engineering! We wouldn’t have to deal with lousy apps or insecure systems anymore! We could enjoy the same levels of reliability and safety as, say, nuclear power which, in the immortal words of Homer J. Simpson, “has yet to cause a single proven fatality.”

Seriously, I don’t want to compare apples and oranges, but it is true that software developers may enjoy a generally poorer reputation than other professionals. The latest outcry follows an all too familiar argument: 1. Software is spectacularly unreliable; 2. Traditional engineering delivers highly reliable products; 3. Ergo, software development should adopt engineering practices (or, as the article goes, stop calling itself software engineering).

I will not pick on that particular article, even though it’d be a particularly easy target given its technical shallowness and shaky arguments (how is Dieselgate an indictment of software’s low quality?). Instead I’m interested in the more general issue of what traditional engineering has to teach software engineering. I also want to avoid going down the path of nominalism — which all too often emerges in debates on these kinds of topics — looking for the “essence” of engineering, software development, or the like. That’d be neither useful nor particularly interesting.

First of all, I believe that the reliability gap between the best software systems and the best artifacts produced by traditional engineering is much smaller than common opinion would lead us to believe. In fact, the line between the two is blurry, in that some of the most reliable complex software lies at the core of airplanes, cars, or power plants. If the latter are reliable, it is largely because the former is. As another, historical but still very significant, example, see Feynman’s report on the Challenger disaster: among the various engineering divisions, the one responsible for avionics software is the only one whose practices pass with flying colors: “the computer software checking system and attitude is of the highest quality.” Of course most software development does not come even close to complying with the standards of quality of avionics software; but neither does the engineering of most traditional consumer products with no mission-critical requirements.

Another feature often ascribed to traditional engineering as opposed to software engineering is a rigorous, systematic approach that leaves nothing to chance. A nuclear engineer, for example, knows exactly what goes on in every stage of development and deployment of a nuclear power plant, and knows how to ensure that no accidents occur. Except this is not exactly true. It turns out [Mahaffey, 2015] that the riskiest activity related to harnessing nuclear power is fuel (re)processing, where fissile fuel is produced by chemical means. The major risk is that some of the fuel becomes critical, that is, capable of sustaining a spontaneous chain reaction. Criticality depends not only on the amount of fuel but also on its geometric configuration. It is practically impossible to predict every possible configuration the fuel may take while being processed, and in fact several accidents occurred due to unexpected accumulation of material in what turned out to be criticality-inducing configurations. This does not mean that the whole enterprise is left to chance, only that perfect planning is unattainable and one must deal flexibly with uncertainty. The best practice is to ensure that operators are not only thoroughly trained in the procedures but also fully aware of the general issues that may arise and on the lookout for unpredicted sources of danger. This attitude towards prevention and awareness was first adopted in the fuel processing plants of the Manhattan Project, when a young Feynman (him again) pointed out to his superiors that the excessive secrecy initially imposed, which prevented workers from knowing what exactly they were doing and what risks they might encounter, was counterproductive and foolishly dangerous. Prevention is the best protection, and prevention requires knowledgeable people, not drones.

Which leads us to the other main point: what can software engineering really learn from traditional engineering — and nuclear engineering in particular? It’s not the scientific foundations: informatics already provides rock-solid ones. It’s not so much the institutionalization as a licensed profession: while it may be useful in certain contexts (for example, for freelance software developers), it’d have little or no relevance in others. The attitude that software engineering can learn from nuclear engineering is what to do in the aftermath of an accident.

When something goes wrong in a nuclear reactor and it gets scrammed, it is imperative that the dynamics of the accident be understood down to minute details. This involves understanding the physics of the reaction, any failure of the equipment to behave as intended, how the personnel reacted, and what practices were followed up to the time of the accident that may have altered operating conditions or generally increased risks. Based on an exhaustive post mortem, procedures, technologies, equipment, and training are revised. These activities have top priority, and must be completed to satisfaction before operations can restart. The net effect is that the chances of the same kind of problem occurring twice are minimal, and reliability improves by building on an ever-growing knowledge base. (Bertrand Meyer made similar remarks about aerospace engineering.)

Two final comments. First, I believe practices of this kind are already in place, in some form at least, in the best software development environments. Besides safety-critical software, where we have several well-known case studies, the practice of large-scale software development includes inspections, regression testing, and rigorous analysis. Even Facebook has given up on its “move fast and break things” motto, having realized the importance of stable infrastructure and rigorous analysis. Second, engineering is ultimately all about trade-offs. If you’re developing a silly app, you may just accept that “mediocre” is good enough. And that what you’re doing is not quite “engineering”.

Reference

  1. James A. Mahaffey: Atomic Accidents: A History of Nuclear Meltdowns and Disasters: From the Ozark Mountains to Fukushima. Pegasus, 2015.

It was the best of times, it was the worst of times, it was the age of fastness, it was the age of sluggishness, it was the epoch of abstraction, it was the epoch of detail, it was the season of generality, it was the season of specificity, it was the spring of right, it was the winter of wrong, we had improving performance before us, we had degrading performance before us, we were all going direct to where Moore predicted, we were all going direct the other way — in short, the period was so far like the present period, that some of its most influential experts insisted on its being received, for good or for evil, in the superlative degree of comparison only.

Sometimes I have the impression we’re experiencing a collective delusion regarding how we’re dealing with the end of Moore’s law — the prediction that the density of transistors in integrated circuits grows exponentially over time.
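In its usual quantitative form, and with the commonly cited doubling period of roughly two years (my gloss, not a figure from this post), the prediction reads:

N(t) \;\approx\; N(t_0) \cdot 2^{(t - t_0)/T}, \qquad T \approx 2\ \text{years}

where N(t) is the number of transistors that fit on a chip at time t.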

The growth predicted by Moore’s law has gone on for decades, providing a veritable, Lucullan free lunch to the entire information technology industry. But now this bonanza is running out: the density growth is slowing down, and improvements in miniaturization do not directly translate into faster clock rates.

That exponential growth is not sustainable in the long term should come as no surprise to any trained engineer. In Moore’s own words: “The nature of exponentials is that you push them out and eventually disaster happens.” Or, in Linus Torvalds’s unmistakably caustic style (talking about a different exponential growth): “Unending exponential growth? What drugs are those people on? I mean, really.”

Rather than coming to terms with the harsh reality — the party is over — we prefer to sugar the pill and tell a less dramatic story. Sure, the clock rate of microprocessors has not increased significantly in a while, but the progress of electronics technology is still spectacular, and is giving us chips with more and more computational units crammed onto the same circuit. So, the updated story goes, the new challenge is exploiting the increasingly parallel hardware by writing increasingly concurrent software. Of course, concurrent software is notoriously hard to get right, but software technology is up to the challenge. More of Moore through concurrency.

But can concurrent programming really let us keep on riding the exponential speedup wave? There’s no question that improving the state of concurrent programming is a fascinating research challenge, both intellectually and practically. But I’m afraid the end of Moore’s law is more disruptive than we are willing to admit. And its long-term consequences cannot be countered simply by better programming.

The elephant in the room is that we know of only a handful of algorithms and programs that can be massively parallelized. In most cases, there is a significant fraction of the computation that is intrinsically sequential, which drastically limits, per Amdahl’s law, the speedup that can be obtained by adding computing cores. You can have the best concurrency model in the world, a flawless implementation, and as many cores as you like, but still be nowhere near anything like exponential speedup (in fact, not even linear speedup!).
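To make the limit concrete, suppose a fraction p of a program’s running time parallelizes perfectly over n cores, while the remaining 1 − p is intrinsically sequential. Amdahl’s law then bounds the overall speedup (the figure below is a back-of-the-envelope illustration of mine, not a measurement of any specific program):

\text{speedup}(n) \;=\; \frac{1}{(1 - p) + p/n} \;\leq\; \frac{1}{1 - p}

Even for a generously parallelizable program with p = 0.9, the speedup can never exceed 10, no matter how many cores you add, let alone keep growing exponentially.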

Here’s what some Don Knuth said about it in a 2008 interview that should be quoted more often.

To me, it looks more or less like the hardware designers have run out of ideas, and that they’re trying to pass the blame for the future demise of Moore’s Law to the software writers by giving us machines that work faster only on a few key benchmarks! […]

Let me put it this way: During the past 50 years, I’ve written well over a thousand programs, many of which have substantial size. I can’t think of even five of those programs that would have been enhanced noticeably by parallelism or multithreading. […]

[Hardware vendors] think a magic bullet will come along to make multicores speed up my kind of work; I think it’s a pipe dream. […]

From the opposite point of view, I do grant that web browsing probably will get better with multicores. I’ve been talking about my technical work, however, not recreation.

I’m inclined to give him the benefit of the doubt and assume he’s right, as I have the impression the guy knows a thing or two about algorithms and programming.

I look forward, by all means, to the challenges of combining abstraction and efficiency in concurrency research. But perhaps we should not be selling concurrent programming as a potential “fix” to the end of Moore’s law. Every exponential growth eventually stops (I mean, really); in the meantime, we can pursue the more down-to-earth goal of making the most of the exotic new hardware architectures that are being developed. May we live in interesting times.

“Welcome to the first class of our course on formal verification. I’d like to start with some motivation: why verification is important in the real world, and what the consequences of a lack of verification in software development are.” Lights go off. You hear a countdown recited in French, “Trois, deux, unité, feu!”. On the screen, a rocket blasts off its platform into the sky. Thirty-something seconds into its ascent, it awkwardly topples over and quickly disintegrates. “That was the Ariane 5 rocket’s first launch, which failed due to a software bug. The failure resulted in an estimated loss of 300 million euros. Software errors can even cost human lives, as in the case of the Therac-25 machine for radiation therapy, whose malfunctioning resulted in several patients dying from radiation burns. Formal verification can ensure that software is free from defects before it’s deployed, thus avoiding catastrophic disasters such as those just mentioned. Now here’s the course syllabus…”

OK, I admit to slightly exaggerating the above for dramatic purposes. But, if you ever took a course on formal verification, chances are that it started with some graphic pictures of exploding rockets, unresponsive Mars rovers, or defective medical equipment, followed by stark tallies of the enormous losses, in money or human lives, resulting from the disasters. The bottom line: software quality is generally abysmal, and the systematic, all-out application of formal verification is our only hope of averting similar disasters in the future.

What’s the problem with this angle, to which I may also have been guilty of paying lip service in the past? I think this way of pitching formal verification points to motivations that are not fairly representative of the varied landscape of research and practice in formal methods, and risks alienating software development practitioners, who have different concerns in their everyday work — concerns which conflict with the oversimplified picture painted above.

The first problem I see is that pointing to a handful of major disasters due to software failures may be construed as suggesting that formal verification is justified only for mission-critical applications. What about the myriad software projects that don’t cost hundreds of millions of dollars and whose failure does not result in ruinous losses? Is there a place for formal methods within such projects’ modest budgets? Related to these points is the misguided view that formal verification is an all-or-nothing business: either it enters every step of development or you might as well do things as you’ve always done them — but then disaster is bound to strike. The flip side is that the only failures that matter are the catastrophic ones; everything else is not a big deal. In reality, everyday software developers generally remain genuinely concerned about software quality even when working on non-mission-critical development; but they also have to cope with all sorts of project constraints pulling in different directions.

A better way to frame the usage of formal verification is in terms of a continuum of trade-offs. Each trade-off balances some of the additional costs incurred by applying verification techniques against some gains resulting from higher reliability or confidence in the software’s correctness. This way, each project can pick the trade-off that best fits its requirements and budget. It’s easy to guess that the overwhelming majority of development projects occupy some intermediate position in the trade-off continuum: their failures won’t result in apocalyptic losses, nor are they of no consequence whatsoever. These are the projects that most practicing developers are familiar with and can relate to. To each their own: Paris Métro Line 14 deployed the B method to build correct-by-construction modules through refinement; for my students learning Java and C#, writing a decent collection of unit tests and running a lint-like tool on their projects is good enough.

The second problem with the above pitching of formal methods is that it may suggest they are a silver bullet. You want high reliability? Just use formal methods (but it will cost you)! It doesn’t matter what your development process is, what programming language you’re using, whether your software is interactive, whether you’re deploying on a legacy system, or how much technical debt you’ve accumulated. Start using [your favorite verification technique] and your problems will evaporate!

But reliability, and practically every other desirable quality, is never the result of a single technique, tool, design choice, or individual talent. It can only follow from a sensible, well-organized process: one that is transparent, traceable, technologically up to date, based on a clear understanding of the application domain, and so on. The investigations following the failures mentioned above (Ariane 5, Therac-25, and so on) invariably pointed to communication and process shortcomings as well as, and often more egregious than, technical ones. Correctness cannot be added as an afterthought; it requires planning and practice. Based on his experience at Microsoft, James Larus claimed that a root cause of software bugs is “missing, outdated, and incorrect knowledge”, which easily results from poorly organized processes. Conversely, the practical adoption of formal verification is very often slowed down by unintended process constraints; progress in this sense doesn’t necessarily require more powerful techniques, but mainly integrating known techniques in ways that do not disrupt existing practices — except, perhaps, for unfixably noxious practices whose disruption is desirable anyway.

The third problem I see with the “disaster-oriented” motivation for formal methods is that it seems to suggest that verification research is worthy only to the extent that it’s directly applicable to relevant software development issues and leads to concrete gains, for example averting failures. This encourages a view of science as a mere provider of cost-effective technology. To be sure, verification technology has had enough success stories that even long-standing skeptics must concede it can be useful in practice. (If you find the previous sentence controversial, I have blog posts planned to convince you!) But practical applicability need not be — should not be — the only goal of researchers. There must be room for speculative inquiry, also because long-term progress invariably and crucially depends on it. Think, for example, of how the spectacular progress of SAT solvers has fueled advances in automatic verification over the last decade or so. The performance of SAT solvers didn’t mushroom overnight; it is the result of decades of research on a topic that seemed, at the time of its inception, quite removed from practicality. As Ed Clarke often recalls when presenting SAT-based model checking, had he suggested using an NP-complete problem such as SAT as the basis of a practical verification technique back when he was working on his Ph.D. thesis, his colleagues would have dismissed the suggestion as coming from someone clearly clueless about the well-understood notion of an intractable problem.

In spite of not always being pitched and motivated in the most convincing way, formal verification remains an exciting, quickly growing research topic that has made spectacular progress and has been winning, in one form or another, the favor of more and more serious practitioners. One example among several: John Carmack’s recent opinion that

The most important thing I have done as a programmer in recent years is to aggressively pursue static code analysis. Even more valuable than the hundreds of serious bugs I have prevented with it is the change in mindset about the way I view software reliability and code quality.

This is quite an encouraging endorsement, even if you work on verification techniques other than static analysis. Or perhaps all the “motivation” that verification research needs comes from Leibniz. In the next formal methods course I teach, I’ll try to open with his famous quote instead of discussing rockets and radiation machines:

The only way to rectify our reasonings is to make them as tangible as those of the mathematicians, so that we can find our error at a glance, and when there are disputes among persons, we can simply say: let us calculate without further ado to see who is right.

Perl’s motto “There’s more than one way to do it” applies to verification too. The proliferation of formal methods may seem haphazard to greenhorns, and the claimed differences between variants of the same technique exaggerated. Granted, not every improvement is a groundbreaking breakthrough (well, not most of the time). But the space of solutions to the verification problem is huge and the theoretical limits are formidable; hence the interesting trade-offs are many, each sensible under different constraints. One size does not fit all.

Nadia’s guest post about dimensions in verification covered the big picture. The follow-up point I’d like to make is that there is also significant variability between variants of the same verification technique. Francesco Logozzo’s post about abstract interpretation triggered these thoughts of mine; in fact, I’m going to build on his example to illustrate my point.

Francesco presents the example of a routine that computes the maximum element in an array, using a for loop that scans the array from its first to its last element, to show that his CodeContracts static checker — based on abstract interpretation — is not limited to dealing with “simple” properties such as nonnullness: it is capable of inferring the complete functional postcondition of max, namely (using a more readable syntax):

\forall k\colon 0 \leq k < \texttt{a.Length} \Longrightarrow \texttt{a}[k] \leq \texttt{Result}
\exists k\colon 0 \leq k < \texttt{a.Length} \wedge \texttt{a}[k] = \texttt{Result}

This is remarkable, but let's try to modify the example to see where we can go from here. First, encode max as a function that operates on an attribute a by scanning it from last to first. (The following code is C#.)
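The listing itself did not survive into this copy of the post, so here is a minimal C# sketch of what max_downward plausibly looks like. Only the method name max_downward and the attribute a come from the text; the enclosing class and its constructor are scaffolding of mine.

    public class IntContainer
    {
        private readonly int[] a;

        public IntContainer(int[] elements)
        {
            a = elements;
        }

        // Scans the attribute a from its last element down to its first:
        // same input/output behavior as the original max, which scans upward.
        // According to the discussion below, the CodeContracts checker infers
        // that the result is an element of a and is >= a[a.Length - 1], but
        // misses the universally quantified postcondition.
        public int max_downward()
        {
            int max = a[a.Length - 1];
            for (int i = a.Length - 2; i >= 0; i--)
            {
                if (a[i] > max)
                {
                    max = a[i];
                }
            }
            return max;
        }
    }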

Function max_downward and the original max have the very same input/output behavior, and indeed they implement two variants of the same simple algorithm. Run on max_downward, the CodeContracts checker correctly infers that Result is an element of a (the existentially quantified postcondition above), and that it is greater than or equal to a's last element, but it falls short of the complete functional postcondition (it misses the universally quantified postcondition).

I don't know the details of the CodeContracts checker's implementation of abstract interpretation, and hence I cannot track down precisely the source of this behavior. However, there is no reason to be surprised by this restriction: any sound verification technique edges on intractability; as users, we normally find it easy to push a tool until undecidability rears its ugly head and, as in this case, spills some precision. If anything, we should be surprised whenever a verification technique works flexibly in different conditions!

To continue our example, we agree to give up some automation in exchange for more flexibility. We turn to deductive verification; more precisely, to auto-active verifiers. Rustan Leino and Michał Moskal introduced the term "auto-active" to denote tools that are partly automatic and partly interactive. (By the way, my proposal to call them "inter-matic" has yet to catch on.) They are automatic in that they do not require interactive guidance while they try to verify an input program. But they still interact indirectly with users through annotations, such as pre- and postconditions, invariants, and suggestions such as intermediate assertions, which are an additional burden to write and crucially influence the tools' capabilities and performance.

Even among auto-active verifiers, there is room for considerable variations in terms of flexibility and simplicity. Let's first consider Dafny, a state-of-the-art verifier mainly developed by the SMT whisperer — a.k.a. Rustan Leino. Dafny has a relatively simple input language, with built-in support for specification models such as mathematical sequences, of which we're going to take advantage in our example. Here's a variant of the example that verifies in Dafny: have a glance at the code, and then I'll point out the interesting parts.
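The original listing is also missing from this copy, so the following is a minimal Dafny reconstruction written from the description in the next paragraph. It is a sketch under assumptions: in particular, Valid is phrased directly over the array rather than over its sequence view a[..], so the helper assertion shown differs from the one discussed below, and the exact annotations needed for the proof may vary with the Dafny version.

    class PSList {
      var a: array<int>
      var sorted: bool

      // Representation invariant: if the sorted flag is set,
      // the content of a is sorted in increasing order.
      predicate Valid()
        reads this, a
      {
        sorted ==> forall i, j :: 0 <= i < j < a.Length ==> a[i] <= a[j]
      }

      method max_ps() returns (Result: int)
        requires a.Length > 0
        requires Valid()
        ensures Valid()
        ensures forall k :: 0 <= k < a.Length ==> a[k] <= Result
        ensures exists k :: 0 <= k < a.Length && a[k] == Result
      {
        Result := a[a.Length - 1];
        if sorted {
          // The maximum of an increasingly sorted array is its last element;
          // the sortedness clause of Valid closes the proof of this branch.
          assert forall k :: 0 <= k < a.Length ==> a[k] <= a[a.Length - 1];
        } else {
          var i := 0;
          while i < a.Length
            invariant 0 <= i <= a.Length
            invariant forall k :: 0 <= k < i ==> a[k] <= Result
            invariant exists k :: 0 <= k < a.Length && a[k] == Result
          {
            if a[i] > Result { Result := a[i]; }
            i := i + 1;
          }
        }
      }
    }

Modulo these details, the structure is the one described next: Valid plays the role of the representation invariant, and the proof of the sorted branch rests entirely on its sortedness clause.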

Besides the array a, class PSList ("Possibly Sorted" List) includes a Boolean attribute sorted, whose role is to record whether the content of a is sorted. One can imagine a complete implementation where this flag is set whenever the array is processed by a sorting algorithm or is populated by elements in order. The new variant of maximum, called max_ps, takes advantage of this additional piece of information: if sorted is true, it immediately returns the value a[a.Length - 1], which is initially assigned to Result. Indeed, the maximum of a sequence that is sorted in increasing order is its last element. The correctness of max_ps relies on this fact, which we encode, following Dafny's idiomatic approach to representation invariants, as part of predicate Valid: if sorted is true, then the content a[..] of a is sorted in increasing order. Valid is supposed to be an invariant property, which the methods of class PSList can rely on and have to preserve. In fact, method max_ps lists Valid in its pre- and postcondition, and Dafny uses its clause about sortedness in the proof of the branch if sorted { ... }. (In case you're wondering: the assert there is needed to help Dafny convert between a's content and its sequence representation a[..].)

This example worked smoothly thanks to Dafny's transparent syntax, which explicitly represents all the salient features of the code and its annotations. But suppose we make the example more realistic (and more complex) by adding max_ps to the interface of a full-fledged general-purpose list data structure. In this case, we may prefer an auto-active tool with more options and control knobs, which gives additional flexibility in exchange for a steeper learning curve. AutoProof is a verifier that tries to offer such a trade-off; thus, the last version of our maximum example is in Eiffel — AutoProof's input language.

Class PS_LIST builds its functionalities by extending those of some verified library implementation of a list with all the bells and whistles. In practice, FULL_FLEDGED_LIST could be some list implementation among those in EiffelBase2. A couple of hints should suffice to explain the example's (imperfectly highlighted) syntax: across expressions represent universal (across ... all) and existential (across ... some) quantification over integer intervals x |..| y; sequence is an inherited model attribute that describes the list's content as a mathematical sequence (indexed from 1, unlike the C# and Dafny examples).

We encode the sortedness property as part of the class invariant, in a clause named when_sorted, which appears at the bottom of PS_LIST's text. Remember that we are inheriting all of the functionality of a full-fledged list, including a detailed, complex class invariant to which we add clause when_sorted. Dealing with the kind of full functional specifications that are needed to reason about realistic implementations has some practical consequences: since the invariant amounts to a lot of specification, carrying all of it around in the proof of every routine of the class is likely to bog down the verifier and to render the proof of even trivial properties a challenge. AutoProof, however, has been developed with these situations in mind. Routine max_ps can recall just the one invariant clause it needs, precisely where it is needed in the proof, using the assertion check inv_only ("when_sorted") end in the conditional branch if sorted then ... . The many other consistency properties inherited with the invariant are simply ignored, to the benefit of a straightforward, agile proof.

We've had a glance at two auto-active verifiers of functional correctness to get an idea of how they target different sweet spots in the sweeping landscape of verification techniques and their implementations. Stay tuned for more posts about AutoProof. In the meanwhile, you can check out its project page or try it out in your web browser. AutoProof's main developers, Nadia Polikarpova and Julian Tschannen, welcome feedback about the tool — especially in the form of comments to this post! 🙂

I’m very glad to host a guest post by Nadia Polikarpova, a recent PhD graduate of our group on her way to a post-doctoral position at MIT. Nadia is definitely into bug counting, even though her ultimate ambition is to exterminate them (for the casual reader: we’re talking about software bugs — no animals were harmed in the making of her research). Thanks to this guest post of hers, the blog finally goes back to its core interests!

      — Carlo A.

Dimensions in program verification

Model checking, abstract interpretation, program logics, symbolic execution — if you are new to program verification, the variety of techniques it offers might be intimidating. Each area seems to start from a different set of basic principles and use its own terminology, which makes classification and comparison difficult.

To make matters worse, verification experts are often partisans of their own school of thought: for an example, see a slide from a recent talk by Patrick Cousot. Abstract interpretation devotees may look down on other approaches, such as auto-active verification, that do not easily scale to systems with millions of lines of code. Auto-active verification researchers, like myself, may make fun of the abstract interpretation folks for being stuck with array-out-of-bounds errors, while we can prove full functional correctness (which, in turn, is uninteresting to model-checking experts, who care about temporal properties). The acolytes of the dark art of interactive theorem proving (such as Coq and Isabelle) don’t take any other approach seriously unless it comes with an air-tight proof of soundness. Meanwhile, testing people consider all sound techniques a pure intellectual exercise, since real programmers are not going to put up with false positives…

How do you make sense of all this and learn to appreciate relative merits and fundamental limitations of different techniques? Here is my short guide to program verification, which I worked out after studying the area for nearly six years.

There are three dimensions to a program verification technique: soundness, flexibility, and automation. Soundness (level of assurance) determines how certain we are that a program is correct if verification succeeds. Flexibility (expressiveness) measures the diversity of programs and correctness properties that the approach can handle. Automation accounts for the amount of human effort — beyond writing the program — that is needed to carry out verification. As is often the case, you can pick any two, but you cannot have all three.

The verification cube

For example, abstract interpretation is sound and fully automatic, but the kind of properties and programs it can check effectively is limited by a predefined set of abstract domains encoded in the analyzer. Bounded model checking, on the other hand, is flexible and fairly automatic — users still have to formalize requirements — but in general cannot provide a correctness guarantee for all inputs (note that we can also trade flexibility for soundness by limiting ourselves to analyzing programs with a finite state space). Deductive verification techniques are flexible and sound, but require a fair amount of manual guidance; among deductive tools, interactive proof assistants are pushing the first two dimensions to the extreme (with higher-order specification languages and tiny trusted cores), while auto-active tools are prepared to sacrifice a bit of both to retain some automation. Unbounded (CEGAR-based) model checking is sound, automatic, and inflexible; symbolic execution is flexible, automatic, and unsound.

I’m not aware of any technique that doesn’t lie close to one of those three corners of the “verification cube”. Why is this the case? Intuitively, the reason is simple: any program analysis can only consider a finite number of states, so you get the choice between only analyzing programs with finitely many states (and lose flexibility), considering only a subset of all states (and lose soundness), or over-approximating an infinite set of states by a finite set of abstract states. Abstraction, in turn, can be performed manually (this is what loop invariants are for!) or automatically (like in abstract interpretation); the former kills automation, while the latter is limited to a restricted set of properties, which kills flexibility. Naturally, acing all three dimensions has always been the holy grail of program verification. This is where all the approaches are converging, albeit taking very different routes, with each technique improving along its weak dimension little by little. Static analyzers are enriched with new abstract domains; model checkers can handle ever larger state spaces; auto-active tools compete in economy of required user annotations; proof assistants boast more and more sophisticated automated tactics. They are all working towards the same goal — still far away from the finish line — and they still differ vastly in terms of strengths and weaknesses, application domains, and user experience.

So, next time you hear someone say that their program verification technique is better than all the rest, just remember the cube.

      — Nadia