Do you trust nuclear energy? I know, it’s a complex question — not the kind that easily leads to a “yes” or “no” answer. Let me rephrase it into a possibly less controversial form: do you trust nuclear engineers? Now, the odds are turning, and there’s a good chance Joe or Jane Sixpack would answer in the affirmative, without feeling the need to add qualifications to their answer.
Indeed, for all the negative stereotyping about being boring or uncreative, engineers have a generally solid reputation among the general public of being trustworthy and competent professionals. Well, I mean, traditional engineers have. Those who build bridges, buildings, and power plants. Those are the guys you can trust. Ah, if only software development were a solid engineering discipline in the same league with civil, mechanical, and nuclear engineering! We wouldn’t have to deal with lousy apps or insecure systems anymore! We could enjoy the same level of reliability and safety of, say, nuclear power which, in the immortal words of Homer J. Simpson, “has yet to cause a single proven fatality.”
Seriously, I don’t want to compare apples and oranges, but it is true that software developers may enjoy a generally poorer reputation than other professionals. The latest outcry follows an all too familiar argument: 1. Software is spectacularly unreliable; 2. Traditional engineering delivers highly reliable products; 3. Ergo, software development should adopt engineering practices (or, as the article goes, stop calling itself software engineering).
I will not pick on that particular article, even though it’d be a particularly easy target given its technical shallowness and shaky arguments (how is the dieselgate an indictment of software’s low quality?). Instead I’m interested in the more general issue of what traditional engineering has to teach to software engineering. I also want to avoid going down the line of nominalism — which all too often emerges in debating these kinds of topics — looking for the “essence” of engineering, software development, or the like. That’d be neither useful nor particularly interesting.
First of all, I believe that the gap of reliability between the best software systems and the best artifacts produced by traditional engineering is much smaller than the common opinion would lead us to believe. In fact, the line between the two is blurry, in that some of the most reliable complex software lies at the core of airplanes, cars, or control plants. If the latter are reliable, it is largely because the former is. As another, historical but still very significant, example see Feynman’s report on the Challenger disaster: among the various engineering divisions, the one responsible for avionics software is the only one whose practices pass with flying colors: “the computer software checking system and attitude is of the highest quality.” Of course most software development does not come even close to complying with the standards of quality of avionics software; but neither is the engineering of the majority of traditional consumer products with no mission-critical requirements.
Another feature traditionally ascribed to traditional engineering as opposed to software engineering is a rigorous, systematic approach that does not leave anything to chance. A nuclear engineer, for example, knows exactly what goes on in every stage of development and deployment of a nuclear power plant, and knows how to ensure that no accidents occur. Except this is not exactly true. It turns out that [Mahaffey, 2015] the riskiest activity related to harnessing nuclear power is fuel (re)processing, where fissile fuel is produced by chemical means. The major risk is that some of the fuel becomes critical, that is capable of sustaining a spontaneous chain reaction. Criticality does not only depend on the amount of fuel but also on its geometric configuration. It is practically impossible to predict every possible configuration the fuel may take while being processed, and in fact several accidents occurred due to unexpected accumulation of material in what turned out to be criticality-inducing configurations. This does not mean that the whole enterprise is left to chance, only that perfect planning is unattainable and one must deal flexibly with uncertainties. The best practices are to ensure that the operators are thoroughly trained not only in the procedures but are also fully aware of the general issues that may arise and on the lookout for unpredicted sources of danger. This attitude towards prevention and awareness was first adopted in the fuel processing plants of the Manhattan Project, when a young Feynman (him again) pointed out to his superiors that the excessive secrecy initially imposed, which prevented workers from knowing what exactly they were doing and what risks they could encounter, was counterproductive and foolishly dangerous. Prevention is the best protection, and prevention requires knowledgeable people, not drones.
Which leads us to the other main point: what can software engineering really learn from traditional engineering — and nuclear engineering in particular? It’s not the scientific foundations: informatics provides rock-solid foundations. It’s not so much the institutionalization as a licensed profession: while it may be useful in certain contexts (for example for freelance software developers) it’d have little or no relevance in others. The attitude that software engineering can learn from nuclear engineering is what to in the aftermath of an accident.
When something goes wrong in a nuclear reactor and it gets scrammed, it is imperative that the dynamics of the accident be understood down to minute details. This involves understanding the physics of the reaction, any failure of the equipment against its supposed behavior, how the personnel reacted, what practices were followed up to the time of the accident that may have altered operating conditions or generally increased risks. Based on an exhaustive post mortem, procedures, technologies, equipment, and training are revised. These activities have top priority, and must be completed to satisfaction before operations can restart. The net effect is that the chances of the same kind of problem occurring twice are minimal, and reliability improves by building on an ever growing knowledge base. (Bertrand Meyer made similar remarks about aerospace engineering.)
Two final comments. First, I believe such kind of practices are already in place, in some form at least, in the best software development environments. Besides safety-critical software, where we have several well-known case studies, the practice of large-scale software development includes inspections, regressions, and rigorous analysis. Even Facebook has given up on their “move fast and break things”, and realized the importance of stable infrastructure and rigorous analysis. Second, engineering is ultimately all about trade-offs. If you’re developing a silly app, you may just accept that “mediocre” is good enough. And that what you’re doing is not quite “engineering”.
- James A. Mahaffey: Atomic Accidents: A History of Nuclear Meltdowns and Disasters: From the Ozark Mountains to Fukushima. Pegasus, 2015.