Bug counting « A little specification can go a long way

It was the best of times, it was the worst of times, it was the age of fastness, it was the age of sluggishness, it was the epoch of abstraction, it was the epoch of detail, it was the season of generality, it was the season of specificity, it was the spring of right, it was the winter of wrong, we had improving performance before us, we had degrading performance before us, we were all going direct to where Moore predicted, we were all going direct the other way — in short, the period was so far like the present period, that some of its most influential experts insisted on its being received, for good or for evil, in the superlative degree of comparison only.

Sometimes I have the impression we’re experiencing a collective delusion regarding how we’re dealing with the end of Moore’s law — the prediction that the density of transistors in integrated circuits grows exponentially over time.

The growth predicted by Moore’s law has gone on for decades, providing a veritable, Lucullan free lunch to the entire information technology industry. But now this bonanza is running out: the density growth is slowing down, and improvements in miniaturization do not directly translate into faster clock rates.

That exponential growth is not sustainable in the long term should come as no surprise to any trained engineer. In Moore’s words: The nature of exponentials is that you push them out and eventually disaster happens. Or, using Linus Torvald’s unmistakably caustic style (talking about a different exponential growth): Unending exponential growth? What drugs are those people on? I mean, really.

Rather than coming to terms with the harsh reality — the party is over — we prefer to sugar the pill and tell a less dramatic story. Sure, the clock rate of microprocessors has not increased significantly in a while, but the progress of electronics technology is still spectacular, and is giving us chips with more and more computational units crammed onto the same circuit. So, the updated story goes, the new challenge is exploiting the increasingly parallel hardware by writing increasingly concurrent software. Of course, concurrent software is notoriously hard to get right, but software technology is up to the challenge. More of Moore through concurrency.

But can concurrent programming really let us keep on riding the exponential speedup wave? There’s no question that improving the state of concurrent programming is a fascinating research challenge, both intellectually and practically. But I’m afraid the end of Moore’s law is more disruptive than we are willing to admit. And its long-term consequences cannot be countered simply by better programming.

The elephant in the room is that we know of only a handful of algorithms and programs that can be massively parallelized. In most cases, there is a significant fraction of the computation that is intrinsically sequential, which drastically limits, per Amdahl’s law, the speedup that can be obtained by adding computing cores. You can have the best concurrency model in the world, a flawless implementation, and as many cores as you like, but still be nowhere near anything like exponential speedup (in fact, not even linear speedup!).

Here’s what some Don Knuth said about it in a 2008 interview that should be quoted more often.

To me, it looks more or less like the hardware designers have run out of ideas, and that they’re trying to pass the blame for the future demise of Moore’s Law to the software writers by giving us machines that work faster only on a few key benchmarks! […]

Let me put it this way: During the past 50 years, I’ve written well over a thousand programs, many of which have substantial size. I can’t think of even five of those programs that would have been enhanced noticeably by parallelism or multithreading. […]

[Hardware vendors] think a magic bullet will come along to make multicores speed up my kind of work; I think it’s a pipe dream. […]

From the opposite point of view, I do grant that web browsing probably will get better with multicores. I’ve been talking about my technical work, however, not recreation.

I’m inclined to give him the benefit of the doubt and assume he’s right, as I have the impression the guy knows a thing or two about algorithms and programming.

I look forward, by all means, to the challenges of combining abstraction and efficiency in concurrency research. But perhaps we should not be selling concurrent programming as a potential “fix” to the end of Moore’s law. While every exponential growth eventually stops (I mean, really), we can have the more down-to-earth goal of making the most out of the new exotic hardware architectures that are being developed. May we live in interesting times.

These are good times for formal verification. Amazon Web Services is using formal specifications and model checking to support rigorous design at a very large scale. Facebook, of all companies, is getting serious about static analysis, with Monoidics’s separation logic gurus and Francesco Logozzo of Clousot fame among its recent hires. John Carmack, as a programmer, is aggressively pursuing static analysis. Only thanks to formal verification has a remote but pernicious bug in Java’s standard library sort been discovered. It’s clear that — slowly, one step at a time — the offerings of formal methods are becoming practically helpful.

Still, the transition from pure research to practical applications requires time, in formal verification as in other areas. Among the intermediate steps that have to be covered there’s dealing with instances that are realistic: not necessarily full-blown industrial-strength, but not discounting any features that are expected of a fully usable implementation. Realistic software also moves past the challenges of benchmarks, which typically focus on one or few complicated aspects while eliding features that are considered commonplace and hence not interestingly problematic.

Nadia Polikarpova‘s presentation at the Formal Methods Symposium in Oslo a few weeks ago was precisely about dealing with realistic software in verification: our paper describes the verification of EiffelBase2, a realistic general-purpose data-structure library, against its complete functional specification.

Before you roll your eyes thinking, “Seriously?! Verification of data structures in 2015? Wasn’t the problem solved in the 1970s? Or maybe earlier! And they gave them a best paper award for this stuff? Unbelievable!”, let me explain what’s new and, hopefully, interesting about this work. It is true that the amount of work on verifying data structures is overwhelming. Not only because the effort poses interesting challenges but also because data structures are a domain that is naturally amenable to formalization and analysis: everybody agrees on the expected behavior of, say, a linked stack, and it can be formalized succinctly. In other domains (reactive software for example) even specification is a challenge in terms of both “what” and “how”. Still and all, most data-structure verification work focuses on demonstrating certain aspects of a technique and is applicable to implementations written in a certain way, possibly with some restrictions, and not in others. In contrast, Nadia implemented and specified EiffelBase2 before verification (even if, naturally, some later adjustment were necessary to complete the verification effort) with the goal of providing a usable realistic container collection without compromising on any aspect (object-oriented design, API usability, implementation performance, …) that is to be expected in practice. In fact, EiffelBase2 now ships with the EiffelStudio compiler as an alternative to the old EiffelBase library. (It did not fully replace it only for backward compatibility reasons.)

Here are a few examples of EiffelBase2 features that you would expect in a usable data-structure library (think java.util or .NET Collections, or C++ STL), but which are often simplified or omitted altogether in verification research:

External iterators, with the possibility of statically checking that multiple iterators operating on the same structure are safe, that is modify it consistently.
Hash tables with arbitrary user-defined objects as keys and object-based key comparison, with the possibility of statically checking that modifying objects used as keys does not affect table consistency.
Rich, convenient APIs offering operations for searching, replacing, inserting, and removing elements, merging containers, converting between different containers, object-based comparison (i.e., comparing objects according to their content rather than by references pointing to it), and copy constructors.
Abstract specifications of data structures, consistent with the inheritance hierarchy, and provably complete for functional correctness.
Implementations that do not trade performance off for verifiability: array-based lists use ring buffers, hash tables automatically resize their bucket arrays to avoid performance degradation, and so on.

If some of these features do not sound impressive it is because we are blasé, used as we are to having them in any realistic implementations, but also happy to ignore or abstract them away when performing verification because they do not seem to introduce complex, as opposed to complicated, novel challenges. However the devil’s in the details, and some sources of complexity are emergent from the combination of seemingly plain and innocuous features. Even verifying something as workaday as copy constructors brought its own set of challenges, in ensuring that the copy is consistent with the abstract class model and with its relatives by means of inheritance.

Part of the challenge of verifying EiffelBase2 was achieving all this with a good level of automation and a reasonable amount of annotations (besides specifications) — in other words, with an effort proportional to the magnitude of the challenge. To this end, we developed a verification methodology and supporting tool (AutoProof, which I presented in a previous post) with the goals of providing flexibility and proportionality: verifying simple stuff should be straightforward, verifying complex stuff (such as EiffelBase2) should still be attainable but require a bigger annotation overhead. While I personally think that being able to interact directly with the prover for the most challenging proofs may support even greater levels of flexibility, the auto-active interaction mode of AutoProof may have hit on a sweet spot in the landscape of annotations vs. automation — at least for full functional correctness in the domain of data structure libraries.

You’re welcome to add comments about features of realistic software that are often overlooked in practical verification, as well as about other success stories of formal methods that narrowed the gap between lab and field practice.

Breaking news! A recent study found that Barack Obama is, with high probability, not an American citizen! The study — destined to revive the controversy that emerged during the President’s first presidential campaign — is based on new evidence and a simple analysis using widely accepted statistical inference tools. I’ll leave it to the political pundits to analyze the grave effects that this shocking finding surely will have on the upcoming presidential campaign. This post focuses on the elegant technical machinery used to reach the unsettling conclusion.

The crux of the analysis applies, in a statistical setting, modus tollens, a basic inference rule of logic. Given two facts $X$ and $Y$ such that if $X$ is true then $Y$ is true, modus tollens derives the falsehood of $X$ from the falsehood of $Y$ . In formal notation:

$\begin{matrix} X \Longrightarrow Y, \quad \neg Y \\ \hline \neg X \end{matrix}$

For example, take $X$ to be “It rains” and $Y$ to be “I have an umbrella with me”. From the fact that I am carrying no umbrella, by applying modus tollens, you can conclude that it’s not raining.

The next step introduces a simple generalization of modus tollens to the case where facts are true with some probability: if $X$ is true then $Y$ is true with high probability. Then, when $Y$ happens to be false, we conclude that $X$ is unlikely to be true. If I have an umbrella with me 99% of the times when it rains, there’s only a 1% chance that it rains if I have no umbrella with me.

All this is plain and simple, but it has surprising consequence when applied to the presidential case. A randomly sampled American citizen is quite unlikely to be the President; the odds are just 1 in 321-something millions. So we have that if “a person $p$ is American” (or $X$ ) is true then “ $p$ is not the President” (or $Y$ ) is true with high probability. But Mr. Barack Obama happens to be the President, so he’s overwhelmingly unlikely to be American according to probabilistic modus tollens!

(The ironic part of the post ends here.)

Sure you’re thinking that this was a poor attempt at a joke. I would agree, were it not the case that the very same unsound inference rule is being applied willy-nilly in countless scientific papers in the form of statistical hypothesis testing. The basic statistical machinery, which I’ve discussed in a previous post, tells us that, under a null hypothesis $H_0$ , a certain data $D$ is unlikely to happen. In other words: if “the null hypothesis $H_0$ ” is true then “the data is different than $D$ ” is true with high probability. So far so good. But then the way this fact is used in practice is the following: if we observe the unlikely $D$ in our experiments, we conclude that the null hypothesis is unlikely, and hence we reject it — unsoundly! How’s that for a joke?

Having seen for ourselves that modus tollens does not generalize to probabilistic inference, what is a correct inference from data to hypothesis testing? We can use Bayes’s theorem and phrase it in terms of conditional probabilities. $P(X \mid Y)$ is the probability that $X$ occurs given that $Y$ has occurred. Then $P(H_0 \mid D)$ — the probability that the null hypothesis $H_0$ is true given that we observed data $D$ — is computed as $P(D \mid H_0) \cdot P(H_0) / P(D)$ . Even if we know that $D$ is unlikely under the null hypothesis — $P(D \mid H_0)$ is small — we cannot dismiss the null hypothesis with confidence unless we know something about the absolute prior probabilities of $H_0$ and $D$ . To convince ourselves that Bayes’s rule leads to sound inference, we can apply it to the Barack Obama case: $H_0$ is “a person $p$ is American” and $D$ is “ $p$ is the President”. We plug the numbers in and do the simple math to see that $P(H_0 \mid D)$ , the probability that the President is American, is indeed one:

$(1 / A) \cdot (A / W) / (1 / W) = 1$ , where $A$ is the population of the USA and $W$ is the world population. Bayes 1 – birthers 0.

Now you understand the fuss about statistical hypothesis testing that has emerged in numerous experimental sciences. Sadly, this blunder is not merely a possibility; it is quite likely that it has affected the validity of numerous published experimental “findings”. In fact, the inadequacy of statistical hypothesis testing is compounded by other statistical results such as the arbitrariness of a hard-and-fast confidence threshold, the false hypothesis paradox (when studying a rare phenomenon, that is a phenomenon with low base rates, most positive results are false positives), and self-selection (the few research efforts that detect some rare phenomenon will publish, whereas the overwhelming majority of “no effect” studies will not lead to publication). In an era of big data, these behaviors are only becoming more likely to emerge.

The take home message is simple yet important. Statistical hypothesis testing is insufficient, by itself, to derive sound conclusions about empirical observations. It must be complemented by other analysis techniques, such as data visualization, effect sizes, confidence intervals, and Bayesian analysis. Unless you remain convinced that Obama’s not American, Elvis is alive, and the Apollo moon landings were staged. In this case, this blog is not for you — with high probability.

Bug counting

A little specification can go a long way

Not any-Moore

Data-structure verification: down to the nitty-gritty

If Obama’s not American all our stats are safe and sound

Dante’s writing tips