empirical is the new black

During the spring semester, I supervised three groups of students working on master’s thesis projects that I proposed. Tackling quite different problems, they all did an excellent job, and it was fun and interesting to supervise their work and to follow their results. In this post, I’m giving a brief account of their work now that they are all done. Below, you can also find links to the official thesis reports in Chalmers’s Library. While we’re at it: if you’re a current student looking for a master’s thesis project, make sure to check out the projects I currently offer or drop me a line.

The spec is out there — Extracting contracts from code

Pontus Doverstav and Christoffer Medin developed a tool that extracts partial JML contracts (a kind of formal specification) from Java code. Their approach differs from the most common techniques for specification inference, which ultimately reason about the semantics of code (for example, abstract interpretation techniques abstract the concrete semantics of a program on simpler domains). In contrast, Pontus and Christoffer’s work is syntax-driven, as it is based on a collection of patterns that capture how certain code implicitly suggests simple contracts. For example, a qualified call t.foo() implicitly requires that t != null; this assertion is propagated to the precondition of the method where the call appears.

The whole approach is deliberately kept simple and best effort. For instance, whenever there is a method call, we try to determine its effects based on the callee’s (partial) contract; if no callee contract is available, we just assume a worst-case behavior to preserve soundness. In the worst case, an analyzed method may be marked as “failure” by the tool, denoting that no meaningful specification could be reliably extracted. An advantage of keeping the analysis so simple is that we can apply it extensively and retain any result that is useful. When they applied their tool to thousands of methods in four open-source projects, Christoffer and Pontus found that it could extract more than one JML specification case per method on average. Many extracted contracts are simple, but a few are quite complex — such as one capturing in detail the behavior of a large switch statement covering 256 cases. The thesis also reports on some preliminary evidence that suggests the contracts extracted this way can be complementary to those a programmer would write, thus helping the analysis of programs and other applications driven by specifications.

Comparing programming languages in Google Code Jam

Alexandra Back and Emma Westman analyzed empirically the solutions submitted to Google’s Code Jam programming contest, with the overall goal of comparing features of different programming languages. The main motivation behind targeting Code Jam was finding a collection of programs that all implement the same functionality (whatever is required by each round in GCJ) in different programming languages (contestants can use any programming language of their choice), while being able to control for quality — the intuition being that a solution’s rank in the contest is strongly indicative of its quality in terms of correctness and efficiency.

The study focused on the five programming languages most widely used in the contest — C, C++, C#, Java, and Python — and analyzed over 210’000 solution programs. Since a contestant’s performance is based only on his or her ability to produce the correct output for a given, large input to the program, there is no guarantee that the program that has been submitted is the same as the one that was used to generate the correct solution. This complicated the study, and ultimately required to drop 38% of the solutions from some analyses, as they could not be compiled and executed. Nonetheless, a large number of usable solutions remained that could be fully analyzed, which gives confidence in the study’s results. Among them, Alexandra and Emma found that the highest-ranked contestants use C++ predominantly but by far not exclusively; that Python programs tend to be the most concise ones, and C# programs the most verbose; that C and C++ dominate the comparison of running time performance, even though the other languages do not fall that far behind; that C and C++ also have the smallest memory footprint, while Java has the largest (probably due to how the JVM manages memory). Check out the thesis for many more details and results.

Incremental deductive verification for a subset of the Boogie language

Leo Anttila and Mattias Åkesson developed a tool that verifies Boogie programs incrementally as they are modified in each new iteration. Boogie is a popular intermediate verification language, combining a simple imperative language and an expressive program logic. Boogie is also the name of a tool that verifies Boogie programs. Whenever a procedure changes implementation or specification, Boogie verifies it from scratch — like most other verifiers; in contrast, Leo and Mattias’s tool is able to reuse the results of previous verification runs. Suppose, for example, we have verified a procedure foo; later, we introduce a small change to foo that is still compatible with the procedure’s specification. The incremental verifier is aware of what has changed, and only verifies that, intuitively, the difference between the current version of foo and the previous one does not affect correctness.

The details of how incremental verification works are quite technical, and rely on a combination of attribute grammars and incremental parsing. My experience working with Boogie (or program verifiers that use it as back-end) suggests that such an incrementality matches well the way so-called auto-active verification tools are used in practice — namely in a stream of small changes to the source code each immediately followed by a call to the verifier, whose responsiveness is of the essence to support a smooth user experience. The thesis describes experiments simulating this mode of interaction, showing that incrementality can often reduce the running time of the verifier by 30% or more.

References

Christoffer Medin and Pontus Doverstav: The spec is out there — Extracting contracts from code, Master’s thesis in computer science. Department of Computer Science and Engineering, Chalmers University of Technology, Sweden, June 2017.
Alexandra Back and Emma Westman: Comparing programming languages in Google Code Jam, Master’s thesis in computer science, algorithms, languages and logic. Department of Computer Science and Engineering, Chalmers University of Technology, Sweden, June 2017.
Leo Anttila and Mattias Åkesson: Incremental deductive verification for a subset of the Boogie language, Master’s thesis in computer science, algorithms, languages and logic. Department of Computer Science and Engineering, Chalmers University of Technology, Sweden, June 2017.

Breaking news! A recent study found that Barack Obama is, with high probability, not an American citizen! The study — destined to revive the controversy that emerged during the President’s first presidential campaign — is based on new evidence and a simple analysis using widely accepted statistical inference tools. I’ll leave it to the political pundits to analyze the grave effects that this shocking finding surely will have on the upcoming presidential campaign. This post focuses on the elegant technical machinery used to reach the unsettling conclusion.

The crux of the analysis applies, in a statistical setting, modus tollens, a basic inference rule of logic. Given two facts $X$ and $Y$ such that if $X$ is true then $Y$ is true, modus tollens derives the falsehood of $X$ from the falsehood of $Y$ . In formal notation:

$\begin{matrix} X \Longrightarrow Y, \quad \neg Y \\ \hline \neg X \end{matrix}$

For example, take $X$ to be “It rains” and $Y$ to be “I have an umbrella with me”. From the fact that I am carrying no umbrella, by applying modus tollens, you can conclude that it’s not raining.

The next step introduces a simple generalization of modus tollens to the case where facts are true with some probability: if $X$ is true then $Y$ is true with high probability. Then, when $Y$ happens to be false, we conclude that $X$ is unlikely to be true. If I have an umbrella with me 99% of the times when it rains, there’s only a 1% chance that it rains if I have no umbrella with me.

All this is plain and simple, but it has surprising consequence when applied to the presidential case. A randomly sampled American citizen is quite unlikely to be the President; the odds are just 1 in 321-something millions. So we have that if “a person $p$ is American” (or $X$ ) is true then “ $p$ is not the President” (or $Y$ ) is true with high probability. But Mr. Barack Obama happens to be the President, so he’s overwhelmingly unlikely to be American according to probabilistic modus tollens!

(The ironic part of the post ends here.)

Sure you’re thinking that this was a poor attempt at a joke. I would agree, were it not the case that the very same unsound inference rule is being applied willy-nilly in countless scientific papers in the form of statistical hypothesis testing. The basic statistical machinery, which I’ve discussed in a previous post, tells us that, under a null hypothesis $H_0$ , a certain data $D$ is unlikely to happen. In other words: if “the null hypothesis $H_0$ ” is true then “the data is different than $D$ ” is true with high probability. So far so good. But then the way this fact is used in practice is the following: if we observe the unlikely $D$ in our experiments, we conclude that the null hypothesis is unlikely, and hence we reject it — unsoundly! How’s that for a joke?

Having seen for ourselves that modus tollens does not generalize to probabilistic inference, what is a correct inference from data to hypothesis testing? We can use Bayes’s theorem and phrase it in terms of conditional probabilities. $P(X \mid Y)$ is the probability that $X$ occurs given that $Y$ has occurred. Then $P(H_0 \mid D)$ — the probability that the null hypothesis $H_0$ is true given that we observed data $D$ — is computed as $P(D \mid H_0) \cdot P(H_0) / P(D)$ . Even if we know that $D$ is unlikely under the null hypothesis — $P(D \mid H_0)$ is small — we cannot dismiss the null hypothesis with confidence unless we know something about the absolute prior probabilities of $H_0$ and $D$ . To convince ourselves that Bayes’s rule leads to sound inference, we can apply it to the Barack Obama case: $H_0$ is “a person $p$ is American” and $D$ is “ $p$ is the President”. We plug the numbers in and do the simple math to see that $P(H_0 \mid D)$ , the probability that the President is American, is indeed one:

$(1 / A) \cdot (A / W) / (1 / W) = 1$ , where $A$ is the population of the USA and $W$ is the world population. Bayes 1 – birthers 0.

Now you understand the fuss about statistical hypothesis testing that has emerged in numerous experimental sciences. Sadly, this blunder is not merely a possibility; it is quite likely that it has affected the validity of numerous published experimental “findings”. In fact, the inadequacy of statistical hypothesis testing is compounded by other statistical results such as the arbitrariness of a hard-and-fast confidence threshold, the false hypothesis paradox (when studying a rare phenomenon, that is a phenomenon with low base rates, most positive results are false positives), and self-selection (the few research efforts that detect some rare phenomenon will publish, whereas the overwhelming majority of “no effect” studies will not lead to publication). In an era of big data, these behaviors are only becoming more likely to emerge.

The take home message is simple yet important. Statistical hypothesis testing is insufficient, by itself, to derive sound conclusions about empirical observations. It must be complemented by other analysis techniques, such as data visualization, effect sizes, confidence intervals, and Bayesian analysis. Unless you remain convinced that Obama’s not American, Elvis is alive, and the Apollo moon landings were staged. In this case, this blog is not for you — with high probability.

An empirical study comparing LaTeX to Word for preparing scientific documents appeared on PLOS ONE in mid December, just in time to ignite an intense online discussion that took place over the holiday season. Among the many comments that I read, one tweet by Michele Lanza made me realize that the study was more general than I originally thought:

The Word vs. LaTeX efficiency argument is in line with microwave food vs. actual cooking

What a terrific suggestion for related research! Back from the winter break, I decided to adapt the design of the LaTeX vs. Word study in order to pit microwave food against proper cooking. The results are, I believe, quite conclusive; and my recommendations based on them compelling. While I submitted a detailed write-up for publication, I want to give you a summary of the essential findings in the rest of this post. In my empirical research, I stuck to the original study’s design as closely as possible since it seemed as much apropos for comparing cooking methods as it was for comparing document preparation systems. In the write-up that follows you will recognize passages that mirror closely the structure and even the words of the original LaTeX vs. Word paper, since those words speak for themselves — and imitation is the sincerest form of flattery.

The study

This empirical study compares microwave usage to full-fledged cooking for food preparation. The experimental methodology was straightforward. We gathered 40 volunteers who use to cook their dinners — some using a microwave to defrost preprocessed food, some cooking from raw ingredients following a recipe. All participants used their own kitchen to run the experiments.

We set up three different sample meals: (1) a TV dinner with two food compartments; (2) a different brand of TV dinner with four food compartments; and (3) a dish consisting of 100 grams of spaghetti aglio e olio. Each participant had 30 minutes to cook each meal using his or her chosen food preparation technique, process, and equipment. Who chose full-fledged cooking was given access to a repository of raw ingredients; and who chose microwaving was given a supply of common TV dinners of different brands. The performance of each participant was measured for each sample meal by three variables: (1) the number of visual differences (food layout, appearance, color) between the cooked meal and the sample; (2) the number of differences in flavor between the cooked meal and the sample; and (3) the amount of hot edible mass produced within 30 minutes. Each participant also completed a questionnaire where they self-evaluated their performance.

The experimental results are unequivocal. Microwave users outperformed traditional cooks on most measures ( $\text{[math]}$ p < 0.05[/latex], often even [latex]p < 0.01[/latex]), with the only exception of the spaghetti dish. Even the most expert cooks were unable to reproduce, from raw ingredients, TV dinners that look and taste like the sample TV dinners, in contrast to the novice microwave users who managed to effortlessly heat to near perfection large amounts of prepackaged food.

These results suggest that full-fledged cooking should be adopted only to prepare complex dishes mainly consisting of oil and garlic. Experienced cooks may argue that the overall quality and healthiness of properly cooked food provides for a meal experience that is considerably more enjoyable than ingurgitating a tasteless, fat-laden, appalling microwave dinner. Although this argument may be true, the differences with the recently introduced versions of TV dinners may be a tad less spectacularly obvious than they were in the past. Thus, the results show that no reasons exist to use traditional means of cooking, except possibly for dishes that contain complex combinations of olive oil and pasta.

An unexpected result of the study was that, according to the questionnaire’s answers, full-fledged cooks are highly satisfied with their experience. Despite incurring reduced usability and productivity, they assessed their work as less tiresome, less frustrating, and more enjoyable than microwave users. From a psychological perspective, the most reasonable explanation is that the cooks are delusional lunatics who are unwilling to reconsider their manifestly incorrect beliefs about their cooking ability in light of their actual poor results in faithfully reproducing low-grade industrial food.

The study’s results also have implications in terms of costs of food preparation and consumption by the public. Individuals have a responsibility to act economically and efficiently, especially in cases in which their occupation is publicly funded. No reliable data is available about how many publicly-employed workers cook their own meals, and correspondingly it is unclear the amount of taxpayer’s money that is spent worldwide by individuals who stubbornly insist on cooking food from raw ingredients over sticking to a more efficient and modern meal preparation system, which would free up their time to advance their respective field of occupation.

I therefore suggest that leading public institutions should consider accepting time-squandering food preparation practices by their employees only if this is justified by the prevalence of dishes involving spaghetti and garlic. In all other cases, said institutions should request employees to eat microwave food (or order take out). We believe that this would be a good policy for two reasons. First, the flavor and appearance of food is secondary to its nutritional values. And, second, preventing people from frittering away scarce culinary resources would save time and money to maximize the benefit of work and development for both individual institutions and the public.

P.S.: Some readers suggested two additional aspects in which microwave cooking is superior. First, adjusting the heating power is much easier with a microwave (where pushing a button immediately interrupts the flow of radiation) than with a traditional stove (where the stove’s surface may remain hot for minutes even after power is turned off). And, second, crispy food can be properly cooked using the recently introduced hot air circulation system available in several high-end microwave ovens.

This summer, Sebastian Nanz and I have finally figured out what the best programming language is. The answer is…

Of course you immediately understood that the incipit is a joke. When it comes to complex feature-laden artifacts like general-purpose programming languages there is no such thing as the best tool for the job. In the reality of software development, different programmers with different skills, different attitudes, and different mindsets solve different problems following different methods, different practices, and different processes in different environments, using different tools and different programming languages. As a result, each programming language design strives to find trade-offs that are convenient to someone in the motley crowd of programmers.

Still, the trade-offs of different languages should be demonstrable by impacting measurable features of programs written in those languages. In this recent work [Nanz and Furia, 2014], we have tried to contribute empirical evidence to better our understanding of how programming languages can be used in practice. One of the aspects we were interested in investigating was whether one can find empirical evidence to justify some of the folk knowledge about programming languages, which is very often passed on as a series of ipse dixit that should be self-evident — except that sometimes different authorities have dissenting opinions!

Before summarizing the results of our study, here’s something about the methodology. An important decision was to use Rosetta Code as source of raw data in the form of programs. Rather than hosting full projects — a service provided by other sites such as GitHub and Bitbucket — Rosetta Code focuses on well-defined programming tasks that can be implemented by small programs (the average program in our study is around 30 lines of code); this makes implementations of the same task in different programming languages directly comparable. The revision history and submission practices of Rosetta Code also suggest that programs are often revised by multiple programmers, and hence likely have a good quality on average; and the task list includes many relevant problems that are often part of large real-world projects. This setup helped make sound inter-language comparisons based on proper language usage, thus reducing dispersion and bias in the data. Based on a combination of their popularity in TIOBE and Rosetta Code, we selected 8 languages in four categories: C and Go as procedural languages; C# and Java as object-oriented languages; F# and Haskell as functional languages; and Python and Ruby as object-oriented languages. If your favorite language is not there do not despair: let us know in the comments why you think it deserves to be included; we might consider it for future work.

Let’s have a look at the main results (see the paper for all the details). The biggest surprise is that there are no huge surprises: well-trained programmers and software engineers will recognize several well-known adages about the advantages of certain programming language features over others. To make this apparent, I’ve collected excerpts from classics texts on programming languages that somewhat match our empirical findings.

Conciseness

It is generally understood that practical expressiveness boils down to conciseness:

The main benefit of the use of expressive languages seems to be the ability to abstract from programming patterns with simple statements and to state the purpose of a program in the concisest possible manner.

We have come to believe that the major negative consequence of a lack of expressiveness is the abundance of programming patterns to make up for the missing, non-expressible constructs.

[Felleisen, 1991]

Higher-order features such as list comprehensions, reflection, higher-order functions, and idiomatic support for lists and maps should increase the level of practical expressiveness, and hence conciseness:

Higher-order procedures can serve as powerful abstraction mechanisms, vastly increasing the expressive power of our language. [Pg. 75]

[…] expressive power […] is attained […] by accumulation and filtering [on lists]. [Pg. 81]

Elevate the conceptual level at which we can design our programs [means enhancing] the expressive power of our language. [Pg. 108]

[Abelson and Sussman, 1996]

Such higher-order features are more readily available in functional and scripting languages than imperative languages:

Against Java, we can say that (compared to, say, Python) some parts of it appear over-complex and others deficient.

[Pg. 340 in Raymond, 2003]

We measured conciseness in terms of lines of code, comparing solutions in each language against those in other languages. Our numbers draw a picture that is largely consistent with the above quotations: functional and scripting languages provide significantly more concise code than procedural and object-oriented languages. Their higher-order features increase practical expressiveness, to wit, conciseness. While in principle one can follow a functional style using object-oriented languages, it is idiomatic support that seems to make a tangible difference in practice.

Performance

Performance is another often controversial programming language feature. We tried to contribute to the understanding of performance in practice by distinguishing between two kinds of tests. Much of the controversy when discussing performance may derive from conflating these two kinds of problems, which represent very different conditions.

The first kind of performance comparison targets raw execution speed on large inputs; for example, sorting million-element arrays or compressing tens of megabytes of data. The outcome of our experiments using Rosetta Code tasks on such problems is what most people would expect: C is indisputably the fastest — if it was a race, it’d lap all other languages. A bit more generally, language features cost raw speed, and more features tend to cost more speed. In fact, the only runner-up (still from a distance) is Go, a language that is richer than C — it offers automatic memory management and strong typing — but deliberately renounces other expressive features, such as inheritance and genericity, that have become commonplace in modern high-level programming languages.

Programs that require maximum speed […] are good candidates for C. [Pg. 326]

Python cannot compete with C or C++ on raw execution speed. [Pg. 337]

[Raymond, 2003]

[The] main problem [of automatic memory management], however, is that “useful” processing time is lost when the garbage collector is invoked.

[Pg. 168 in Ghezzi and Jazayeri, 1997]

Most of the time, however, the extreme differences in raw speed that emerge with algorithmically-intensive programs on large inputs do not matter much because such jobs are absent or extremely infrequent in the overwhelming majority of applications, which hardly ever have to deal with number crunching. How many million-element arrays did your web browser have to sort while you were browsing the news? To understand performance differences more commonly occurring in everyday conditions, we identified a second kind of targets for comparison, consisting of well-defined problems on input of moderate size, such as checksum algorithms and string manipulation tasks. The results are quite different when we consider this second kind of everyday problems. Scripting and functional languages tend to emerge as the fastest, even surpassing C. More generally, the absolute differences between languages are smallish, which means that every language is usable, and engineering concerns other than raw speed emerge as more relevant.

Most applications do not actually need better performance than Python offers.

[Pg. 337 in Raymond, 2003]

To sum up, the most significant, and somehow neglected, finding that surfaced from our performance comparisons is this distinction between “raw speed” and “everyday” performance requirements, and the corresponding emphasis on the latter for most workaday programming tasks.

Failure proneness

Counting faults (or, more appropriately for this blog, bugs) is often used to measure the quality or success of software projects. The errors that are of principal interest in that context are those resulting from program behavior that diverges from specification; for example, a banking application that increases your balance when you withdraw money (although this is a bug most of us could live with 🙂 ). In contrast, our comparison of programming languages looked for errors that are independent of a program’s specification and have to do with what we might characterize as well-formedness, such as typing and compatibility errors. This is an admittedly restricted notion of error, but it lends itself to effective detection and analysis. The classic view on the problem of detecting such errors is clear:

[Checking for the presence of well-formedness errors] can be accomplished in different ways that can be classified in two broad categories: static and dynamic. […] In general, if a check can be performed statically, it is preferable to do so instead of delaying the check to run time […].

[Pg. 137 in Ghezzi and Jazayeri, 1997]

To see which languages follow this prescription and tend to perform more checks statically, we counted what fraction of programs in each language that compile correctly terminate without error (exit status zero). The compiled strongly-typed languages (that is, all compiled languages but C which is weakly typed) clearly emerged as those with the fewest runtime failures triggered in our experiments; their compilers do a fairly good job at catching errors at compile time by type checking and other static analyses. In contrast, the interpreted languages triggered runtime failures more frequently; nearly all checks but syntactic ones are done at runtime, when things can go wrong in many different ways.

Go was the least failure prone of the compiled strongly-typed languages. Given that we analyzed a large number of programs written by different contributors, we are reluctant to explain this difference mainly by attributing better programming skills to the authors of Go programs in our sample. Instead, this results refines the picture about what compiled strongly-typed languages can achieve: Go’s type system is more restricted than that of functional or object-oriented languages, which may help achieve type safety by minimizing dark corners where errors may originate.

Our results on failure proneness cannot set the debate about static vs. dynamic checks. This is another issue where a multitude of technical and organizational concern concur to creating several different local optima.

Be fruitful and replicate

No single study, no matter how carefully designed and executed, can claim to provide conclusive evidence about complex issues such as the relative merits and practical impact of programming language features. Scientific inductive knowledge grows slowly, one little piece of evidence at a time. Technological factors further complicate the picture, since they may shift the importance of certain features and render some obsolete (for example, the exponential growth of processor speed in past decades has favored the development of feature-rich graphical user interfaces that were previously impractical). Yet, it is encouraging that we have found significant concordance between our empirical results and others’ point of view. A lot of what you learned from the classics about programming language was right — as long as you picked the right classics!

References

Sebastian Nanz and Carlo A. Furia: A comparative study of programming languages in Rosetta Code. Technical report arXiv.org:1409.0252, September 2014.
Matthias Felleisen: On the expressive power of programming languages. Science of Computer Programming, 17(1-3):35-75, 1991.
Harold Abelson, Gerald Jay Sussman, and Julie Sussman: Structure and interpretation of computer programs. 2nd edition, MIT Press, 1996.
Eric S. Raymond: The art of UNIX programming. Addison-Wesley, 2003. Available online.
Carlo Ghezzi and Mehdi Jazayeri: Programming language concepts. 3rd edition, Wiley & Sons, 1997.

Bug counting

A little specification can go a long way

Category Archives: empirical is the new black

Master’s thesis project showcase

The spec is out there — Extracting contracts from code

Comparing programming languages in Google Code Jam

Incremental deductive verification for a subset of the Boogie language

References

If Obama’s not American all our stats are safe and sound

An efficiency comparison of food preparation systems

The study

When speed doesn’t matter, Python is faster than C

Conciseness

Performance

Failure proneness

Be fruitful and replicate

References